The present disclosure relates generally to computing and/or memory architectures and, more specifically, to robust error detection and correction in computing and/or memory architectures.
Various techniques are known for error detection and correction in computing systems. In data storage applications, error detection and correction codes may be used to improve the reliability of data storage media. For example, some file formats include a checksum, such as CRC32, to detect corruption and truncation and can employ redundancy and/or parity files to recover portions of corrupted data. Additionally, Reed-Solomon codes (or any other type of error correcting code) may be used to correct some errors, and storage media may use CRC codes to detect and Reed-Solomon codes to correct minor errors, such as errors in sector reads when using a hard disk drive, for example. In some applications, solid state memory may provide increased protection against soft errors by employing error correcting codes. Such memory may be used in applications having harsh environmental conditions or applications that have little or no margin for errors in data. For example, in a space environment, radiation effects may require that various electronic designs be capable of high reliability even in the event of radiation effects on the electronic systems.
For example, radiation effects on electronics systems in a space environment may induce one or more types of errors in electronic components. Single event type errors can occur at any point in the mission duration. Such radiation effects include single event upset (SEU), multiple bit upset (MBU), single event functional interrupt (SEFI), and single event transient (SET) errors. SEU, MBU, SEFI, and SET generally require mitigation at the board or system level. Some classes of these errors may require ground intervention. In any event, high reliability systems to be used in such applications may be required to continue operation after such events with little or no external intervention.
Methods, systems, and devices for error detection and correction are provided. Error correction and detection may be performed across multiple dimensions of memory storage, such as across two or more complete memory devices, as well as within individual pages of memory within a single memory device. Error correction and detection performed across two or more complete memory devices may mitigate single event functional interrupts that affect a complete memory device. Error detection and correction performed within individual pages of memory may be used to mitigate single event upset induced single and multiple bit flips within a page of a memory device. A parallel block code, such as a parallel block error correcting code, may be used for error correction and detection performed across two or more complete memory devices. A serial block code, such as a serial block error correcting code, may be used for error correction and detection within individual pages of memory within a single memory device. According to various aspects, parallel block codes also may be used for error correction and detection within individual pages of memory within a memory device.
According to one set of embodiments, a processing system is provided that includes a processor module; a memory module coupled to the processor module comprising a plurality of memory devices, each of the memory devices configured to store data in a predefined plurality of memory pages within the device; and an error detection and correction module coupled with the processor module and memory module and configured to perform first error detection and correction encoding on data to be stored across a plurality of the memory devices and second error detection and correction encoding of data to be stored within pages of data within one or more of the plurality of memory devices. The first error detection and correction may be performed using a parallel block code encoded across the plurality of memory devices. The second error detection and correction may be performed using a serial block code encoded in the plurality of pages within the one or more memory devices. Serial or parallel block codes that may be used may include any suitable type of error correcting code, such as, for example, Reed-Solomon, Hamming, cyclic error-correcting codes such as BCH, forward error correction codes such as turbo codes, low density parity check (LDPC) codes, and triple majority voting (TMV), etc. According to various embodiments, the error detection and correction using serial or parallel block codes may be order independent: either a parallel or serial block code may be used across the plurality of memory devices, and the other of a serial or parallel block code may be encoded in the plurality of pages within the one or more memory devices. In some embodiments, serial or parallel block encoded data is stored within each of the subset of memory devices in spare memory storage at the end of each memory page.
The first error detection and correction encoding may be configured to mitigate single event functional interrupts that affect a complete memory device, and the second error detection and correction encoding may be configured to mitigate single event upset induced single and multiple bit flips within a page of a memory device. The plurality of memory devices may comprise, for example, one or more arrays of flash-based memory devices. According to various examples, other types of memory may be used, such as, for example, (1) NAND and NOR Flash memory including single level and multi-level cells, (2) Ferroelectric RAM (FeRAM, F-RAM, FRAM), (3) Magnetoresistive RAM (MRAM) including memories based on spin torque transfer (STT), (4) Phase-change RAM (PRAM), (5) memristor based memory, (6) Silicon-oxide-nitride-oxide-silicon (SONOS), (7) Resistive RAM (RRAM, ReRAM), (8) Programmable metallization cell (PMC) including conductive-bridging RAM (CBRAM) also known as electrolytic memory, (9) Carbon-nanotube RAM (CNT RAM), (10) Phase-change memory (PRAM, PCRAM, Chalcogenide RAM, C-RAM, CRAM), (11) Dynamic RAM (DRAM) including thyristor RAM (T-RAM), and/or (12) Static RAM (SRAM). The first and second error detection and corrections may be configured to mitigate space radiation effects on the plurality of memory devices.
According to other sets of embodiments, methods for error detection and correction are provided. Exemplary methods may include receiving data to be stored in a memory module, the memory module comprising a plurality of memory devices, each of the memory devices configured to store data in a predefined plurality of memory pages within the device; firstly encoding data to be stored across a plurality of the memory devices according to a first error detection and correction code; and secondly encoding data to be stored in one or more pages of data within one or more of the plurality of memory devices according to a second error detection and correction code. Methods according to various embodiments may also include storing the firstly encoded data in a predefined location in one or more of the memory devices; and storing the secondly encoded data at the end of each respective memory page in which the data is stored. The first error detection and correction code may include a parallel block code encoded across the plurality of memory devices. The second error detection and correction code may include a serial block code for encoding of data stored within a page of data within the one or more memory devices. According to some embodiments, the first error detection and correction code may include a serial block code encoded across the plurality of memory devices, and the second error detection and correction code may include a parallel block code for encoding of data stored within a page of data within the one or more memory devices. The first error detection and correction code may be used to mitigate single event functional interrupts that affect a complete memory device, and the second error detection and correction code may be used to mitigate single event upset induced single and multiple bit flips within a page of a memory device.
The foregoing has outlined rather broadly the features and technical advantages of examples according to the disclosure in order that the detailed description that follows may be better understood. Additional features and advantages will be described hereinafter. The conception and specific examples disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. Such equivalent constructions do not depart from the spirit and scope of the appended claims. Features which are believed to be characteristic of the concepts disclosed herein, both as to their organization and method of operation, together with associated advantages will be better understood from the following description when considered in connection with the accompanying figures. Each of the figures is provided for the purpose of illustration and description only, and not as a definition of the limits of the claims.
A further understanding of the nature and advantages of the present invention may be realized by reference to the following drawings. In the appended figures, similar components or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label by a dash and a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label irrespective of the second reference label.
Methods, systems, and devices for error detection and correction are provided. Error correction and detection may be performed across multiple dimensions of memory storage, such as across two or more complete memory devices, as well as within individual pages of memory within a single memory device. Error correction and detection performed across two or more complete memory devices may mitigate single event functional interrupts that affect a complete memory device. Error detection and correction performed within individual pages of memory may be used to mitigate single event upset induced single and multiple bit flips within a page of a memory device. A parallel block code, such as a parallel block Reed-Solomon code, may be used for error correction and detection performed across two or more complete memory devices. A serial block code, such as a serial block Reed-Solomon code, may be used for error correction and detection within individual pages of memory within a single memory device. Serial or parallel block codes that may be used may include any suitable type of error correcting code, such as, for example, Reed-Solomon, Hamming, cyclic error-correcting codes such as BCH, forward error correction codes such as turbo codes, low density parity check (LDPC) codes, and triple majority voting (TMV), etc. Such multi-dimensional error detection and correction may be used for the mitigation of space radiation effects in a satellite system, for example. Such error correction and detection may also be used in other applications that require a highly fault-tolerant system.
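By way of illustration only, triple majority voting (TMV), the simplest of the codes listed above, may be sketched as follows. The sketch assumes three redundant copies of the same data are available; the function name `tmv_decode` is hypothetical and does not appear elsewhere in this disclosure.

```python
def tmv_decode(copy_a: bytes, copy_b: bytes, copy_c: bytes) -> bytes:
    """Bitwise majority vote over three equal-length copies.

    Each output bit takes the value held by at least two of the
    three copies, so a flipped bit in any single copy is outvoted.
    """
    if not (len(copy_a) == len(copy_b) == len(copy_c)):
        raise ValueError("all three copies must have the same length")
    return bytes(
        (a & b) | (a & c) | (b & c)
        for a, b, c in zip(copy_a, copy_b, copy_c)
    )
```

Such voting corrects a bit position only while at most one of the three copies is in error at that position, which is one reason the stronger block codes listed above may be preferred where storage overhead matters.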
Thus, the following description provides examples, and is not limiting of the scope, applicability, or configuration set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the spirit and scope of the disclosure. Various embodiments may omit, substitute, or add various procedures or components as appropriate. For instance, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. Also, features described with respect to certain embodiments may be combined in other embodiments.
Referring first to
According to various embodiments, system 100 may withstand one or more faults and continue uninterrupted operations. Faults can arise from numerous sources in a particular application environment, such as from the interaction of ionizing radiation with one or more of the processors or memories. In particular, faults can arise from the interaction of ionizing radiation with electronic components, such as processors, controllers, and/or memories, in the space environment. It should be appreciated that ionizing radiation can also arise in other ways, for example, from impurities in solder used in the assembly of electronic components and circuits containing electronic components. These impurities typically cause a very small fraction (e.g., <<1%) of the error rate observed in space radiation environments. Additionally, memory components may have random bit flips that may result in a fault or data corruption if not corrected.
With respect to radiation effects, these effects may induce one or more types of errors in electronic components, and may occur at any point in the mission duration. Such radiation effects include single event upset (SEU), multiple bit upset (MBU), single event functional interrupt (SEFI), and single event transient (SET) errors. SEU, MBU, SEFI and SET can require mitigation at the board and/or system level. Memory and processing systems of the processing/memory module 120, according to various embodiments, are configured to perform multi-dimensional error detection and correction for data stored in memory, and thereby mitigate effects of SEU, MBU, SEFI, and/or SET type errors.
Various embodiments can be constructed and adapted for use in a space environment, generally considered as 50 km altitude or greater, and included as part of the electronics system of one or more of the following: a satellite or spacecraft, a space probe, a space exploration craft or vehicle, an avionics system, a telemetry or data recording system, a communications system, or any other system where memory storage may be useful. Additionally, embodiments may be constructed and adapted for use in a manned or unmanned aircraft including avionics, an unmanned aerial vehicle (UAV), telemetry, communications, navigation systems or a system for use on land or water.
With reference now to
In some embodiments, the first error detection and correction is performed using a parallel block code encoded across the plurality of memory devices of memory module 210. For example, if memory module 210 includes a large number of flash memory devices, blocks of code stored across several of the devices may be encoded by the EDAC module 215. Thus, if one of the devices fails, the missing data from that device may be corrected using the parallel block code. This error correction and detection may thus be used to mitigate SEFIs that affect a complete memory device. This first error detection and correction may be an error detection and correcting code that encodes data stored across several devices of memory module 210. According to some other embodiments, the first error detection and correction code may include serial block code (rather than a parallel block code) encoded across the plurality of memory devices. The second error detection and correction, in some embodiments, is performed using a serial block code encoded in the plurality of pages within the one or more memory devices of memory module 210, and may be used to mitigate single event upset induced single and multiple bit flips within a page of a memory device within memory module 210. The serial block code of the second error detection and correction may be an error detection and correcting code that encodes data within a page of data stored within a memory device. According to some embodiments, the second error detection and correction code may include parallel block code for encoding of data stored within a page of data within the one or more memory devices. In some embodiments, the data encoded using the serial and/or parallel block code is stored within each memory device in spare memory storage at the end of each memory page.
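The recovery of a failed device from data encoded across several devices may be illustrated with a minimal sketch. For brevity the sketch substitutes a single XOR parity page for the parallel block code; a Reed-Solomon or other code as described above would tolerate more failures. The names `encode_stripe` and `recover_stripe` are hypothetical, and a failed device is assumed to be reported as `None`.

```python
from functools import reduce
from typing import List, Optional


def xor_pages(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length pages byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_stripe(pages: List[bytes]) -> List[bytes]:
    """Return the data pages plus one parity page, one page per device."""
    return list(pages) + [reduce(xor_pages, pages)]


def recover_stripe(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the data pages when at most one device returned nothing."""
    missing = [i for i, page in enumerate(stripe) if page is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can rebuild only one failed device")
    repaired = list(stripe)
    if missing:
        survivors = [page for page in stripe if page is not None]
        # The XOR of all surviving pages equals the missing page.
        repaired[missing[0]] = reduce(xor_pages, survivors)
    return repaired[:-1]  # drop the parity page
```

In this sketch each list element stands for the page held by one memory device, so losing one element models a functional interrupt that takes a complete device offline.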
Thus, embodiments provide an efficient implementation of robust error detection and correction systems and methods. Embodiments employing such error correction and detection may allow the use of a smaller quantity of memory and/or fewer processing resources (such as resources within an FPGA) than possible with traditional error correction and detection. Using error detection and correction algorithms across multiple dimensions of a memory system to correct for multiple classes of error mechanisms in spacecraft memory systems may thus provide for robust and efficient spacecraft, where efficient use of resources is highly desirable. The systems and methods of various embodiments of this disclosure also fit well in current flash memory devices by utilizing the spare memory storage at the end of each flash memory page to store the check symbols for the serial block codes on each memory device.
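Storage of check symbols in the spare area at the end of a page may be illustrated with a minimal sketch that uses a single-error-correcting Hamming code in place of the serial block code. The page layout, the two-byte spare length, and the names `write_page` and `read_page` are assumptions made for illustration only. The sketch corrects one flipped bit per page; multiple bit flips would call for a stronger code such as Reed-Solomon or BCH.

```python
def data_positions(nbits):
    """Return the first nbits positive integers that are not powers of two.

    In a Hamming code the power-of-two positions belong to check bits,
    so data bits are assigned the remaining positions.
    """
    out, p = [], 3
    while len(out) < nbits:
        if p & (p - 1):  # non-zero only when p is not a power of two
            out.append(p)
        p += 1
    return out


def check_symbols(data):
    """XOR together the code positions of every set data bit."""
    syndrome = 0
    for i, pos in enumerate(data_positions(len(data) * 8)):
        if (data[i // 8] >> (i % 8)) & 1:
            syndrome ^= pos
    return syndrome


def write_page(data, spare_len=2):
    """Append the check symbols to the data as the page's spare area."""
    return bytes(data) + check_symbols(data).to_bytes(spare_len, "big")


def read_page(page, spare_len=2):
    """Return the page data, correcting a single flipped bit if present."""
    data = bytearray(page[:-spare_len])
    stored = int.from_bytes(page[-spare_len:], "big")
    syndrome = check_symbols(data) ^ stored
    # Zero: no error. Power of two: the flip is in the spare area itself.
    if syndrome == 0 or (syndrome & (syndrome - 1)) == 0:
        return bytes(data)
    positions = data_positions(len(data) * 8)
    if syndrome not in positions:
        raise ValueError("uncorrectable error in page")
    i = positions.index(syndrome)  # the syndrome names the flipped data bit
    data[i // 8] ^= 1 << (i % 8)
    return bytes(data)
```

Because the check symbols grow only logarithmically with page size, a two-byte spare area in this sketch suffices for pages of several kilobytes.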
Referring now to
With reference now to
As mentioned above, various embodiments use a serial block code to encode data stored within pages of data in a memory device. With reference now to
With reference now to
With reference now to
At block 720, encoded data is stored in memory devices. At a later time, data is retrieved from memory devices, as indicated at block 725. At block 730, single event functional interrupts affecting a complete memory device are corrected using the first encoded data. Such correction may use the encoded data to determine any erroneous or missing bits in the data. Finally, at block 735, single and multiple bit flips within a page of a memory device are corrected using the second encoded data. Such correction may use the encoded data to correct erroneous bit(s) in the data. Such errors in data or device failures may be the result of any of a number of situations. For example, in systems operating in a space environment, radiation effects such as described above may impact a memory device, or one or more bits stored within a memory device, resulting in a fault with respect to data stored in the memory devices. The methods described with respect to
The detailed description set forth above in connection with the appended drawings describes exemplary embodiments and does not represent the only embodiments that may be implemented or that are within the scope of the claims. The term “exemplary” used throughout this description means “serving as an example, instance, or illustration,” and not “preferred” or “advantageous over other embodiments.” The detailed description includes specific details for the purpose of providing an understanding of the described components and techniques. These techniques, however, may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.
The various illustrative blocks and modules described in connection with the disclosure herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The functions described herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Other examples and implementations are within the scope and spirit of the disclosure and appended claims. For example, due to the nature of software, functions described above can be implemented using software executed by a processor, hardware, firmware, hardwiring, or combinations of any of these. Features implementing functions may also be physically located at various positions, including being distributed such that portions of functions are implemented at different physical locations. Also, as used herein, including in the claims, “or” as used in a list of items prefaced by “at least one of” indicates a disjunctive list such that, for example, a list of “at least one of A, B, or C” means A or B or C or AB or AC or BC or ABC (i.e., A and B and C).
The previous description of the disclosure is provided to enable a person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Throughout this disclosure the term “example” or “exemplary” indicates an example or instance and does not imply or require any preference for the noted example. Thus, the disclosure is not to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.