The invention relates to the field of error correction codes (ECC) and ECC encoders and more particularly to ECC encoders for use in NAND Flash Memory controllers in devices such as disk drives, solid-state drives (SSDs) and mobile communication systems.
A Flash memory module 101 typically includes a controller 10 that provides the host interface on one side and controls access to an array of NAND Flash memory devices 10F on the other, as shown in
A NAND Flash memory array is grouped into blocks, e.g. a “128 KB” block, each of which must be erased as a unit. Erasing a block sets all bits to 1. A programming operation, which typically can be performed on byte units, changes erased bits from 1 to 0. Each block is further organized into a set of fixed-size pages, for example with each page nominally having 512 bytes, 2 KB, 4 KB, or 8 KB according to the design. For example, a “128 KB” block might have 64 pages that each store 2048 (2K) bytes of data. However, each page typically includes additional “spare” bytes beyond the nominal data size; these are otherwise identical memory cells that can be used for ECC or other system functions. If there are 64 bytes of additional “spare” memory cells, the “2048-byte” page actually includes a total of 2112 bytes of memory.
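The page and block figures in the example above can be checked with a few lines of arithmetic. The following Python sketch uses only the numbers stated in the text; the variable names are illustrative and are not part of any device specification.

```python
# Figures taken from the example in the text; names are illustrative only.
BLOCK_NOMINAL_BYTES = 128 * 1024   # the "128 KB" block
PAGE_DATA_BYTES = 2048             # nominal page size
PAGE_SPARE_BYTES = 64              # spare cells per page

pages_per_block = BLOCK_NOMINAL_BYTES // PAGE_DATA_BYTES
page_total_bytes = PAGE_DATA_BYTES + PAGE_SPARE_BYTES
block_total_bytes = pages_per_block * page_total_bytes

print(pages_per_block, page_total_bytes, block_total_bytes)
```

The physical block is therefore somewhat larger than its nominal size once the spare cells are counted.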
NAND Flash memory devices typically require associated error correction code (ECC) systems to provide data integrity given the frequency of bad blocks. Flash memory controllers typically include an error correction code (ECC) encoder 10E capability that can be enabled when required. With ECC enabled, a programming operation includes the generation of a set of redundant parity or check bits that are calculated using the data bytes to be stored in the sector or block. The ECC bits are written to the memory along with the corresponding data. When the data is read back, the ECC bits are also read, and the ECC Decoder 10D system uses the ECC bits for error detection and correction within the system's limitations. The number of errors that can be corrected depends on the design. When writing data and ECC information to a page, the ECC information can be written as a contiguous set of bytes that is, in effect, appended to the data; it is also possible to interleave data and ECC information. The ECC check bits are calculated from a predetermined unit of data, which does not necessarily correspond to the page size. Thus the ECC unit is sometimes called a sector to distinguish it from a page.
ECC engines (encoders and decoders) can be embedded in the controller chip hardware or ECC can be provided externally by hardware or software. A NAND Flash controller can implement on-the-fly correction by using a buffer to store data while the ECC decoder performs the computations needed for the correction. The ECC algorithms that are often mentioned for use with Flash memory are Hamming codes, Reed-Solomon codes and BCH codes. Bose-Chaudhuri-Hocquenghem (BCH) codes, which are a type of cyclic error-correcting codes that use finite fields, are the subject of the present application. BCH codes are advantageous in that they allow an arbitrary level of error correction and are relatively efficient in the number of gates required in a hardware implementation.
A multi-bit error correction based on a BCH code for a memory is described in US patent application 20120311399 by Yufei Li, et al., published Jun. 12, 2012. The error correction process includes repeatedly shifting the BCH code and, at the same time, determining whether the number of errors decreases.
US patent application 2011/0185265 by Cherukari, published Jul. 28, 2011, describes an agile encoder for encoding a linear cyclic code such as a BCH code. The generator polynomial for the BCH code is provided in the factored form. The number of factored polynomials (minimal polynomials) chosen by the system determines the strength of the BCH code. The strength can vary from a weak code to a strong code in unit increments without a penalty on storage requirements for storing the factored polynomials.
U.S. Pat. No. 6,519,738 to J. Derby (Feb. 11, 2003) describes a cyclic redundancy code (CRC) computation based on state-variable transformation. The method computes a CRC of a communication data stream taking a number of bits M at a time to achieve a throughput equaling M times that of a bit-at-a-time CRC computation operating at a same circuit clock speed. The method includes (i) representing a frame of the data stream to be protected as a polynomial input sequence; (ii) determining one or more matrices and vectors relating the polynomial input sequence to a state vector; and (iii) applying a linear transform matrix for the polynomial input sequence to obtain a transformed version of the state vector.
U.S. Pat. No. 7,539,918 to Keshab Parhi (May 26, 2009) also describes a method for generating cyclic codes for error control in digital communications.
U.S. Pat. No. 8,286,059 to C. Huang, Oct. 9, 2012, describes a word-serial cyclic code encoder. The cyclic code encoder adds input words to output register words, generating a feedback word, which can be supplied through a feedback loop that selectively transmits feedback words through weight arrays and intra-register adders, to the input of word registers. A controller can operate the cyclic code encoder in either an input mode or an output mode during which feedback words can be sequentially transmitted on the feedback loop and the states of the word registers can be updated and the final states of the word registers can be sequentially shifted out of the output word register as parity words, respectively.
Linear feedback shift registers (LFSR) are used in the cyclic redundancy check (CRC) operations and BCH encoders. Manohar Ayinala, et al. have discussed unfolding techniques for implementing parallel linear feedback shift register (LFSR) architectures. (Manohar Ayinala, et al., High-Speed Parallel Architectures for Linear Feedback Shift Registers; IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 59, NO. 9, SEPTEMBER 2011, pp. 4459-4469.)
Recent Flash memory applications require an ECC encoder that cannot be implemented by a standard bit-serial Linear Feedback Shift Register (LFSR), for two reasons: the encoder must accept multiple bits per clock, and the long feedback ‘fan-out’ of a large LFSR limits the frequency at which the encoder can be used. The prior art attempts to solve these two problems by ‘LFSR-Unfolding’ and the Chinese-Remainder-Theorem (CRT), where LFSR-unfolding solves the multiple bit throughput problem and CRT addresses the long ‘fan-out’ problem. There is a need to provide one solution that solves both problems.
Embodiments of the invention are methods of encoding and ECC Encoders that process packets of p bits (with p>1) in a data block in parallel and generate a set of parity/check bits that are stored along with the original data in the memory block and allow correction of errors when the block is read back. Encoders according to the invention can be used to create a nonvolatile NAND Flash memory write cache with BCH-ECC for use in a disk drive that can speed up the response time for some write operations. The terms “parity bits” and “check bits” are used interchangeably herein. Embodiments can be designed to efficiently provide correction of a very large number (t) of bit errors in a data block during read back. Encoder embodiments of the invention use Partial-Parity Feedback along with an XOR-Matrix Logic Module, which calculates N output bits from p input bits, and a Shift Register Module that accumulates N check bits, where N is the number of parity/check bits for the data block and N is greater than p. The XOR-Matrix Logic Module is designed using a precalculated Matrix of p×N bits, which is translated into VHDL design language to generate the hardware gates. High-Order p-bit Partial-Parity Feedback improves over LFSR designs and achieves Minimal Critical Path Length:=p.
Embodiments of the present invention precalculate the entries for the Matrix by finding the remainder polynomials of all the single-bit inputs, within a p-bit window-input, and constructing a p×N basis matrix that can be directly converted to VHDL-XOR-logic. The p-bit Partial-Parity Feedback used, which is the length of the critical path, is much smaller than the LFSR-feedback, and is optimal, as it is equal to the ‘bus width’. The selected value for p is predetermined by the design. An exemplary embodiment uses p=16, but higher or lower values can be selected according to the principles of the invention. Higher values for p imply wider bus widths and increased speed at the expense of more circuitry.
As the packets of p bits are iteratively processed, the highest p bits in the Shift Register from the previous cycle are shifted out and fed back as the Partial Parity Feedback to be XOR'ed with the next p-bit input packet. The lowest p bits in the Shift Register are loaded with zeroes on each cycle. The XOR Array Multiplier iteratively accepts packets of p bits as input and generates parallel output of N bits that are fed to the Shift Register Module which XOR's the shifted contents of the Shift Register to generate the new Shift Register content. The contents of the Shift Register, at the end of iteratively processing the set of packets for the input data unit, are the N check bits corresponding to the data block.
An exemplary embodiment for an ECC block with 1088 data bytes (2 pages of 544 bytes each) uses p=16 and a t=42 bit-correction capability with a Galois Field GF(2^14), which gives N=588 required parity bits and a 588-bit Shift Register. The XOR-Matrix Logic Module accordingly has a 16-bit wide data input and a 588-bit parity output to the 588-bit Shift Register Module. The output parity bits are in low-to-high order and the 16-bit data input is in high-to-low order. The final set of parity values, accumulated in the 588-bit Shift Register, are read out in high-to-low order, i.e. in the reverse order.
In the exemplary embodiment the input data is processed in 16-bit packets. The 588-bit Shift Register is initialized with zeroes. At the start of each cycle the contents of the 588-bit Shift Register are shifted up 16 bits and the most significant 16 bits, which are shifted out, are latched for use as the Partial-Parity Feedback into the first processing stage. As 16 bits are shifted out at the top, 16 bits of zeroes are shifted in at the bottom of the Shift Register. Each 16-bit packet is XOR'ed with the latched 16 bits that were shifted out from the 588-bit Shift Register. The result of the first stage is then multiplied by the 16-by-588 Matrix to produce a new 588-bit second stage output that is XOR'ed with the shifted 588-bit Register content to form the new Shift Register content. This cycle is repeated until the last 16-bit packet has been processed. The final 588 bits in the Register are clocked out and stored with the data block. The design and operation of the Decoder follows from the specification of the Encoder as described herein and can be otherwise implemented using prior art principles.
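The cycle described above (shift up by p, latch the p bits shifted out, XOR them with the next packet, multiply by the Matrix, XOR into the register) can be modeled in software. The following Python sketch is a toy model under stated assumptions, not the hardware of the embodiment: it uses the small generator polynomial x^8+x^7+x^6+x^4+1 with N=8 and p=4 in place of the 588-bit, 16-bit design, it assumes that Matrix row j holds the remainder of x^(N+j) modulo the generator, and it does not reproduce the bit ordering of the embodiment. The function names are hypothetical.

```python
def poly_mod(a, g):
    """Remainder of a(x) / g(x) over GF(2); bit i of an int is the x^i coefficient."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

def encode_parallel(packets, g, p):
    """Process p-bit packets (most significant packet first) with partial-parity feedback."""
    n = g.bit_length() - 1                      # number of check bits N
    # Assumed matrix: row j is the remainder of the single-bit input x^(N+j).
    rows = [poly_mod(1 << (n + j), g) for j in range(p)]
    mask = (1 << n) - 1
    reg = 0                                     # shift register starts at zero
    for d in packets:
        feedback = reg >> (n - p)               # top p bits shifted out and latched
        reg = (reg << p) & mask                 # zeroes enter at the bottom
        s = d ^ feedback                        # first stage: packet XOR feedback
        for j in range(p):                      # second stage: XOR-matrix multiply
            if (s >> j) & 1:
                reg ^= rows[j]
    return reg

G = 0b111010001                                 # toy generator, degree 8 (assumption)
packets = [0xA, 0x3, 0xF, 0x1]                  # four 4-bit packets
check_bits = encode_parallel(packets, G, 4)
reference = poly_mod(0xA3F1 << 8, G)            # plain long division for comparison
print(check_bits == reference)
```

In this model the register contents after the last packet equal the remainder obtained by ordinary polynomial long division of the whole message, which is the property the parallel structure is meant to preserve.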
An ECC encoder embodiment of the invention can be used in various applications, but in particular a Flash memory controller with an ECC encoder embodiment of the invention can be included in a disk drive for use, for example, as a write cache, to create a nonvolatile memory (NVM) with BCH-ECC that will speed up the response time for certain commands while ensuring high data reliability.
An ECC Encoder 11 embodiment of the invention including XOR Matrix Logic Module 13, Register Module 12, Partial-Parity Feedback Latch 28 and XOR input module 14 is illustrated in
The Encoder 11 processes packets of 16 bits at a time; therefore, 544 iterations/cycles are needed to process the 1088 byte data block 201 and generate the 588 check bits 202 that will be stored along with the original data in the Flash memory. The Shift Register 12R and Output Register 27 are initialized to all zeroes at the start of each data block. In each 16-bit cycle iteration the contents of the Shift Register are shifted up 16 bits in response to the Shift_16 Control line and the lowest 16 bits in the Shift Register are loaded with zeroes. Thus, as 16 bits are shifted out at the top, 16 bits of zeroes are shifted into the bottom of the Shift Register. The highest 16 bits in the Shift Register (which are from the previous cycle except for the first iteration) are shifted out and stored in Partial-Parity Feedback Latch 28, which feeds the bits back to be XOR'ed with the 16-bit input packet by XOR Module 14. The contents of the Shift Register after the shift operation are loaded into Output Register 27 as part of each iteration. In the last iteration, the final contents of the Shift Register are loaded into Output Register 27 without shifting to supply the final check bits at the end of the process. Output Register 27 also supplies input back to XOR module 25, which also has input from the XOR Matrix Logic Module (XMLM) 13.
The XOR Matrix Logic Module 13 iteratively accepts packets of p bits (with p=16) as input and generates parallel output of N bits (with N=588) that are fed to the Register Module 12. Register Module 12 XOR's the new input with the current contents of the Output Register 27 to generate the new Shift Register content. The contents of the Output Register, at the end of iteratively processing the set of packets for the input data block, are the N check bits corresponding to the data block. In this embodiment the output check/parity bits are in low-to-high order and the 16-bit data input is in high-to-low order. The final set of parity/check values, accumulated in the 588-bit Output Register, are read out in high-to-low order, i.e. in the reverse order.
Each 16-bit input packet is XOR'ed with the Partial-Parity Feedback Latch's 16 bits by the XOR logic module 14, which generates a 16-bit result that is input into the XOR Matrix Logic Module (XMLM) 13. The XMLM takes the output of XOR logic module 14 and produces a 588-bit second stage output that is sent to Register Module 12. Register Module 12 XOR's the new input with the current/old 588-bit Register content to form the new Shift Register content. This cycle is repeated until the last 16-bit packet has been processed. The final 588 bits in the Output Register are clocked out and stored with the data block.
The P(i) result is then XOR'ed with the (old) content of the Shift Register to derive the new content of the Shift Register 45. Note that in the hardware diagram in
The predetermined functions that map the p bits in S′(i) to N bits in P(i) are determined by generating a p×N Matrix. Embodiments of the present invention precalculate the entries for the Matrix by finding the remainder polynomials of all the single-bit inputs, within a p-bit window-input, and constructing a p×N basis matrix that can be directly converted to VHDL-XOR-logic. The p-bit feedback used, which is the length of the critical path, is much smaller than the LFSR-feedback, and is optimal, as it is equal to the ‘bus width’.
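The construction of the basis matrix can be illustrated on a small scale. The following Python sketch assumes that the “single-bit inputs” are the monomials x^(N+j) for j=0, . . . , p−1 and that each Matrix row holds the coefficients of the corresponding remainder, low order first; both are assumptions made for illustration. It uses a toy generator of degree 8 with p=4 in place of the 16-by-588 Matrix, and the function names are hypothetical.

```python
def poly_mod(a, g):
    """Remainder of a(x) / g(x) over GF(2); bit i of an int is the x^i coefficient."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

def basis_matrix(g, p):
    """p-by-N matrix of 0/1 entries; row j is the remainder of x^(N+j) mod g."""
    n = g.bit_length() - 1
    rows = []
    for j in range(p):
        r = poly_mod(1 << (n + j), g)
        rows.append([(r >> k) & 1 for k in range(n)])
    return rows

def apply_matrix(rows, s):
    """XOR together the rows selected by the set bits of the p-bit word s."""
    out = [0] * len(rows[0])
    for j, row in enumerate(rows):
        if (s >> j) & 1:
            out = [a ^ b for a, b in zip(out, row)]
    return out

G = 0b111010001            # toy generator x^8+x^7+x^6+x^4+1 (assumption)
rows = basis_matrix(G, 4)
for row in rows:
    print(row)
```

Because the remainder operation is linear over GF(2), XOR-ing the rows selected by any p-bit word gives the same result as dividing that word, shifted up by N, by the generator. This linearity is what allows each output bit to be written as a fixed XOR of input bits.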
The assumed design parameters require a high bit-correction capability of “t=42” for a 2-page (544 bytes each) total block of 8*2*544=8,704 bits. This number is bigger than 2^13, but smaller than 2^14; thus the Galois Field (GF) required to locate bit errors within the 8,704-bit data block is GF(2^14), and the number of parity bits required to correct 42 bit errors is 42*14=588 bits. The coded data block thus consists of 8,704 data bits+588 parity bits=9,292 bits. However, this number is not divisible by 14; making it divisible by 14 requires a “pad” of 4 bits, which brings the coded block size to 9,296. Hence the BCH-Code is [k=8,704, n=9,296, t=42], where “k” is the number of uncoded data bits, “n” is the number of coded block bits and “t” is the bit-correction capability.
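The parameter derivation above can be reproduced step by step. The following Python sketch uses only the figures given in the text; the variable names are illustrative.

```python
PAGE_BYTES = 544
PAGES = 2
T = 42                      # bit-correction capability t
P = 16                      # bits processed per clock

k = 8 * PAGES * PAGE_BYTES  # uncoded data bits
m = k.bit_length()          # smallest m with k < 2**m, i.e. the field GF(2^m)
parity_bits = T * m         # N = t*m check bits
coded_bits = k + parity_bits
pad_bits = (-coded_bits) % m            # padding to reach a multiple of m
n = coded_bits + pad_bits               # coded block size
cycles = k // P                         # p-bit packets per data block

print(k, m, parity_bits, coded_bits, pad_bits, n, cycles)
```

The last value, the number of 16-bit packets per data block, matches the 544 iterations stated for the exemplary encoder.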
An additional assumed requirement of the design is that data is processed at a rate of “p=16”/system clock, i.e. the encoder/decoder hardware has to process the data in 16-bit “packets”. A system with a 16-bit wide/588-bit Binary Encoder according to an embodiment of the invention should also include a corresponding Decoder that will include Functional Units of:
The generator polynomial “g(y)” of a t-bit error correcting BCH-Code, of block size “2^(m−1)<N<2^m”, is the least-common-multiple (LCM) of the minimal polynomials of its roots, “g(a^i)=0, i=1, . . . , 2t”, where “a” is the primitive element of the Galois Field “GF(2^m)”. The block N requires “m=14”, where the Galois Field GF(2^14) is generated by a quadratic extension of GF(2^7). Since the application requires “t=42”, calculation of 42 minimal polynomials is required, each of degree “m=14” and, since they have no common factors, their “LCM” equals their product, a binary polynomial “g(y)” of degree 14*42=588.
The calculation of these 42 minimal polynomials is effectively done by resultants, using standard mathematics. The resultant of two polynomials is a polynomial expression of their coefficients and can be computed using standard computer algebra systems. There are two nested resultant calculations, “resultant {resultant [y−(u*v+1)^k, u^7+u+1, u], v^2+v+1, v}, for k=1, . . . , 42”. The first resultant calculation uses “u^7+u+1” [which generates GF(2^7)], and the second uses “v^2+v+1”, which is the quadratic extension of GF(2^7) to GF(2^14). The output of this calculation is a list of 42 polynomials in the variable “y”, of degree 14 each, that have no common factor. Their product is the degree-588 generator polynomial “g(y)”.
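For readers without a computer algebra system, the idea of building a generator polynomial from minimal polynomials can be shown on a small field. The following Python sketch does not use resultants; it substitutes the textbook method of multiplying (y+c) over the conjugates c of each root, which is a different technique from the one described above. It works in GF(2^4) with the primitive polynomial x^4+x+1 and t=2, both chosen only for illustration, and it skips repeated minimal polynomials so that the product is the LCM. All function names are hypothetical.

```python
def gf_mul(a, b, prim, m):
    """Multiply two GF(2^m) elements (ints) modulo the primitive polynomial prim."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= prim
    return r

def minimal_poly(beta, prim, m):
    """Minimal polynomial of beta over GF(2), as an int (bit i = coefficient of y^i)."""
    conj, c = [], beta
    while c not in conj:                 # collect beta, beta^2, beta^4, ...
        conj.append(c)
        c = gf_mul(c, c, prim, m)
    coeffs = [1]                         # coeffs[i] is the coefficient of y^i
    for c in conj:                       # multiply by (y + c)
        nxt = [0] * (len(coeffs) + 1)
        for i, a in enumerate(coeffs):
            nxt[i + 1] ^= a
            nxt[i] ^= gf_mul(a, c, prim, m)
        coeffs = nxt
    return sum(bit << i for i, bit in enumerate(coeffs))   # coefficients are 0 or 1

def poly_mul(a, b):
    """Product of two binary polynomials held as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def bch_generator(t, prim, m):
    """Distinct minimal polynomials of a^1..a^(2t) and their product."""
    factors, power = [], 1
    for _ in range(2 * t):
        power = gf_mul(power, 2, prim, m)        # next power of a = 2
        mp = minimal_poly(power, prim, m)
        if mp not in factors:
            factors.append(mp)
    g = 1
    for f in factors:
        g = poly_mul(g, f)
    return factors, g

factors, g = bch_generator(2, 0b10011, 4)
print([bin(f) for f in factors], bin(g))
```

In this toy case the two distinct factors have degree 4 and the generator has degree 8, mirroring on a small scale the relation between the degree-14 factors and the degree-588 generator.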
These 42 polynomials have no common factors; thus their product, a polynomial of degree 42*14=588, is the encoder polynomial “g_{588}(y)”, shown in
A textbook Linear-Feedback-Shift-Register (LFSR), which is the standard circuit for implementing a BCH-Encoder, is a shift register that is hardwired by the binary coefficients of the encoder polynomial. For the application described herein this register would be 588-units long, and its critical path feedback would be too long for a 270-MHz clock implementation. Furthermore it is a single-bit bus encoder.
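For comparison, the bit-serial behavior of a textbook LFSR encoder can be modeled in a few lines. The following Python sketch is a software model only, with a toy generator of degree 8 standing in for the 588-unit register; the function names are hypothetical. It consumes one data bit per step, which is the single-bit bus limitation noted above.

```python
def poly_mod(a, g):
    """Reference long division over GF(2); bit i of an int is the x^i coefficient."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

def lfsr_encode(bits, g):
    """Bit-serial LFSR: one data bit per step, most significant bit first."""
    n = g.bit_length() - 1
    mask = (1 << n) - 1
    taps = g & mask                  # generator coefficients below the leading term
    reg = 0
    for b in bits:
        feedback = ((reg >> (n - 1)) & 1) ^ b   # top register bit XOR data bit
        reg = (reg << 1) & mask                 # shift by a single position
        if feedback:
            reg ^= taps                         # feedback fans out to every tap
    return reg

G = 0b111010001                      # toy generator x^8+x^7+x^6+x^4+1 (assumption)
data = [1, 0, 1, 1, 0, 0, 1]
parity = lfsr_encode(data, G)
print(parity == poly_mod(0b1011001 << 8, G))
```

In hardware the single feedback bit drives every tap position of the register, which is the fan-out that grows with the register length.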
The solution of these two problems in embodiments of the invention results in the implementation of a minimal critical path, high-speed parallel BCH ECC encoder. The Ayinala 2011 article cited above provides background on LFSR-Unfolding concepts.
CRT reduces the critical path feedback by parallel division of the data input, by the individual 42 polynomials of degree 14 each, but it is still a single bit input processor. Thus prior art LFSR unfolding solves LFSR “p-Parallel Bit” Encoding and Chinese-Remainder-Theorem (CRT) can be used to reduce LFSR “t*m” Critical Path Length [where “m”:=Error Locator GF Size].
The disclosed solution in embodiments of the present invention results in a “p-by-tm” XOR-VHDL Matrix-Encoder with High-Order “p”-bit Partial-Parity Feedback, which eliminates the LFSR while solving both stated problems and achieving Minimal Critical Path Length:=“p”.
The calculation of the minimal critical path feedback/programmable parallel-p-packet BCH encoder 11 solution, as shown in
The coefficients of these polynomials form a Boolean matrix (e.g. “tmatarray”), of 16-by-588:
tmatarray=transpose(matrix[coefficients(rk(y))]) (equ-2)
This Matrix is directly translated into standard hardware description language VHDL (VHSIC Hardware Description Language) Logic, as illustrated below. There are 16 input bits (i:in bit_vector(0 to 15)) and 588 output bits (o:out bit_vector(0 to 587)). Each of the output bits is a predetermined function of selected input bits. For example, the first output bit defined below “o(0)” is the XOR of input bits 0, 4, 5, 7, 9, 10, 11, 12, and 14. Output bits o(6) through o(584) are omitted for brevity. The omitted entries are determined as described above.
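The translation from a Boolean matrix to XOR logic is mechanical and can be scripted. The following Python sketch is one possible way to emit VHDL-style assignments from a matrix; it is not the listing of the embodiment. The signal names “i” and “o” follow the text, while the function name and the small 3-by-4 example matrix are hypothetical and chosen only for illustration.

```python
def xor_equations(rows):
    """rows[j][k] == 1 means input bit j contributes to output bit k."""
    p, n = len(rows), len(rows[0])
    lines = []
    for k in range(n):
        terms = [f"i({j})" for j in range(p) if rows[j][k]]
        rhs = " xor ".join(terms) if terms else "'0'"   # no contributing inputs
        lines.append(f"o({k}) <= {rhs};")
    return lines

# Hypothetical 3-by-4 matrix, not taken from the embodiment.
example = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
for line in xor_equations(example):
    print(line)
```

Each column of the matrix becomes one output assignment, so a 16-by-588 matrix would yield 588 such lines, each an XOR of at most 16 inputs.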
The resulting circuit architecture embodiment of the invention shown in