The present invention generally relates to error correction coding for information transmission, storage and processing systems, such as wired and wireless communication systems, optical communication systems, computer memories, and mass data storage systems. More particularly, it relates to an encoding method and apparatus for block codes such as low-density parity check (LDPC) codes, and more specifically to LDPC codes with block parity check matrices, called quasi-cyclic (QC) LDPC codes.
Low-Density Parity Check (LDPC) Codes
Error correcting codes play a vital role in communication, computer, and storage systems by ensuring the integrity of data. The past decade has witnessed a surge in research in coding theory which resulted in the development of efficient coding schemes based on low-density parity check (LDPC) codes. Iterative message passing decoding algorithms together with suitably designed LDPC code ensembles have been shown to approach the information-theoretic channel capacity in the limit of infinite codeword length. LDPC codes are standardized in a number of applications such as wireless networks, satellite communications, deep-space communications, and power line communications.
For an (N, K) LDPC code with length N and dimension K, the parity check matrix (PCM) H of size M×N=(N−K)×N (assuming that H is full rank) is composed of a small number of ones. We denote the degree of the j-th column, i.e. the number of ones in the j-th column, by dv(j), 1≤j≤N. Similarly, we denote the degree of the i-th row, i.e. the number of ones in the i-th row, by dc(i), 1≤i≤M. Further, we define the maximum degree for the rows and columns:
$$\gamma = \max_{1 \le j \le N} d_v(j), \qquad \rho = \max_{1 \le i \le M} d_c(i) \tag{1}$$
When the number of ones in the columns and the rows of H is constant, the LDPC code is called regular; otherwise the LDPC code is said to be irregular. For regular LDPC codes, we have γ=dv=dv(j), 1≤j≤N, and ρ=dc=dc(i), 1≤i≤M. The (dv, dc)-regular LDPC code ensemble represents a special type of LDPC codes of particular interest. For this type, the code rate is R=K/N=1−dv/dc if the PCM H is full rank.
If a binary column vector of length N, denoted x=[x1, x2, . . . , xN]T, is a codeword, then it satisfies Hx=0, where the operations of multiplication and addition are performed in the binary field GF(2), and 0 is the length-M all-zero column vector. The superscript T denotes transposition, for both vectors and matrices. An element in a matrix can be denoted interchangeably by Hm,n or H(m, n). Similarly, an element in a vector is denoted by xn or x(n). The horizontal concatenation, respectively vertical concatenation, of vectors and matrices is denoted [A, B], respectively [A; B].
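The definitions above can be illustrated with a minimal sketch. The toy matrix `H` and the helper names below are assumptions made for illustration only and do not come from this disclosure.

```python
def column_degrees(H):
    """dv(j): number of ones in each column of H."""
    return [sum(row[j] for row in H) for j in range(len(H[0]))]

def row_degrees(H):
    """dc(i): number of ones in each row of H."""
    return [sum(row) for row in H]

def is_codeword(H, x):
    """True when H x = 0 over GF(2)."""
    return all(sum(h & b for h, b in zip(row, x)) % 2 == 0 for row in H)

# Hypothetical (2, 4)-regular toy matrix, not taken from the source.
H = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
dv, dc = column_degrees(H), row_degrees(H)
gamma, rho = max(dv), max(dc)            # maximum column and row degrees
is_regular = len(set(dv)) == 1 and len(set(dc)) == 1
design_rate = 1 - gamma / rho            # equals K/N only when H is full rank
```

Note that the formula R=1−dv/dc only gives the true rate when H is full rank, which a toy matrix such as this one need not satisfy.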
Quasi-Cyclic LDPC Codes
The present invention particularly relates to the class of quasi-cyclic LDPC codes (QC-LDPC). In QC-LDPC codes, the PCM H is composed of square blocks or submatrices of size L×L, as described in Equation (2), in which each block Hi,j is either (i) an all-zero L×L block, or (ii) a circulant permutation matrix (CPM).
A CPM is defined as the power of a primitive element of a cyclic group. The primitive element is defined, for example, by the L×L matrix, α, shown in Equation (3) for the case of L=8. As a result, a CPM αk, with k∈{0, . . . , L−1}, has the form of the identity matrix, shifted k positions to the left. In other words, the row-index of the nonzero value of the first column of αk is k+1. The value of k is referred to as the CPM value. The main feature of a CPM is that it has only a single nonzero element in each row/column and can be defined by its first row/column together with a process to generate the remaining rows/columns. The simplicity of this process translates to low hardware resource requirements for realizing the physical connections between subsets of codeword bits and parity check equations in an LDPC encoder or decoder.
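As a minimal sketch of the CPM definition, the code below builds αk from the stated property (the nonzero entry of the first column lies in row k+1, i.e. index k when counting from zero) and checks that multiplying a vector by αk amounts to a cyclic shift. The function names are hypothetical.

```python
def cpm(L, k):
    """L x L matrix alpha^k: the single 1 of column j sits in row (j + k) mod L."""
    return [[1 if i == (j + k) % L else 0 for j in range(L)] for i in range(L)]

def matmul_gf2(A, B):
    """Matrix product over GF(2)."""
    return [[sum(A[i][t] & B[t][j] for t in range(len(B))) % 2
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec_gf2(A, x):
    """Matrix-vector product over GF(2)."""
    return [sum(a & b for a, b in zip(row, x)) % 2 for row in A]

def cyclic_shift(x, k):
    """Shift equivalent to alpha^k x: entry i of the result is x[(i - k) mod L]."""
    L = len(x)
    return [x[(i - k) % L] for i in range(L)]

L, k = 8, 3
alpha = cpm(L, 1)
power = cpm(L, 0)                 # identity matrix
for _ in range(k):                # alpha multiplied k times
    power = matmul_gf2(alpha, power)
x = [1, 0, 1, 1, 0, 0, 0, 1]
shifted = matvec_gf2(power, x)    # same bits as cyclic_shift(x, k)
```

Under this convention a hardware barrel shifter can replace the matrix product entirely, since only the shift value k needs to be stored.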
The PCM of a QC-LDPC code can be conveniently represented by a base matrix (or protograph matrix) B, with Mb rows and Nb columns, which contains integer values indicating the powers of the primitive element for each block Hi,j. Consequently, the dimensions of the base matrix are related to the dimensions of the PCM in the following way: M=Mb L, N=Nb L, and K=Kb L (assuming that H is full rank). An example of matrices H and B for Mb×Nb=4×5 and L=8 is shown in Equation (4).
where I=α0 is the identity matrix, and by convention α−∞=0 is the all-zero L×L matrix.
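The relation between B and H can be sketched as follows. The base matrix used here is a hypothetical 2×3 example (it is not the matrix of Equation (4)), and `None` is an assumed encoding of the exponent −∞, i.e. the all-zero block.

```python
def expand(B, L):
    """Expand a base matrix into the binary PCM; None stands for alpha^(-inf)."""
    H = []
    for block_row in B:
        for i in range(L):                      # i-th row inside the block-row
            row = []
            for k in block_row:
                if k is None:
                    row += [0] * L              # all-zero L x L block
                else:
                    row += [1 if i == (j + k) % L else 0 for j in range(L)]
            H.append(row)
    return H

B = [[0, 1, None],
     [None, 2, 0]]        # hypothetical base matrix, Mb = 2, Nb = 3
L = 4
H = expand(B, L)          # binary matrix of size (Mb * L) x (Nb * L)
```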
General Circulant Matrices
In this invention, the blocks Hi,j are square L×L matrices, which are either a CPM or an all-zero block. However, a codeword x generally cannot be encoded directly from the PCM; encoding instead uses a generator matrix G, as explained in paragraph 006. There are rare exceptions where the encoding can be performed directly with H, as in the case of low density generator matrix (LDGM) codes, but we do not discuss those cases in this disclosure.
General circulant matrices, not restricted to single CPMs, can appear during the process of computing G from H. A circulant matrix C is defined by the sum of w CPMs with distinct powers:
$$C = \sum_{k=1}^{w} \alpha^{i_k} \tag{5}$$
where ik≠ik′ for all k≠k′, with 0≤ik≤L−1 for 1≤k≤w. In this definition, w is called the weight of circulant C, and we have 0≤w≤L. Special cases of circulants are: (i) when w=0, the circulant is the all-zero block, (ii) when w=1, the circulant is a CPM.
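A minimal sketch of circulant arithmetic follows, assuming a circulant is stored as the set of its exponents ik (a representation chosen here for illustration, not prescribed by this disclosure). It shows how a product of two low-weight circulants can yield a higher weight, and how repeated exponents cancel in GF(2).

```python
def circ_add(C1, C2):
    """Sum of two circulants over GF(2): exponents present in both cancel."""
    return set(C1) ^ set(C2)

def circ_mul(C1, C2, L):
    """Product of two circulants: exponents add modulo L, duplicates cancel."""
    out = set()
    for a in C1:
        for b in C2:
            out ^= {(a + b) % L}
    return out

def circ_apply(C, x):
    """Multiply a circulant by a length-L vector using only cyclic shifts."""
    L = len(x)
    y = [0] * L
    for k in C:
        for i in range(L):
            y[i] ^= x[(i - k) % L]
    return y

L = 8
product = circ_mul({0, 1}, {0, 3}, L)   # weight grows from 2 to 4
square = circ_mul({0, 1}, {0, 1}, L)    # exponent 1 appears twice and cancels
total = circ_add({0, 1}, {1, 5})
y = circ_apply({0, 1}, [1, 0, 0, 0, 0, 0, 0, 0])
```

The weight w of the result is simply the size of the returned set, which makes the growth of circulant weights during elimination easy to track.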
Encoding
This invention pertains to LDPC codes and their encoders and is applicable to any QC-LDPC code class, regular or irregular.
A codeword of an LDPC code, x=[x1, x2, . . . , xN]T, is created from a vector of information bits u=[u1, u2, . . . , uK]T, following the equation x=G u, where G is a generator matrix of the code. Without loss of generality, we assume that G is put in a systematic form, and that the codeword is organized as
$$x = [u;\, r] \tag{6}$$
where r=[r1, r2, . . . , rM]T is the vector of redundancy bits.
In general, a generator matrix G can be obtained from a parity check matrix H by computing the reduced row echelon form of H, by means of Gaussian Elimination (GE). Since H is composed of circulants, and it is known that the inverse of a circulant matrix is also a circulant matrix, it follows that G is also composed of circulants, although not necessarily reduced to CPMs. The encoding equation becomes:
where P is a dense matrix composed of general circulants.
For QC-LDPC codes, the GE process can be performed at the circulant level, making use only of operations on circulant matrices (multiplications, additions, inverse). This also means that the leading coefficients used to pivot the matrix during GE are themselves circulant matrices. Two operations can increase the weights of the circulants:
An alternative to using a direct inverse of matrix H is to transform it into an upper triangular matrix, using a greedy Sparse Gaussian Elimination. In GE, one can stop the process when the block-row echelon form has been found, that is, when the resulting matrix is in upper triangular form. For a matrix which contains all-zero blocks and a small number of CPMs, the Gaussian Elimination algorithm can be constrained such that the resulting upper triangular matrix is as sparse as possible. Indeed, while the reduced block-row echelon form is unique and corresponds to the inverse of the original matrix, the block-row echelon form is not unique and depends on the sequence of leading coefficients that are used to perform the Gaussian Elimination. In other words, depending on the sequence of leading coefficients, one can get different encoding matrices with different sparseness. The encoding equation becomes
where H1 is a dense matrix of size Mb×Kb composed of general circulants, and H2 is an upper triangular matrix (at the block level), of size Mb×Mb, also composed of general circulants. The encoder proceeds in two steps: (i) the first step consists in computing the parity check values of the first part of Equation (8), c=H1 u, (ii) the second step recursively computes the values of the redundancy bits r using backward propagation, since the matrix H2 is upper triangular.
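The two-step procedure can be sketched at the bit level as follows. This is a simplified illustration, not the block-level method of the disclosure: the matrices are small hypothetical binary matrices, and H2 is assumed to have ones on its diagonal so that each redundancy bit is obtained by substitution.

```python
def encode_triangular(H1, H2, u):
    """Step (i): c = H1 u. Step (ii): solve H2 r = c from the last row upward."""
    M = len(H2)
    c = [sum(h & b for h, b in zip(row, u)) % 2 for row in H1]
    r = [0] * M
    for i in reversed(range(M)):
        acc = c[i]
        for j in range(i + 1, M):        # redundancy bits already computed
            acc ^= H2[i][j] & r[j]
        r[i] = acc                       # assumes H2[i][i] == 1
    return u + r                         # systematic codeword [u; r]

# Hypothetical toy matrices, H = [H1, H2].
H1 = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
H2 = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
u = [1, 0, 1]
x = encode_triangular(H1, H2, u)
H = [a + b for a, b in zip(H1, H2)]
syndrome = [sum(h & b for h, b in zip(row, x)) % 2 for row in H]
```

In GF(2) the condition H1 u + H2 r = 0 is the same as H2 r = c, which is why the substitution starts from the parity check values c.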
To summarize, matrices P or (H1, H2) are used to perform the encoding, depending on the chosen encoding procedure. Those matrices are usually composed of general circulant blocks with large weights, typically of the order of L/2, where L is the circulant size. From the hardware complexity standpoint, it is advantageous to design QC-LDPC codes for which the corresponding encoding matrix is as sparse as possible:
It is therefore highly beneficial to consider an encoding procedure where the encoding matrices are sparse, composed of circulant blocks with low weights. The sparseness of the encoding matrix is not ensured by the classical encoding methods. We propose in this invention a method to use a very sparse representation of the encoding matrix, in order to obtain an encoding architecture which has both low complexity and high throughput.
Layers and Generalized Layers of QC-LDPC Codes
For a base matrix B of size (Mb, Nb), the rows (or block-rows of H) are referred to as horizontal layers (or row layers). The concept of layer has been introduced originally for the case of regular array based QC-LDPC codes, where the base matrix is full of CPMs, and contains no α−∞, i.e., no all-zero L×L block.
The concept of layers in a base matrix can be further extended to the concept of generalized layer (GL). The definition follows:
This definition ensures that for a QC-LDPC code with maximum column degree γ, the PCM can be organized with at least γ GLs. For simplicity, we will assume that the number of GLs is always equal to the maximum column degree γ.
In a PCM with a generalized layers structure, the block-rows of each generalized layer may be organized in an arbitrary order, and do not necessarily contain consecutive block-rows. In
The purpose of the GL organization of the PCM is to be able to perform processing of at least γ PCM blocks in parallel, without data access conflicts. This allows a larger degree of parallelism in the encoder/decoder algorithms, and leads to higher processing throughputs. Although the concept of layers and generalized layers for QC-LDPC codes has been introduced in the literature to improve the efficiency of the LDPC decoders, we show in this disclosure that such structure can also be beneficial for the QC-LDPC encoder.
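The property that makes conflict-free parallel processing possible, namely that a generalized layer holds at most one non-zero block in each block-column, can be checked with a short sketch. The base matrix and the helper name are hypothetical, and `None` is an assumed marker for an all-zero block.

```python
def is_valid_gl_partition(B, layers):
    """Each GL (a list of block-row indices) may hold at most one
    non-zero block per block-column."""
    for rows in layers:
        for j in range(len(B[0])):
            if sum(B[i][j] is not None for i in rows) > 1:
                return False
    return True

# Hypothetical base matrix with maximum column degree gamma = 2.
B = [
    [0,    None, 1,    None],
    [None, 2,    3,    None],
    [None, 1,    None, 2],
    [4,    None, None, 0],
]
interleaved = [[0, 2], [1, 3]]     # GLs made of non-consecutive block-rows
consecutive = [[0, 1], [2, 3]]     # block-rows 0 and 1 collide in column 2
ok_interleaved = is_valid_gl_partition(B, interleaved)
ok_consecutive = is_valid_gl_partition(B, consecutive)
```

This example also shows why the block-rows of a GL need not be consecutive: for this matrix only the interleaved grouping satisfies the constraint.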
QC-LDPC with Minimum Row-Gap Between Circulants
The organization of H in generalized layers is proposed to improve the throughput of the encoder. However, in order to reach very high throughputs, the PCM needs to have other constraints, with respect to the location of the CPM in each block-row. This constraint, that will be used in the description of this invention, is called gap constraint, and reflects the number of all-zero blocks that are between two consecutive CPM in the same block-row of H. On
Let bi be the vector which stores the positions of nonzero blocks in the i-th row of the base matrix B. The vector bi is a length dc(i) vector, such that Bi,b
$$\mathrm{gap}_{i,k} = b_{i,k+1} - b_{i,k}, \qquad 1 \le k \le d_c(i)-1 \tag{9}$$
For example, for the matrix in (4), we have b3=(1, 2, 4, 5), and gap3=(1, 2, 1).
For the whole QC-LDPC matrix, the minimum gap is defined as:
$$\mathrm{gap}^* = \min_{1 \le i \le M_b,\; 1 \le k \le d_c(i)-1} \mathrm{gap}_{i,k} \tag{10}$$
By definition, any QC-LDPC code has minimum gap*≥1. As explained later in this disclosure, using QC-LDPC codes with a constraint on the minimum gap*>1 makes it possible to derive encoders with larger throughputs. The gap constraint gap*>1 can be enforced by direct design of the PCM or the base matrix of the QC-LDPC code, or by re-arrangement of the columns.
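The gap profile of a base matrix can be computed with a minimal sketch such as the one below. The rows are hypothetical; the first one only reproduces the position vector (1, 2, 4, 5) quoted in the example above, and `None` is an assumed marker for an all-zero block.

```python
def row_gaps(block_row):
    """Differences between consecutive positions (1-based) of non-zero blocks."""
    pos = [j + 1 for j, k in enumerate(block_row) if k is not None]
    return [b - a for a, b in zip(pos, pos[1:])]

def min_gap(B):
    """Minimum gap over all block-rows, or None if no row has two blocks."""
    gaps = [g for block_row in B for g in row_gaps(block_row)]
    return min(gaps, default=None)

row = [3, 0, None, 5, 1]                 # non-zero blocks at positions 1, 2, 4, 5
gaps = row_gaps(row)
B = [[0, None, 1, None, 2],
     [None, 3, None, None, 4]]           # hypothetical matrix with gap* = 2
g_star = min_gap(B)
```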
The present invention relates to encoding of LDPC codes. In particular, it relates to encoding of LDPC codes whose parity check matrices are organized in blocks. More specifically, it relates to block matrices that are circulant permutation matrices, in which case the code is called a quasi-cyclic (QC) LDPC code.
The present invention relates to a method and hardware apparatus to implement LDPC encoders in order to achieve both high encoding throughput as well as low hardware resource usage. This is achieved by imposing various constraints on the structure of the parity check matrix and tailoring the hardware processing units to efficiently utilize these specific structural properties, in the process of computing the redundancy bits from the information bits, and combining them into a codeword.
The method of the present invention is directed to utilizing the organization of the parity check matrices of QC-LDPC codes into block-rows and block-columns. The block-rows of the parity check matrix are referred to as layers.
The method of the present invention can be used to implement encoding algorithms and devise hardware architectures for QC-LDPC codes. In the class of codes whose parity check matrices also contain all-zero blocks, codes with regular or irregular distribution of all-zero blocks and non-zero blocks inside the parity check matrix can be encoded using the present invention.
Additionally, in some embodiments, the method of the present invention imposes the constraints with respect to the relative locations of the non-zero blocks in each block-row. More specifically, the constraints relate to the minimal number of all-zero blocks that must separate any two consecutive non-zero blocks in any given block-row. This separation is referred to as the gap constraint. The advantage of said gap constraint is in improving processing throughput.
Additionally, in some embodiments, the method of the present invention organizes the parity check matrix into generalized layers, composed of the concatenation of several block-rows. Each generalized layer of the parity check matrix contains at most one non-zero block in each block-column. The advantage of generalized layers is that they allow parallelization of processing of the blocks of the parity check matrix without data access conflicts, which in turn increases the encoding throughput.
In some embodiments, each generalized layer contains the same number of block-rows, and in preferred embodiments, the parity check matrix has as many generalized layers as the maximum block-column degree. The block rows that compose the generalized layers can be chosen in an arbitrary way: consecutive, interleaved or irregular.
The method of the present invention achieves high throughput and low hardware usage by imposing an additional specific structure, called the three-region structure, to the parity check matrix, and utilizing a method for computing redundancy bits from information bits adapted to this specific structure.
The three-region structure of the parity check matrix described in this invention is a decomposition of said matrix into three regions, (R1, R2, R3), wherein each region is processed by a specific method and a specific apparatus, while the apparatuses for processing different regions share a common hardware architecture. The present invention relates to a method and hardware apparatus to implement said three-region processing.
Various features of the present invention are directed to reducing the processing complexity with a hardware-efficient encoder apparatus by utilizing the three-region structure of the QC-LDPC code. In addition, it utilizes the gap constraint as well as the composition, structure and position of generalized layers. From an implementation standpoint, the three-region encoding method of the present invention has the advantage of reducing the amount of resources required to implement the encoder.
In the present invention, the Nb block-columns of the parity check matrix are partitioned into three regions R1, R2 and R3, defined by sub-matrices with Nu, Nt and Ns block-columns, respectively. In the present invention, the sub-matrix in the region R2 is block-upper triangular, and the sub-matrix in the region R3 contains an invertible Ns×Ns block matrix S concatenated on the top of a (Mb−Ns)×Ns all-zero block-matrix. In some embodiments, additional structures are imposed on the three regions. In some embodiments, the sub-matrix in the region R1 is organized into said generalized layers, and has a minimum gap constraint of at least 2 in all its block-rows. In some embodiments, the sub-matrix in the region R2 is also organized into generalized layers, and has a minimum gap constraint of at least 2 for all the non-zero blocks of its diagonal.
In accordance with one embodiment of the present invention, the computation of redundancy bits involves computation of parity check bits, wherein blocks of information bits are processed in parallel, producing Mb blocks of parity check bits. These Mb blocks of parity check bits are initialized by multiplying the information vector with the sub-matrix in region R1, and then processed using the sub-matrix of region R2. The upper triangular structure of the sub-matrix in the region R2 facilitates a recursive computation of the parity check bits in region R2. Once the processing for regions R1 and R2 is finished, a small portion of Ns blocks of parity check bits is used to compute the last set of redundancy bits. One feature of the present invention is that the size Ns of the invertible matrix S in region R3 is much smaller than the number of block-rows, which further reduces the computational resources needed to multiply its inverse S−1 with the Ns blocks of parity check bits, a multiplication that yields the remaining set of Ns blocks of redundancy bits.
Additional features of the present invention are directed to reducing the hardware resources and delay of the storage and processing of the inverse S−1. While all the submatrices in all regions are sparse, including the matrix S, the method of the present invention ensures that the inverse matrix A=S−1 is also sparse. This feature additionally reduces storage requirements of hardware for processing of the matrix A.
Since the matrix S contains blocks of circulant permutation matrices and all-zero blocks, its inverse A is composed of circulant matrices and all-zero blocks, but the weight of circulant matrices in A may in general be large, usually of the order of half the circulant size. The method of the present invention is directed to using matrices whose inverses have low weight, much smaller than the circulant size.
One embodiment of the present invention capitalizes on the fact that the maximum circulant weight in the j-th block-column of A is no larger than wmax,j. This allows the j-th block-column of A to be processed sequentially, performing only multiplications with circulant permutation matrices.
One feature of the present invention is that the encoding method can be implemented in hardware using a single architecture of the circuit, for the three regions R1, R2 and R3. This implementation allows a maximum hardware re-use, and therefore minimizes the amount of resources needed to build the encoding circuit. In each region, the inputs, outputs, and control instructions of the hardware model are, however, different.
It will be seen in the detailed disclosure that the objects set forth above are efficiently attained and, because some changes may be made in carrying out the proposed method without departing from the spirit and scope of the invention, it is intended that all matter contained in the description and shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.
For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:
Sparse Structure of the Parity Check Matrix H
In the present invention, the parity check matrix of a QC-LDPC code is organized in the following form
$$H = \left[\, U,\; T,\; \begin{bmatrix} S \\ 0_{(M_b-N_s)\times N_s} \end{bmatrix} \right] \tag{11}$$
where 0(M
Similarly to the PCM H, the matrices (U, T, S) are composed of circulant blocks, and for ease of presentation, the dimensions of the submatrices are indicated in number of circulant blocks. This means that, following the notations of paragraph 0004,
$$M_b = M_u = M_t = N_t + N_s \tag{12}$$
$$N_b = N_u + N_t + N_s \tag{13}$$
$$K_b = N_u \tag{14}$$
The PCM is organized into three regions:
An example of PCM with such structure is shown in
The present invention also covers any PCM derived from the structure shown in (11) by block-row and/or block-column permutation.
Organization of the PCM in Generalized Layers, with Minimum Gap
In a preferred embodiment of this invention, the PCM H is organized in generalized layers, which in turn imposes the constraint that the submatrices (U, T, S) are also organized in generalized layers.
For illustration purposes, we follow the organization of the interleaved generalized layers presented in paragraph 0007. Other arrangements are possible, and any other organization of the block-rows in GLs obtained through permutations of block-rows is also covered by this invention. As a non-limiting illustrative example, the structure of a PCM organized in GLs is shown in
In the example of figure
[Hγi+k,1,Hγi+k,2, . . . ,Hγi+k,N
The same block-row indices in matrices (U, T, S) compose the generalized layer GLk for the corresponding submatrices.
In preferred embodiments of this invention, the matrices U and T further have gap constraints on their block-rows to help improve the throughput of the encoder. For the matrix U, we assume that each and every block-row has a minimum gap≥2. For the matrix T, we assume that, in each and every block-row, the gap between the diagonal block and the previous block is at least 3. In the example of
The combination of the GL organization and the gap constraints makes it possible to derive a high-throughput architecture for region R1 and region R2 encoding.
Codeword Structure
The codeword x can be written as
where xj, 1≤j≤Nb is a binary column vector of length L. Now the equation H x=0 can be written as
All the operations in this disclosure are described as block computation in GF(2), that is, using only L×L circulant blocks, e.g. Hi,j, and binary vectors of length L, e.g. xj. In particular, the multiplication of xj by a CPM is equivalent to a simple circular shift of the binary values in xj. With the definitions in paragraph 0004, the PCM and the exponents in the base matrix B are related as
We have:
The hardware implementation of operation (18) is usually done using Barrel Shifters, such that Hi,j xj is the barrel-shifted version of xj.
In the present invention, the codewords are further organized following the structure of the PCM described in paragraph 0036. We re-write Equation (16) as
where u is the vector of information bits,
is the vector of redundancy bits, and
The binary column vectors uj, tj, and sj are of length L, and for clarity, Nu=Kb is the length of the information vector (the number of length-L sub-vectors in u), and Nt+Ns=Mb is the length of the redundancy vector.
Parity Checks Vector Structure
The encoding process involves computation of the vector of parity checks, denoted by c:
Similar to the structure of the codeword vector, the vector of parity checks is also defined at the block level, i.e. ci is a vector of L binary values. The parity checks are used to compute the vector of redundancy bits, and at the end of the encoding process, we must have H x=c=0.
Let us further define the vector of temporary parity checks, denoted c(j), 1≤j≤Nb, which contains the values of the parity checks using only the first j columns of H:
The vector of temporary parity checks will be useful for explaining the three sequential steps of the encoding process. Note that from the definition of the PCM, the final temporary parity checks vector is equal to the all zero vector, c(Nb)=H x=0.
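The accumulation of temporary parity checks over block-columns can be sketched as follows. The base matrix, the codeword blocks and the function names are hypothetical, `None` is an assumed marker for an all-zero block, and the shift direction follows the CPM convention stated earlier (nonzero entry of the first column of αk in row k+1).

```python
def shift(x, k):
    """alpha^k x as a cyclic shift: entry i of the result is x[(i - k) mod L]."""
    L = len(x)
    return [x[(i - k) % L] for i in range(L)]

def temporary_parity_checks(B, x_blocks, j):
    """c^(j): parity checks accumulated over the first j block-columns of H."""
    L = len(x_blocks[0])
    c = [[0] * L for _ in B]
    for n in range(j):
        for i, block_row in enumerate(B):
            if block_row[n] is not None:
                s = shift(x_blocks[n], block_row[n])
                c[i] = [a ^ b for a, b in zip(c[i], s)]
    return c

# Hypothetical base matrix (Mb = 2, Nb = 4, L = 4) and a matching codeword.
B = [[0, 1, 0, None],
     [2, None, None, 0]]
x_blocks = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 0]]
c2 = temporary_parity_checks(B, x_blocks, 2)
c4 = temporary_parity_checks(B, x_blocks, 4)   # all blocks processed
```

Because `x_blocks` was chosen to be a codeword of this toy matrix, the final vector `c4` is all-zero while the intermediate vector `c2` is not.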
Parity Checks Vector Structure with Generalized Layers
In a preferred embodiment of this invention, the PCM is organized in γ GLs. This means that the vector of temporary parity checks can be partitioned in γ parts, each and every part corresponding to one GL. As an illustrative example, the vector of temporary parity checks c(j) corresponding to the PCM of figure
because according to
In this preferred embodiment, each part of the temporary parity checks vector can be stored in a separate memory, and computed using Equation (23) using a separate circuit.
Encoding Using Structure of H
In this paragraph, we describe the global encoding process that constitutes the main purpose of this invention, and we describe each step in detail in the next paragraphs.
According to Equation (11), the PCM is composed of three regions (R1, R2, R3), which results in the decomposition of the encoding process into three steps, as explained below.
With the notations of paragraphs 0036 and 0038, Equation (17) can be now written as
The Kb blocks u1, u2, . . . , uK
The three sequential steps of the encoding process are defined as follows:
After the completion of the above three steps, the vector of redundancy bits [t; s] is assembled from the sub-vectors t and s and the encoded codeword x=[u; t; s] is obtained.
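The three-step flow can be sketched at the block level as follows. This is only an illustration under several assumptions that are not stated in the text above: `None` marks an all-zero block, the block-row used to solve tj is taken to be row Mb−j+1 (so that the last block-row yields t1), region R3 is reduced to Ns=1 with S a single CPM, and all matrices and names are hypothetical.

```python
def shift(x, k):
    """alpha^k x as a cyclic shift; a negative k applies the inverse CPM."""
    L = len(x)
    return [x[(i - k) % L] for i in range(L)]

def accumulate(c, column, block):
    """c_i += alpha^(column[i]) * block for every non-zero entry of a block-column."""
    for i, k in enumerate(column):
        if k is not None:
            c[i] = [p ^ q for p, q in zip(c[i], shift(block, k))]

def encode_three_regions(U, T, s_shift, u, L):
    Mb, Nt = len(U), len(T[0])
    c = [[0] * L for _ in range(Mb)]
    for j, uj in enumerate(u):                  # region R1: c = U u
        accumulate(c, [row[j] for row in U], uj)
    t = []
    for j in range(Nt):                         # region R2: substitution on T
        d = Mb - 1 - j                          # ASSUMED pivot block-row of t_j
        tj = shift(c[d], -T[d][j])              # invert the diagonal CPM
        t.append(tj)
        accumulate(c, [row[j] for row in T], tj)
    s = [shift(c[0], -s_shift)]                 # region R3 with Ns = 1
    return u + t + s                            # codeword blocks [u; t; s]

L = 4
U = [[1, None], [0, 2], [None, 3]]              # hypothetical, Mb = 3, Nu = 2
T = [[2, 3], [1, 0], [0, None]]                 # hypothetical, Nt = 2
s_shift = 1                                     # S = alpha^1
u = [[1, 0, 0, 1], [0, 1, 1, 0]]
x = encode_three_regions(U, T, s_shift, u, L)

# Verification: H x = 0 with H = [U, T, [S; 0]].
B = [U[i] + T[i] + [s_shift if i == 0 else None] for i in range(len(U))]
syndrome = [[0] * L for _ in B]
for n, xn in enumerate(x):
    accumulate(syndrome, [row[n] for row in B], xn)
```

The same `accumulate` routine serves regions R1 and R2, which mirrors the idea of re-using one processing unit for several regions; a general S with Ns>1 would instead require multiplication by the circulants of S−1.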
Reusable Hardware for Encoding of Regions R1, R2 and R3
As mentioned in paragraph 0041, the encoder is decomposed into three steps, each one based on a different structure of the corresponding sub-matrix of H. For the efficient hardware realization of the encoder, it is desirable to design an architecture which uses the same hardware units for all regions, since this allows the same circuit to be re-used for the sequential processing of the three steps.
This invention proposes a single architecture for all three regions of the encoder, which is described in this paragraph. This single architecture is specifically oriented toward the preferred embodiment of the invention described in paragraph 0037, i.e. when the PCM is organized in γ GLs, and the minimum gap of matrix U is gap*≥2.
The core input 414 is composed of data, whose nature depends on the region being processed. In region R1, the inputs to the core are the information bits u. In regions R2 and R3, the core input is not used. The multiplexer block 401 is in charge of selecting the correct data to be sent to the merger block 402, which could be either new input data for region R1 processing, or the data from the recycle stream 413 for region R2 and R3 processing.
The merger block 402 processes the instructions received from the instruction stream 415 and associates them with data coming from 401. The merger block contains a set of registers to store the data, in case they need to be used by several instructions. An instruction may also indicate that the merger block should pause for several clock cycles without consuming more instructions.
Each accumulator is in charge of a different GL, i.e. stores and updates its part of the data, and therefore contains a memory which stores L Mb/γ bits, where Mb/γ is the number of block-rows in a GL. The processed data is not the same in all regions: in regions R1 and R2, the data stored in the accumulator are temporary parity check bits c(j), while in region R3, the data are redundancy bits s. Thanks to the organization of the PCM in GLs, one and only one block of data (L bits) is accessed in parallel in each accumulator, and so memory 411 has a width of L and a depth of Mb/γ. We refer to the contents of address a in memory k as memk,a, 1≤k≤γ. The inputs/outputs to an accumulator are then the following:
The information din and b go to the barrel shifter 410 which produces dbs=αb din, where αb is the CPM being processed. According to the definition of the base matrix B (4), the shift value b is equal to the exponent bj,i for the i-th column and the j-th row of the PCM. The address a is read from the memory 411 to get memk,a which is then XOR-ed with the barrel shifted data to obtain the accumulated binary values dout=memk,a+dbs. The computed data dout is written back to the same address in the memory and serves as output to the unit. The pipes 408 and 409 may or may not be necessary to keep the pipeline lengths matching for different paths.
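The behaviour of one accumulator can be modelled in software with a minimal sketch. The class and method names are hypothetical, the pipeline registers are not modelled, and the shift direction follows the CPM convention stated earlier.

```python
class Accumulator:
    """Software model of one accumulator: barrel shifter, memory and XOR."""

    def __init__(self, L, depth):
        self.L = L
        self.mem = [[0] * L for _ in range(depth)]   # width L, depth Mb / gamma

    def step(self, d_in, b, a):
        """Process one block: d_out = mem[a] + alpha^b d_in over GF(2)."""
        d_bs = [d_in[(i - b) % self.L] for i in range(self.L)]   # barrel shifter
        d_out = [m ^ d for m, d in zip(self.mem[a], d_bs)]       # XOR with memory
        self.mem[a] = d_out                                      # write back
        return d_out

acc = Accumulator(L=4, depth=2)
out1 = acc.step([1, 0, 0, 0], b=1, a=0)
out2 = acc.step([1, 1, 0, 0], b=0, a=0)    # accumulates onto the same address
```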
The instructions 415 received by the core control how the data are processed. An example of the instruction structure is given with the following items:
Encoding of Region R1
The encoding step corresponding to region R1 consists of the computation of the temporary parity checks c(K
This computation can be done sequentially, column by column, from column j=1 to column j=Kb of U. Since H is sparse, U is sparse as well, and the j-th column of U is composed of dv(j) non-zero blocks. Let (i1, . . . , id
The values of the temporary parity checks along the encoding process do not need to be stored, as only the last values c(K
In a preferred embodiment of the invention, the PCM is organized in γ interleaved generalized layers, such that the dv(j) non-zero blocks in each and every column of U belong to a different generalized layer. A simple example of the preferred embodiment is when the QC-LDPC code is strictly regular, with dv(j)=dv=γ. In this case, the matrix U is formed with dv generalized layers, each one containing exactly one non-zero block per column. Under this preferred embodiment, the dv(j) computations of Step-c in Algorithm 1 can be performed in parallel, using a multiplicity of dv(j) circuits, without any memory access violation. Further details about the hardware realization of this step are described in the next paragraph.
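A software sketch of the region R1 computation with one memory per generalized layer is given below. It assumes an interleaved layout in which block-row i (counted from zero) belongs to GL i mod γ and is stored at address i div γ; this mapping, the matrix U and all names are assumptions made for illustration. The updates of one column are written sequentially here, but since each touches a different memory they could run in parallel.

```python
def region_r1(U, u, gamma, L):
    """Accumulate c = U u, storing each block-row in the memory of its GL."""
    Mb = len(U)
    mems = [[[0] * L for _ in range(Mb // gamma)] for _ in range(gamma)]
    for j, uj in enumerate(u):
        busy = set()                              # GLs already used by column j
        for i in range(Mb):
            b = U[i][j]
            if b is None:
                continue
            k, a = i % gamma, i // gamma          # ASSUMED interleaved GL mapping
            if k in busy:
                raise ValueError("column %d accesses GL %d twice" % (j, k))
            busy.add(k)
            d_bs = [uj[(n - b) % L] for n in range(L)]          # barrel shift
            mems[k][a] = [m ^ d for m, d in zip(mems[k][a], d_bs)]
    return mems

L, gamma = 4, 2
U = [[0, None, 1],
     [None, 2, 3],
     [None, 1, None],
     [3, None, None]]                             # hypothetical, Mb = 4, Kb = 3
u = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
mems = region_r1(U, u, gamma, L)
```

The `ValueError` branch makes the role of the GL constraint explicit: a column with two non-zero blocks in the same GL would need two accesses to one memory in the same step.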
Hardware Realization of Encoding of Region R1
In this region, the data which are processed by the accumulators (503, 504, 505, 506) are temporary parity checks, c(j), 1≤j≤Kb. Each one of the γ accumulators stores and updates its part of the temporary parity check bits. For the example of GL organization given in
The process of encoding for region R1 is described in Algorithm 1. The parity check values c(j) are split into γ parts, each one being associated with one of the γ GLs. The values in c(j) corresponding to a specific GL are not necessarily consecutive, and are arranged in accordance with the GL organisation of the PCM (refer to paragraph 0007). Recall that the number of non-zero blocks in each column j is no larger than the number of GLs, dv(j)≤γ, and that each of the non-zero blocks belongs to a different GL. As a consequence, all the parity check bits for the j-th column,
can be updated in parallel, with no memory access violation, both for reading and writing.
During the processing of column j in region R1 the information bits uj arrive as input to the encoder core 514. For the j-th column, we indicate by bk,j the shift value corresponding to the single non-zero block in the k-th GL, such that
The address in memory 511 for the temporary parity checks being processed by the k-th accumulator for column j is denoted by ak,j. Both the shift values bk,j, and addresses ak,j arrive as part of the instruction 515. uj is sent to all γ accumulators. While the first accumulator 503 receives b1,j and a1,j, the second accumulator 504 receives b2,j and a2,j and so on. The accumulators use the information bits, shift values, and addresses to update the values of ci
The critical path during R1 is equal to the number of clock cycles taken to read the memory 511, do the XOR with the barrel shifted data, and write the updated value back to the memory. If this process takes τ1 clock cycles, it follows that a minimum gap of gap*=τ1 in matrix U is necessary to achieve maximum throughput. A value of gap*=2 is usually sufficient to achieve high frequency for the synthesized architecture.
Encoding of Region R2
In the region R2, the encoder computes the first set of redundancy bits t, as well as the updated values of the temporary parity check vector c(K
Since the temporary parity check c(K
In the last Ns block-columns of the PCM, only the first Ns block-rows contain non-zero blocks, and the Mb−Ns last block-rows of these columns are all-zero. As a result, the last Nt=Mb−Ns equations of the system (27) do not depend on s nor on S, and may be solved using solely the knowledge of T and c(K
As can be seen from Equation (28), the last equation of this system yields the first vector of redundancy bits t1 with a very simple operation.
t1=TM
where TM
Still using the upper-triangular property of T, we can proceed recursively from j=1 to j=Nt to compute the set of redundancy bits tj, 1≤j≤Nt. Moreover, from the definition of the vector of temporary parity checks, the j-th equation can be simplified in the following way:
Using Equations (28) and (30), one can deduce the process for encoding in region R2, described in Algorithm 2. Similarly to the encoding in region R1, the matrix T is composed of a subset of columns of the matrix H, and therefore is sparse. The j-th column of T is composed of dv(j) non-zero blocks. Let (i1, . . . , id
The values of the temporary parity checks along the encoding process do not need to be stored, as only the last values c(K
In a preferred embodiment of the invention, the PCM is organized in γ interleaved generalized layers, such that the dv(j) non-zero blocks in each and every column of U belong to different generalized layers. An example of the preferred embodiment is when the QC-LDPC code is strictly regular, with dv(j)=dv=γ. In this case, the matrix T is formed of dv generalized layers, each one containing exactly one non-zero block per column. Under this preferred embodiment, the dv(j) computations of Step-c in Algorithm 2 can be performed in parallel, using a multiplicity of dv(j) circuits, without any memory access violation. Further details on the hardware realization of this step are described in the next paragraph.
Hardware Realization of Encoding of Region R2
To simplify the description of the hardware for the encoding of region R2, we assume without loss of generality that the CPM on the diagonal of T are identity matrices. Note that one can easily enforce this constraint for any QC-LDPC code following the structure of (11), by simple block-row and block-column transformations. With this constraint Step-a in Algorithm 2 simplifies to
where j is the column index within region R2.
Since the temporary parity checks c are stored in the accumulator memories, outputting the values of the L redundancy bits tj simply requires reading the correct address from the correct memory and outputting those L bits. The rest of the processing for region R2 reduces to updating the values of ci
During stage 2a, the first instruction contains the memory addresses ak,j of
used in the k-th accumulator, and an indicator telling in which memory, i.e. in which accumulator, the values are stored. This last instruction is used by Mux B 607. In the example drawn in
goes in the recycle stream 613, where it is acted upon by the second instruction.
During stage 2b, the second instruction tells to output
on the core output 617 and also send it as input to the γ accumulators, where together with the addresses ak,j and shift values bk,j from the second instruction, it is used to update the
values.
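The two stages of region R2 amount to a block back-substitution over GF(2). The following Python sketch illustrates this under stated assumptions: T is taken to be block upper-triangular in the textbook sense with identity blocks on the diagonal, so the columns are visited from last to first, which may not match the index ordering used in Equation (28). The names `encode_r2`, `T_shifts` and `dense` are hypothetical, and the shift values are arbitrary.

```python
import numpy as np

def encode_r2(T_shifts, c, L):
    """Solve T t = c over GF(2) for a block upper-triangular T with identity diagonal.

    T_shifts[i][j] is the CPM shift of block (i, j), or None for an all-zero block.
    c is a list of length-L uint8 vectors (temporary parity checks after region R1).
    """
    n = len(T_shifts)
    c = [v.copy() for v in c]            # work on a copy of the accumulator contents
    t = [None] * n
    for j in reversed(range(n)):
        t[j] = c[j].copy()               # stage 2a: diagonal block is identity, so t_j is a plain read
        for i in range(j):               # stage 2b: feed t_j back into the other parity checks
            b = T_shifts[i][j]
            if b is not None:
                c[i] ^= np.roll(t[j], -b)
    return t

def dense(T_shifts, L):
    """Dense version of T, used only to check the result."""
    n = len(T_shifts)
    T = np.zeros((n * L, n * L), dtype=int)
    for i in range(n):
        for j in range(n):
            if T_shifts[i][j] is not None:
                T[i * L:(i + 1) * L, j * L:(j + 1) * L] = np.roll(
                    np.eye(L, dtype=int), T_shifts[i][j], axis=1)
    return T

L = 4
T_shifts = [[0, 1, None],
            [None, 0, 3],
            [None, None, 0]]
rng = np.random.default_rng(1)
c0 = [rng.integers(0, 2, size=L, dtype=np.uint8) for _ in range(3)]
t = encode_r2(T_shifts, c0, L)
check = (dense(T_shifts, L) @ np.concatenate(t)) % 2
```

The vector `check` can be compared with the concatenation of `c0` to confirm that the computed redundancy bits satisfy the system.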
The time between stage 2a and stage 2b processing depends on the length of the pipeline given by the thick solid line. To achieve high throughput, the first instruction and second instruction for different columns are interleaved. Table 1 shows an example of how the instructions might be interleaved if the pipeline for the first instruction takes three clock cycles.
To allow this interleaving it is desirable that the pipeline for the first instruction takes an odd number of clock cycles.
In region R2, the time τ1 to read and then update the memory is not the critical path anymore. The time τ2 between reading a temporary parity check associated with the diagonal and updating a parity check stored in a different memory becomes dominant. This path goes from the memory 611 out of the accumulator 606, through Mux B 607, Mux A 601, merging block 602, back into a different accumulator, through the pipe 608, barrel shifter 610, XOR 612 and finally back to the memory 611. The final writing in the memory must be completed before reading the diagonal parity for that memory address. For example, if τ2 is 5 clock cycles, then a gap=3 is necessary between the diagonal block and the other blocks of the same block-row for the throughput to be maximum. Indeed, a path which takes 5 clock cycles imposes a gap constraint of gap=3 on the corresponding block-row, since two instructions are required for every column. A gap=3 means that there will be 6 clock cycles between the first instruction of the first column and the first instruction of the next column which requires that updated data. These 6 clock cycles are sufficient for a pipeline of length τ2=5.
For blocks outside the diagonal, the gap constraint is more relaxed. If the pipeline length for τ1 is 2 clock cycles, then no gap constraint is imposed since we already need 2 clock cycles to process each column. As mentioned in paragraph 0044, a pipeline length of 2 clock cycles for τ1 is usually sufficient to achieve high frequency for the synthesized architecture.
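The gap arithmetic used in the two preceding paragraphs can be written as a one-line rule. The helper below is a hypothetical illustration, assuming that region R1 issues one instruction per column and region R2 issues two, and that a gap of 1 (adjacent columns) is what the text means by "no gap constraint".

```python
import math

def min_gap(pipeline_cycles, instructions_per_column):
    """Smallest column gap g such that g * instructions_per_column >= pipeline_cycles."""
    return math.ceil(pipeline_cycles / instructions_per_column)

gap_r1 = min_gap(2, 1)            # region R1 with tau1 = 2, one instruction per column
gap_r2_diagonal = min_gap(5, 2)   # region R2 diagonal path with tau2 = 5
gap_r2_other = min_gap(2, 2)      # region R2 off-diagonal blocks with tau1 = 2
```

With these inputs the rule gives a gap of 2 in region R1, 3 for the diagonal path in region R2, and 1 for the other blocks of region R2, in line with the values quoted in the text.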
Encoding of Region R3
In the region R3, the encoder computes the last part of the codeword, i.e. the vector of redundancy bits s, using:
The first Ns equations of the system in Equation (31) are sufficient to get those values:
where
gather the first Ns values of vector c(K
In this disclosure, the matrix S is assumed to have full rank, and therefore is invertible. Let A=S−1 denote its inverse. The vector of redundancy bits s is computed using:
As explained in paragraph 006, A is composed of possibly large weight circulant blocks. Let wi,j denote the weight of the circulant in Ai,j. The typical weight of an inverse of a general circulant is close to L/2, which could lead to a large hardware complexity and/or a low encoding throughput. Although this invention applies to arbitrary matrices with any inverse weight, the efficiency of our encoder relies on matrices S for which the inverse is composed of low weight circulants, i.e. wi,j<<L/2, 1≤i, j≤Ns. We obtain such low weight inverse matrix by an optimized selection of the CPM shifts in S. Some examples are given in paragraphs 0049 and 0050.
This invention describes how to utilize a sparse representation of matrix A by relying on a special technique, described as follows. First, A is expanded into a rectangular matrix E, composed only of CPMs. Let us recall that A is composed of general circulant blocks, with weights wi,j, 1≤i,j≤Ns. Each block in A can be written as:
where Ai,j(k), ∀k is a CPM.
Let wmax,j=maxi {wi,j} be the maximum circulant weight in block-column j of A, 1≤j≤Ns. When the weight of a circulant is smaller than the maximum weight of the corresponding column, we further impose that Ai,j(k)=0, for wi,j≤k≤wmax,j, such that:
The expanded matrix E is composed of the concatenation of columns deduced from Equation (35):
and we denote by Ej the expansion of the j-th block-column of A, such that E=[E1, . . . , EN
An example of the expansion from A to E is shown in figure
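As an illustration of the expansion, the Python sketch below represents each circulant of A by the list of CPM shifts whose sum it is, pads shorter lists with all-zero blocks up to the column maximum, and checks that multiplying by E gives the same result as multiplying by A. The function names, the data layout and the shift values are assumptions made for this example only.

```python
import numpy as np

def apply_cpm(shift, v):
    """Multiply a CPM by v (assumed convention: output[r] = v[(r + shift) % L])."""
    return np.roll(v, -shift)

def expand(A_shifts):
    """Expand A into E. A_shifts[i][j] lists the CPM shifts forming circulant A[i][j].

    Returns E as rows of shifts (None marks an all-zero block) and the list wmax.
    """
    Ns = len(A_shifts)
    wmax = [max(len(A_shifts[i][j]) for i in range(Ns)) for j in range(Ns)]
    E = [[] for _ in range(Ns)]
    for j in range(Ns):
        for k in range(wmax[j]):
            for i in range(Ns):
                blk = A_shifts[i][j]
                E[i].append(blk[k] if k < len(blk) else None)
    return E, wmax

def multiply_by_A(A_shifts, c, L):
    """Reference product s = A c over GF(2), circulant by circulant."""
    s = [np.zeros(L, dtype=np.uint8) for _ in A_shifts]
    for i, row in enumerate(A_shifts):
        for j, blk in enumerate(row):
            for shift in blk:
                s[i] ^= apply_cpm(shift, c[j])
    return s

def multiply_by_E(E, wmax, c, L):
    """Same product using E: the input c[j] is reused for wmax[j] consecutive block-columns."""
    s = [np.zeros(L, dtype=np.uint8) for _ in E]
    col = 0
    for j, w in enumerate(wmax):
        for _ in range(w):
            for i in range(len(E)):
                if E[i][col] is not None:
                    s[i] ^= apply_cpm(E[i][col], c[j])
            col += 1
    return s

L = 8
A_shifts = [[[0, 3], [1]],
            [[2], [4, 5, 6]]]      # circulant weights 2, 1, 1, 3
E, wmax = expand(A_shifts)
rng = np.random.default_rng(2)
c = [rng.integers(0, 2, size=L, dtype=np.uint8) for _ in range(2)]
s_from_A = multiply_by_A(A_shifts, c, L)
s_from_E = multiply_by_E(E, wmax, c, L)
```

Here E has Σj wmax,j = 5 block-columns, and every block of E is either a single CPM or an all-zero block.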
The encoding of the redundancy bits s can be made using E instead of A, since (33) can be written as:
where cj(K
For example in the matrix E of
In a preferred embodiment of the invention, the size of the matrix S is chosen to be a multiple of the number of generalized layers γ, Ns=l γ. This preferred embodiment allows the hardware developed for regions R1 and R2 to be re-used for region R3, as explained in the next paragraph. Two use cases of this preferred embodiment are described in paragraphs 0049 and 0050.
Hardware Realization of Encoding of Region R3
There are two steps to output the second set of redundancy bits s in region R3.
First the vector
is multiplied by the sparse representation E of inverse matrix A (37) and the resulting bits s are stored in the accumulator memories. We refer to this as stage 3a. Once computed, the values of s are read from the accumulator memories and output to 817. We refer to this second step as stage 3b.
During stage 3a, the values of s need to be iteratively updated and stored in the accumulators 803 to 806. In principle, one would need to use an additional memory to store s, but here we will capitalize on the special structure of the PCM (11). Indeed, at the end of region R2 processing, thanks to the upper triangular structure of T, the temporary parity checks
are all zero, and do not need to be updated or used during the processing of region R3. We propose in this invention to use Ns/γ addresses of the available memory in each of the γ accumulator memories 811, to store the values of s. For example, one could use the memory addresses which correspond to
The processing for stage 3a is similar to the processing in region R2. The core input 814 is not used, and the main difference is that after the value of cj is read from an accumulator memory, the values of s are updated wmax,j times, instead of one update in region R2. During the wmax,j updates corresponding to Step-c in Algorithm 3, the values of cj are held in registers in the merge module 802. In total, for stage 3a, there is then a sequence of Ns read instructions to retrieve the values of the temporary parity checks
and Σ1≤j≤N
As in region R2, we can also interleave the instructions to maximize the throughput, but with more pauses in the sequence because of the fact that there are more update instructions than read instructions. Table 2 shows an example of a sequence of interleaved instructions to calculate s, for the case of Ns=γ=4, when the inverse matrix A has maximum weights wmax,1=3, wmax,2=2, wmax,3=1 and wmax,4=4.
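To make the instruction count concrete, the sketch below lists the stage 3a instructions for the weights quoted above, without attempting to reproduce the interleaving of Table 2. It assumes one read instruction per block-column j followed by wmax,j update instructions; the tuple format and the function name `stage3a_instructions` are hypothetical.

```python
def stage3a_instructions(wmax):
    """Non-interleaved stage 3a sequence: one read per column, then wmax[j] updates."""
    seq = []
    for j, w in enumerate(wmax, start=1):
        seq.append(("read", j))
        seq.extend(("update", j, k) for k in range(1, w + 1))
    return seq

seq = stage3a_instructions([3, 2, 1, 4])
n_reads = sum(1 for ins in seq if ins[0] == "read")
n_updates = sum(1 for ins in seq if ins[0] == "update")
```

For these weights the sequence contains 4 read instructions and 10 update instructions, which is why pauses appear when reads and updates are interleaved.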
Once s is calculated and stored in the accumulator memories, stage 3b begins and the values of s are read from the accumulator memories and output to the stream 917. For each of the Ns values of s, one read instruction and one output instruction are required.
Region R3 Encoding with S of Size (Ns×Ns)=(γ×γ)
In a preferred embodiment of the invention, the matrix S has the size of the number of generalized layers Ns=γ. Let us recall that S must be full rank for its inverse to be computed. For this purpose, the CPM organization in S has to follow some constraints, and the values of the CPM Si,j also have to be chosen carefully.
In the preferred embodiment described in this paragraph, the CPM Si,j are chosen so that the following constraints are satisfied:
We give several examples with two different structures of such matrices, for the special case of γ=4. Generalizations to other values of γ or other organizations of the CPM in S can be easily derived from these examples. In these examples of the preferred embodiment, the matrix S has size 4×4 blocks, and therefore a minimum of 3 all-zero blocks are necessary to make it full rank.
The six CPMs (S2,3, S2,4, S3,2, S3,4, S4,2, S4,3) are chosen such that S4g has girth g, is full rank, and its inverse A has the lowest possible sum of maximum column weights, i.e. such that Σ_{j=1}^{4} wmax,j is minimum.
Six examples of the preferred embodiment following structures S3g and S4g are given in equations (41) to (46), for different values of the target girth g ∈ {6, 8, 10}, and a circulant size L=32. In each example, we report the CPMs chosen in matrix S, and the matrix of circulant weights W of its inverse. Generalizations to other circulant sizes L or other girths g are obtained using the same constrained design as described in this paragraph. Note that the given examples are not unique, and that there exist other CPM choices that lead to the same value of Σ_{j=1}^{4} wmax,j.
As one can see, the weights of the circulants in W are quite low, especially for girth g=6 or g=8.
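The weight matrix W of an inverse can be computed numerically for any candidate S. The Python sketch below does so with a plain Gaussian elimination over GF(2). The 3×3 block matrix used here is a made-up example, not one of the structures S3g or S4g, its girth is not checked, and the helper names `cpm`, `gf2_inverse` and `circulant_weights` are hypothetical.

```python
import numpy as np

def cpm(L, shift):
    """Dense L x L circulant permutation matrix, or all-zero block when shift is None."""
    if shift is None:
        return np.zeros((L, L), dtype=np.uint8)
    return np.roll(np.eye(L, dtype=np.uint8), shift, axis=1)

def gf2_inverse(M):
    """Invert a square binary matrix over GF(2) by Gauss-Jordan elimination."""
    n = M.shape[0]
    aug = np.concatenate([M.astype(np.uint8) % 2, np.eye(n, dtype=np.uint8)], axis=1)
    for col in range(n):
        pivots = np.nonzero(aug[col:, col])[0]
        if pivots.size == 0:
            raise ValueError("matrix is singular over GF(2)")
        p = col + pivots[0]
        if p != col:
            aug[[col, p]] = aug[[p, col]]      # bring the pivot row into place
        rows = np.nonzero(aug[:, col])[0]
        rows = rows[rows != col]
        aug[rows] ^= aug[col]                  # clear the column everywhere else
    return aug[:, n:]

def circulant_weights(A, L):
    """W[i][j] = number of ones in the first row of block (i, j) of A."""
    n = A.shape[0] // L
    return np.array([[int(A[i * L, j * L:(j + 1) * L].sum()) for j in range(n)]
                     for i in range(n)])

L = 8
shifts = [[0, 1, None],
          [2, 0, 3],
          [None, 4, 0]]                        # arbitrary illustrative shifts
S = np.block([[cpm(L, b) for b in row] for row in shifts])
A = gf2_inverse(S)
W = circulant_weights(A, L)
wmax = W.max(axis=0)                           # maximum circulant weight per block-column
```

A search over the free CPM shifts of S could call these helpers for each candidate and keep the one with the smallest sum of `wmax`, which is the criterion stated above.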
Region R3 Encoding with S of Size (Ns×Ns)=(lγ×lγ)
We present in this paragraph a preferred embodiment of the invention, when the size of the matrix S is l times the number of GLs, with l>1. For ease of presentation, we present only the case of γ=4 and l=2. One skilled in the art can easily extend this example to larger values of l, and other values of γ.
Although applicable to any matrix S of size (l γ×l γ), the implementation of region R3 described in paragraphs 0047 and 0048 can become cumbersome when the matrix S is large. This is due to the fact that the inverse of S when l≥2 is generally not sparse anymore, which leads to a large increase in computational complexity or a significant decrease of the encoding throughput.
Instead, we propose in this invention to impose a specific structure on the matrix S, which allows recursive use of the encoder presented in paragraph 0049 for γ×γ matrices. The proposed structure is presented in the following equation for γ=4:
where S′ and S″ are γ×γ matrices following the constraints of paragraph 0049, i.e. S′=S3g, or S′=S4g, and S″=S3g, or S″=S4g. This means in particular that S′, respectively S″, have both low weight inverses, denoted A′, respectively A″.
One can expand the inverses into their sparse representations, following Equation (36), to obtain matrices E′, respectively E″. Encoding the vector of redundancy bits s using the structure (47) therefore requires two recursive uses of the method that was presented in paragraph 0047:
and matrix E″, to obtain the vector of redundancy bits [s1; . . . ; sγ],
with the newly computed redundancy bits:
and matrix E′, to obtain the vector of redundancy bits [sγ+1; . . . ; s2γ],
At the end of these three steps, we finally obtain the vector of redundancy bits [s1; . . . ; sγ; sγ+1 . . . ; s2γ] to complete the codeword. The hardware realization of this preferred embodiment is based on the use of the hardware presented in paragraph 0048.
Encoding with Two Regions Only: R1 and R3
There are situations when the dimensions of the PCM are not sufficiently large to allow an organization in three regions as described in Equation (11). Indeed, when Ns=Mb, it results from equations (12)-(14) that Nt=0 and the region R2 does not exist. The structure of the PCM becomes:
and the codeword has only two parts, i.e. x=[u; s].
In such case, the invention presented in this disclosure consists of only two sequential steps:
The methods and hardware implementations related to regions R1 and R3 remain unchanged, and follow the descriptions of paragraphs 0043, 0044, 0047 and 0048. In the case of Ns=Mb=γ, the region R3 is implemented as in paragraph 0049, and in the case of Ns=Mb=lγ, the region R3 is implemented as in paragraph 0050.
This application claims the benefit of Provisional Application No. 62/714,607, filed Aug. 3, 2018, the entire contents of which are hereby incorporated by reference as if fully set forth herein.
Number | Date | Country | |
---|---|---|---|
20200044667 A1 | Feb 2020 | US |
Number | Date | Country | |
---|---|---|---|
62714607 | Aug 2018 | US |