Reed-Solomon (RS) codes and Bose-Chaudhuri-Hocquenghem (BCH) codes are often employed for forward error correction (FEC) in modern communication channels, introducing sufficient redundancy to enable the receiver to correct noise-induced symbol detection errors. RS and BCH codes treat each block of data as a set of polynomial coefficients. This message polynomial is multiplied by a “generator” polynomial known to both the encoder and decoder to determine the “code word” polynomial corresponding to the message to be sent. The generator polynomial is derived based on the desired length of the code word and the desired Hamming distance between code words.
Note that RS or BCH codes (and hence the decoders) are often only one component of a sophisticated FEC strategy. For example, the International Telecommunication Union Standard ITU-T G.709.2/Y.1331.2 (07/2018) specifies an FEC strategy for the OTU4 long-reach interface (employed as part of the Optical Internetworking Forum's Implementation Agreement OIF-400ZR-01.0) that includes an error decorrelator function, a staircase FEC code, and a scrambler. The OIF-400ZR-01.0 further augments the ITU-T standard with the addition of a convolutional interleaver and an inner Hamming code. In both cases, the staircase code incorporates a BCH code as a component, necessitating the use of one or more BCH decoders at the receiving end.
RS and BCH code variations exist, such as shortening, puncturing, or extending the code messages to respectively reduce word length, remove redundancy, or add redundancy, while still enabling reuse of existing decoder designs. The ITU-T standard provides for the use of just such a shortened, extended BCH code. This code takes a block of k=990 message bits and multiplies it with a generator or parity matrix to obtain 32 parity bits, yielding n=1022 code word bits with a minimum Hamming distance d=8. To summarize these parameters, the code may be referred to as a BCH(1022,990,8) code.
There exist many RS and BCH decoding techniques, most of which proceed as follows: first derive error syndrome values S_i from the received version of the code word polynomial; determine the number of symbol errors; use the error syndrome values to determine coefficients of the appropriate error locator polynomial; operate on the error locator polynomial to find its roots (which indicate the error locations); calculate the error values if needed; and then correct the errors. Existing decoder techniques employ iterative procedures that are not amenable to parallelization or implementation at ever-higher data rates.
Accordingly, there are disclosed herein circuits and methods for correcting bit errors in a received version of an RS or BCH encoded bit stream. One illustrative circuit includes: a syndrome calculator, a location finder, and an error corrector. The syndrome calculator has a first array of logic gates to obtain syndrome values as a product of a receive message vector and a parity check matrix, the syndrome values including at least a first ten-bit syndrome value S1, a second ten-bit syndrome value S3, and a third ten-bit syndrome value S5. The location finder derives a number of errors from the syndrome values, and includes a second array of logic gates to obtain two polynomial roots as a product of a syndrome value vector and a quadratic solution matrix when the number of errors is two, the quadratic solution matrix corresponding to a determination of a quadratic equation's trailing coefficient value s, a determination of the quadratic equation's roots, and a reversal of a variable substitution. The location finder further includes an index circuit to determine a bit index for each of the polynomial roots. The error corrector receives for each receive message vector a set of zero or more bit indexes representing error locations in the receive message vector.
An illustrative error correction method includes: obtaining syndrome values corresponding to a product of a receive message vector and a parity check matrix, the syndrome values including at least a first ten-bit syndrome value S1, a second ten-bit syndrome value S3, and a third ten-bit syndrome value S5; converting the syndrome values into a set of zero or more polynomial roots representing error locations in the receive message vector; and determining a bit index for each polynomial root in the set. The converting operation includes: deriving a number of errors from the syndrome values; and when the number of errors is two, using an array of logic gates to obtain two polynomial roots corresponding to a product of a syndrome value vector and a quadratic solution matrix, the quadratic solution matrix corresponding to a determination of a quadratic equation's trailing coefficient value s, a determination of the quadratic equation's roots, and a reversal of a variable substitution.
Another illustrative error correction method includes: obtaining syndrome values corresponding to a product of a receive message vector and a parity check matrix, the syndrome values including at least a first ten-bit syndrome value S1, a second ten-bit syndrome value S3, and a third ten-bit syndrome value S5; converting the syndrome values into a set of zero or more polynomial roots corresponding to error locations in the receive message vector; and determining a bit index for each polynomial root in the set. The converting includes: deriving a number of errors from the syndrome values; and when the number of errors is three, using a first lookup table with supporting logic gates to obtain three polynomial roots when (S5+S1^5)=0, the three polynomial roots corresponding to roots of a cubic equation x^3+d with d representable as S3+S1^3, the first lookup table having a depth of no more than 341.
Each of the foregoing embodiments may be implemented individually or in combination, and may be implemented with one or more of the following features in any suitable combination: 1. the reversal of the variable substitution is representable as x=S1·t, and the trailing coefficient value is representable as s=(S3+S1^3)/S1^3.
2. the first array of logic gates and the second array of logic gates are each formed by a set of logical AND gates to implement bitwise multiplications and a set of logical XOR gates to implement bitwise additions. 3. the syndrome values include an error parity value P. 4. the error corrector is configured to accumulate a parity delta of the parity check matrix based on the bit indexes and configured to invert receive message vector bits corresponding to the bit indexes if the parity delta indicates the bit errors are correctable. 5. the location finder further includes a first lookup table with supporting logic gates to obtain three polynomial roots when the number of errors is three and (S5+S1^5)=0, the three polynomial roots corresponding to roots of a cubic equation x^3+d with d representable as S3+S1^3, and the first lookup table having a depth of no more than 341. 6. the first lookup table contains a single polynomial root or error location and the supporting logic gates derive remaining polynomial roots or error locations from the single polynomial root or error location. 7. the location finder further includes a second lookup table with supporting logic gates to obtain three polynomial roots when the number of errors is three and (S5+S1^5)≠0, the three polynomial roots corresponding to roots of a cubic equation x^3+x+d with d representable as
and the second lookup table having a depth of no more than 170. 8. the second lookup table contains two polynomial roots or error locations and the supporting logic gates for the second lookup table derive a remaining polynomial root or error location from the two polynomial roots or error locations contained in the second lookup table. 9. the location finder is configured to provide syndrome value S1 as a polynomial root when the number of errors is one. 10. the receive message vector contains exactly 1022 bits and the parity check matrix Hcomp has 32×1022 binary elements representing powers of a primitive polynomial root in a composite field GF((2^5)^2). 11. deriving the number of errors includes: determining that the number of errors is zero if the syndrome values are all zero; determining that the number of errors is one if syndrome value S1 is nonzero and remaining syndrome values are all zero; determining that the number of errors is two if (S5+S1^5)S1=(S3+S1^3)S3; and otherwise determining that the number of errors is three. 12. the method includes verifying that roots exist for the quadratic equation if the number of errors is two by determining that a binary representation of trailing coefficient s has a bit s7=0.
While specific embodiments are given in the drawings and the following description, keep in mind that they do not limit the disclosure. On the contrary, they provide the foundation for one of ordinary skill to discern the alternative forms, equivalents, and modifications that are encompassed in the scope of the appended claims.
To provide an illustrative context for understanding the myriad applications of the disclosed decoders,
To enable robust performance over even extended cable lengths, each connector 100, 101 may include a powered transceiver that performs electrooptical signal conversion combined with clock and data recovery (CDR) and re-modulation of data streams in each direction. The powered transceivers are also known as data recovery and re-modulation (DRR) devices. In at least one contemplated embodiment, the cable connectors 100, 101 are quad small form-factor pluggable double density (QSFP-DD) transceiver modules that exchange 400GAUI-8 data streams with the host.
In at least some contemplated embodiments, the printed circuit boards each also support a micro-controller unit (MCU) 206. Each DRR device 202, 204 is coupled to a respective MCU device 206 which configures the operation of the DRR device via a first two-wire bus. At power-on, the MCU device 206 loads equalization parameters and/or other operating parameters from Flash memory 207 into the DRR device's configuration registers 208. The host device can access the MCU device 206 via a second two-wire bus that operates in accordance with the I2C bus protocol and/or the faster MDIO protocol. With this access to the MCU device 206, the host device can adjust the cable's operating parameters and monitor the cable's performance.
Each DRR device 202, 204, includes a set 220 of transmitters and receivers for communicating with the host device and a set 222 of transmitters and receivers for communications via the optical transceivers and intervening optical fibers. The illustrated host-facing transceivers 220 support eight lanes (400GAUI-8) for bidirectional communication with the host device. In other contemplated embodiments, the host-facing transceivers 220 support other data rates and lane configurations. The DRR devices include a memory 224 to provide first-in first-out (FIFO) buffering between the transmitter & receiver sets 220, 222. An embedded controller 228 coordinates the operation of the transmitters and receivers by, e.g., setting initial equalization parameters and ensuring the training phase is complete across all lanes and links before enabling the transmitters and receivers to enter the data transfer phase. The embedded controller 228 employs a set of registers 208 to receive commands and parameter values, and to provide responses potentially including status information and performance data.
The illustrative cable of
The Application Layer 308 is the uppermost layer in the model, and it represents the user applications or other software operating on different systems that need a facility for communicating messages or data. The Presentation Layer 310 provides such applications with a set of application programming interfaces (APIs) that provide formal syntax along with services for data transformations (e.g., compression), establishing communication sessions, connectionless communication mode, and negotiation to enable the application software to identify the available service options and select therefrom. The Session Layer 312 provides services for coordinating data exchange including: session synchronization, token management, full- or half-duplex mode implementation, and establishing, managing, and releasing a session connection. In connectionless mode, the Session Layer may merely map between session addresses and transport addresses.
The Transport Layer 314 provides services for multiplexing, end-to-end sequence control, error detection, segmenting, blocking, concatenation, flow control on individual connections (including suspend/resume), and implementing end-to-end service quality specifications. The focus of the Transport Layer 314 is end-to-end performance/behavior. The Network Layer 316 provides a routing service, determining the links used to make the end-to-end connection and when necessary acting as a relay service to couple together such links. The Data Link Layer 318 serves as the interface to physical connections, providing delimiting, synchronization, and sequence and flow control across the physical connection. It may also detect and optionally correct errors that occur across the physical connection. The Physical Layer 322 provides the mechanical, electrical, functional, and procedural means to activate, maintain, and deactivate channels 306, and to use the channels 306 for transmission of bits across the physical media.
The Data Link Layer 318 and Physical Layer 322 are subdivided and modified slightly by IEEE Std 802.3-2015, which provides a Media Access Control (MAC) Sublayer 320 in the Data Link Layer 318 to define the interface with the Physical Layer 322, including a frame structure and transfer syntax. Within the Physical Layer 322, the standard provides a variety of possible subdivisions such as the one illustrated in
The optional Reconciliation Sublayer 324 merely maps between interfaces defined for the MAC Sublayer 320 and the PCS Sublayer 326. The PCS Sublayer 326 provides alignment marker insertion/removal, FEC, and framing with synchronization and training sequences. The PMA Sublayer 330 provides symbol encoding/decoding, filtering, and conversion between analog and digital signal formats. The PMD Sublayer 332B specifies the optical transceiver conversions between transmitted/received channel signals and the corresponding electrical signals. A receptacle 336 is also shown as part of the PMD sublayer 332 to represent the physical network interface port.
The connectors 100, 101, have plugs 200, 201 representing edge connectors that mate with the receptacles 336 of the two host devices 302, 304. Within each connector, the DRR devices may implement a host-facing Physical Layer 322A, a center-facing Physical Layer 322B, and a Data Link Layer 340 that bridges together the two Physical Layers. In some embodiments, one or more of the internal sublayers within each connector (e.g., PCS, Reconciliation, MAC) are bypassed or omitted entirely to reduce areal requirements and/or to reduce power. More information regarding the operation of the sublayers, as well as the electrical and physical specifications of the connections between the nodes and the communications medium (e.g., pin layouts, line impedances, signal voltages & timing), and the physical specifications for the communications medium itself (e.g., limitations on attenuation, dispersion), can in many cases be found in the current standard, and any such details should be considered to be well within the knowledge of those having ordinary skill in the art.
Generic mapping procedure (GMP) modules 404, 405 each provide a transition between the local system clock domain and the line clock domain, typically providing word padding to accommodate mismatches in clock rates. An alignment marker insertion module 406 provides alignment markers that enable alignment between different lanes of the data stream. Detector module 407 detects and removes the alignment markers. Note that these modules may cooperate with alignment marker insertion/removal modules in other sublayers to preserve alignment marker content across the transcoding process.
A Cyclic Redundancy Check calculation module 408 adds checksum information to the data stream, which is verified on the receiving side by a check module 409 to detect data corruption. An error decorrelation interleaver module 410 redistributes symbols of the data stream, an operation that is reversed by de-interleaver module 411 on the receive side. Staircase encoder module 412 implements a staircase code having BCH component codes. Staircase decoder module 413 reverses this operation as described further below.
Module 414 appends pad bits to align the staircase code words with 400ZR frame boundaries, and applies a predefined scrambling mask. Module 415 reverses these operations on the receive side. A convolutional interleaver 416 disperses bits from the code words to increase resilience to burst errors. De-interleaver module 417 reverses this operation on the receive side.
Hamming encoder module 418 applies a double-extended Hamming Code SD-FEC (128,119) to provide additional net coding gain. Module 419 performs the decoding operation on the receive side. Symbol mapper module 420 maps the coded data stream eight bits at a time to dual polarization 16-point quadrature amplitude modulation (DP-16QAM) symbols, which are then time interleaved by symbol interleaver module 422. The receive chain includes corresponding modules 421, 423 to reverse these operations.
To aid with timing synchronization and clock recovery, module 424 provides a frame alignment word (FAW) at the beginning of each super-frame, which includes 49 subframes each having a training sequence provided by module 424. Module 424 further inserts a pilot symbol every 32 symbols. In the receive chain, module 425 uses the FAW to detect the alignment of super-frames before removing the pilot symbols, training sequences, and FAWs.
Modules 426, 427 represent the PMA sublayer, in which digital signal processing may be performed for spectrum shaping and signal equalization and conversion between digital and analog signal domains. On the receive side, module 427 performs clock recovery as part of the analog-to-digital conversion. Modules 428, 429 represent the PMD sublayer, in which the optical transceivers convert electrical transmit signals to optical signals in the fiber and convert received optical signals to electrical receive signals on the receive side.
The operation of decoder module 413 is illustrated by the block diagram in
Memory 502 may be organized as bytes or words, causing the memory locations to be accessed in an easy-to-read fashion during initial processing of the block but to be accessed in a more distributed fashion during the subsequent processing of that block. To facilitate subsequent processing, the decoders 506 may form transpositions of each block during the initial processing, storing the transposed blocks in a set of transpose memory blocks 512. Further implementation details can be found in the literature, including, e.g., D. Truhachev et al., "Efficient Implementation of 400 Gbps Optical Communication FEC", IEEE Trans. Circuits & Systems I, vol. 68, no. 1, Jan. 2021.
It is desirable for each of the BCH decoders 506 to be implemented as efficiently as possible to facilitate high throughput with minimal power consumption.
Conventional implementations of the location finder 604 employ iterative procedures for calculating and factoring the error location polynomial, e.g., the Berlekamp-Massey algorithm with a Chien search. It is challenging to accommodate such iteration when decoding high bandwidth data streams. The enhanced decoding method disclosed below may offer at least four potential advantages over existing decoder implementations: (1) the properties of the trace function are exploited for a fast, non-LUT-based determination of whether the error polynomial has any roots; (2) the error polynomial factorization operations are simplified through the use of a composite field; (3) the basis conversion to the composite field is precomputed and thus incurs no overhead; and (4) the LUT for finding the cubic roots is made much smaller.
Before describing the enhanced decoding method and decoder of
The underlying field of the (1022, 990) extended BCH code used in the ITU-T G.709.2/Y.1331.2 standard is GF(2^10), which can be constructed by the primitive polynomial p(x)=x^10+x^3+1. Let α be the root of the primitive polynomial p(x). The non-zero field elements of GF(2^10) can be represented as α^i, 0≤i≤1022, which we refer to as the "power" representation, where α^1023=α^0=1. A consequence of the field's construction is that its elements can be expressed as a weighted sum of the first ten powers, i.e., we can write α^i = b9·α^9 + b8·α^8 + . . . + b0, 0≤i≤1022, where the coefficients b9, . . . , b0 are binary. For indexing convenience, we refer to the integer l = b9·2^9 + b8·2^8 + . . . + b0 as the "binary" representation of α^i. We define log(l)=i.
BCH encoding is accomplished using a generator matrix defined in terms of these field elements. To enable systematic encoding (where the code words include the original data word concatenated with a set of parity symbols), the generator matrix columns may be permuted to make a portion of the generator matrix resemble an identity matrix. The standard achieves this result by defining a permutation function Π_d on the integers i, 0≤i≤509. In the following, Π_d(M:M+N)=K:K+N is a shorthand for Π_d(M)=K, Π_d(M+1)=K+1, . . . , Π_d(M+N)=K+N. Values of Π_d are specified via the following Table 1.
Consider the function f(i) which maps an integer i,1≤i≤1023, to the column vector
where β_i = α^log(i), and F(β_i)=(b2l&
Then replace each element in the first three rows of H by its corresponding 10-bit binary representation l and perform elementary row operations (over GF(2)) on H to obtain its row-reduced echelon form with the identity matrix on the right, H_ENC=[P^T; I]. The generator matrix is then G=[I; P], where the resulting 990×32 matrix P provides the encoder's parity-generating matrix.
The encoder may be implemented as a multiplication of a 1×990 vector (the 990-bit message) by a 990×1022 matrix (the generator matrix). The matrix element multiplications and additions are performed over GF(2), where AND gates and XOR gates respectively correspond to multipliers and adders. The encoder can thus be implemented as an AND-XOR gate array.
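For illustration, a minimal SystemVerilog sketch of such a gate array follows; because the code is systematic, only the 32 parity bits need to be computed, and the constant P_COLS is a placeholder for the precomputed columns of the parity-generating matrix P derived above (module and signal names are illustrative):

module bch_parity_gen #(
  parameter int K = 990,   // message length
  parameter int R = 32     // number of parity bits
) (
  input  logic [K-1:0] msg,
  output logic [R-1:0] parity
);
  // Placeholder for the precomputed columns of the parity-generating matrix P.
  localparam logic [K-1:0] P_COLS [R] = '{default: '0};

  always_comb begin
    for (int j = 0; j < R; j++)
      // AND gates form the bitwise products; the reduction XOR sums them over GF(2).
      parity[j] = ^(msg & P_COLS[j]);
  end
endmodule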
Multiplicative inversion in GF(2^10) using the polynomial basis constructed via p(x)=x^10+x^3+1 is complicated. If it is implemented via a lookup table (LUT), the LUT depth would be 1023, which lengthens the critical path. However, GF(2^10) is isomorphic to the composite field GF((2^5)^2), enabling us to exploit the composite field to simplify the multiplicative inversion. Additionally, the composite field simplifies the solving of quadratic equations, since operations in the composite field GF((2^5)^2) can be decomposed into several operations in the sub-field GF(2^5).
To use the composite field in the (1022, 990) extended BCH encoder/decoder of the ITU-T standard, we must consider basis conversion. We note that the generating polynomial of the (1022, 990) BCH code is the same whether the underlying field is GF(2^10) in the polynomial basis constructed via p(x)=x^10+x^3+1 or the composite field, because the minimal polynomials of α, α^3, and α^5 do not depend on which underlying field representation is chosen.
Hence, (1022, 990) BCH encoding with the composite field representation is the same as encoding with the polynomial representation defined in the ITU-T standard. For the decoder, however, we do not use the H matrix defined in the standard directly. Rather, we convert the H matrix to Hcomp in the composite field. The conversion can be performed as follows:
The conversion between H and Hcomp is pre-computed, i.e., no extra hardware cost is needed. The decoding, including syndrome calculation, error location polynomial determination, and error location polynomial factorization, can be done over the composite field GF((2^5)^2). Next, we will introduce the composite field construction.
Let's first review the algebraic structure of a composite field GF(2^(mn)). GF(2^(mn)) can be constructed via an irreducible polynomial of degree m over GF(2^n), where GF(2^n) is called the ground field, which in turn can be constructed via an irreducible polynomial of degree n over GF(2). When gcd(m,n)=1, we can use two irreducible polynomials over GF(2) to construct the composite field. For GF((2^5)^2), we can choose f5(x)=x^5+x^2+1 to construct the sub-field GF(2^5) over GF(2) and f2(x)=x^2+x+1 to construct GF((2^5)^2) over GF(2^5).
Though it is not the case here, we note that if n is 2 and m is an even number, we can't use the irreducible polynomial f2(x)=x^2+x+1 to construct GF((2^m)^2) over GF(2^m). Instead, we could find an irreducible polynomial of the form f2(x)=x^2+x+θ^i, where θ is the primitive element in GF(2^m), by computing the trace value Tr(θ^i) for various values of i. If Tr(θ^i)=1, f2(x)=x^2+x+θ^i is irreducible over GF(2^m); otherwise it is reducible. We can choose the irreducible polynomial whose θ^i has the lowest Hamming weight to construct GF((2^m)^2) over GF(2^m).
Because we are contemplating performing our operations over GF(2^5), we note the existence of efficient solutions for performing generic multiplication of low Hamming weight polynomials, such as trinomials or pentanomials, over GF(2^n). The SystemVerilog function is:
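A minimal sketch of one such function, assuming the GF(2^5) construction via f5(x)=x^5+x^2+1 (the shift-and-add structure and the names used here are illustrative choices):

function automatic logic [4:0] gf25_mult (input logic [4:0] a, input logic [4:0] b);
  localparam int N = 5;
  localparam logic [N-1:0] POLY = 5'b00101;  // x^5 mod f5(x) = x^2 + 1
  logic [N-1:0] acc;
  begin
    acc = '0;
    for (int i = N-1; i >= 0; i--) begin
      // Multiply the accumulator by x and reduce modulo f5(x).
      acc = {acc[N-2:0], 1'b0} ^ (POLY & {N{acc[N-1]}});
      // Conditionally add 'a' when bit i of 'b' is set (AND gates plus XOR gates).
      acc = acc ^ (a & {N{b[i]}});
    end
    return acc;
  end
endfunction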
Accordingly, multiplication of such polynomials in GF(2^5) can be implemented via an arrangement of logical AND ("&") and logical XOR ("^") gates pursuant to the above method with N=5.
Now we introduce multiplication in the composite field. Let β be the root of the irreducible polynomial f2(x)=x^2+x+1, i.e., β^2=β+1. Let a, b, c ∈ GF((2^5)^2), c=a·b, where a=a1·β+a0, b=b1·β+b0, c=c1·β+c0, and a0, a1, b0, b1, c0, and c1 ∈ GF(2^5). Expanding the product and substituting β^2=β+1, we can derive the formulas for the coefficients c0 and c1: c1 = a1·b1 + a1·b0 + a0·b1 and c0 = a1·b1 + a0·b0. By Karatsuba's method, c1 = (a1+a0)·(b1+b0) + a0·b0, so the composite multiplication requires only three GF(2^5) multiplications (a1·b1, a0·b0, and (a1+a0)·(b1+b0)) plus a few XOR gates.
For GF(2^n) with a polynomial basis representation, we can use the following formula to compute the square of a field element. Let a ∈ GF(2^n), and a = Σ_{i=0}^{n-1} a_i·α^i, where a_i ∈ GF(2), and α is the polynomial basis generator of GF(2^n). Because all cross terms cancel over GF(2), a^2 = Σ_{i=0}^{n-1} a_i·α^(2i).
Next, we use the field generating polynomial to perform the reduction. Consider GF(2^5) constructed via f5(x)=x^5+x^2+1. Let a = a4·α^4 + a3·α^3 + a2·α^2 + a1·α + a0, so that c = a^2 = a4·α^8 + a3·α^6 + a2·α^4 + a1·α^2 + a0. By α^8 = α^5·α^3 = (α^2+1)·α^3 = α^3+α^2+1 and α^6 = α^3+α, we have:
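c = a^2 = (a0+a4) + a3·α + (a1+a4)·α^2 + (a3+a4)·α^3 + a2·α^4, i.e., c0 = a0+a4, c1 = a3, c2 = a1+a4, c3 = a3+a4, and c4 = a2. These component expressions follow directly from the reductions above and require only a few XOR gates.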
For the square of a composite field element,
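with a = a1·β + a0 and β^2 = β+1, the square works out to a^2 = a1^2·β + (a1^2 + a0^2), i.e., c1 = a1^2 and c0 = a1^2 + a0^2, so only two GF(2^5) squarings and one GF(2^5) addition are needed.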
Turning now to square root operations in the composite field, we first review the algorithm for computing the square root of a non-zero element in GF(2^5). Let a, c ∈ GF(2^5) and c=a^2. We use the vector forms [a0, a1, a2, a3, a4] and [c0, c1, c2, c3, c4] to represent a and c, where a0~a4 and c0~c4 ∈ GF(2). The square in GF(2^5) can be treated as a 1×5 vector multiplied by a 5×5 matrix, i.e., [c0, c1, c2, c3, c4]=[a0, a1, a2, a3, a4]·A, where
Any element has only one square root, i.e., matrix A is invertible to
by which we have
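a0 = c0+c1+c3, a1 = c1+c2+c3, a2 = c4, a3 = c1, and a4 = c1+c3, obtained by inverting the squaring map A for this construction. In hardware the square root is thus a handful of XOR gates; a minimal SystemVerilog sketch, assuming the [a0 . . . a4] bit ordering used above (the function name is illustrative):

function automatic logic [4:0] gf25_sqrt (input logic [4:0] c);
  logic [4:0] a;
  begin
    a[0] = c[0] ^ c[1] ^ c[3];
    a[1] = c[1] ^ c[2] ^ c[3];
    a[2] = c[4];
    a[3] = c[1];
    a[4] = c[1] ^ c[3];
    return a;
  end
endfunction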
Now let's derive the formula for computing the square root of a non-zero element in GF((2^5)^2). By Equation 3, we have:
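c1 = a1^2 and c0 = a1^2 + a0^2, where c = c1·β + c0 is the given element and a = a1·β + a0 is its square root. Hence a1 = sqrt(c1) and a0 = sqrt(c0 + c1), where each square root is computed in GF(2^5) as shown above.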
We turn now to showing how to perform multiplicative inversion in the composite field. In GF(2^n), any non-zero field element can be represented as a power of the primitive element. A 31-depth LUT can be used to implement the multiplicative inversion in GF(2^5). For multiplicative inversion in the composite field GF((2^5)^2), let a, b ∈ GF((2^5)^2), a=a1·β+a0, b=b1·β+b0, where a0, a1, b0, and b1 ∈ GF(2^5). a0 and a1 can't both be zero. From a·b=1, we can get two linear equations for b0 and b1:
By solving Equations 5 & 6, we have:
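b1 = a1·Δ^(-1) and b0 = (a0+a1)·Δ^(-1), where Δ = a1^2 + a1·a0 + a0^2; this is the standard composite-field inversion identity and is consistent with the hardware described next.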
Thus, composite field GF((2^5)^2) inversion can be implemented via a set of GF(2^5) multipliers, squarers, and one GF(2^5) inverter (which can be implemented via a 31-depth LUT).
With the foregoing groundwork done, we can efficiently determine the error location polynomial. And rather than relying on the Chien search algorithm to factor the polynomial, which would impose a latency proportional to the code size, we note that the correction power of the (1022, 990) extended BCH code is 3, limiting the degree of the error location polynomial to three. Fast factorization can be achieved by solving the cubic, quadratic, or linear polynomial equations in the composite field.
We start with solving quadratic equations. The equation x^2=a, where a ∈ GF(2^n), always has a (double) root in GF(2^n). For a receive message having two errors, the error locator polynomial will take the form x^2+a·x+b=0, where a, b ∈ GF(2^n), a≠0. This form can be converted, by the substitution x=a·t and division by a^2, into Equation 9: t^2+t+s=0, where s=b/a^2. Equation 9 has two roots, r and r+1, if and only if the trace of s from GF(2^n) to GF(2) is zero.
Before showing how to solve Equation 9 in the composite field, we examine the solution in GF(2^10), which is constructed via p(x)=x^10+x^3+1. The GF(2^10) element s can be represented via the polynomial basis {1, α, α^2, . . . , α^9} as s = Σ_{i=0}^{9} s_i·α^i, where α is the root of p(x) and s_i ∈ GF(2). It can be shown that the trace Tr(α^i) is zero for all i≠7, meaning that Tr(s) = s7.
If there exist roots in GF(2^10) for the equation x^2+x+s=0, then Tr(s) = s7 = 0. When this coefficient s7 is zero, let x1 be one root. Then the other root is x2 = x1+1. By the Hilbert constructive method:
The α7·2
In the case where the underlying field is the composite field GF((2^5)^2), s ∈ GF((2^5)^2) and it can be represented as s=s1·β+s0, where s0, s1 ∈ GF(2^5). Representing one root as c=c1·β+c0, we have the equation c1^2·β^2 + c0^2 + c1·β + c0 = s1·β + s0. By replacing β^2 with β+1,
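the β components and the constant components separate into two sub-field equations, matching the Equations 14 and 15 referenced below: c1^2 + c1 = s1 (Equation 14) and c0^2 + c0 = s0 + c1^2 (Equation 15).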
Equations 14 and 15 are quadratic equations in the subfield GF(2^5), making them simpler to solve. Equation 14 either has two different roots in GF(2^5) or it has no roots in GF(2^5). If Equation 14 has no roots in GF(2^5), there will be no roots in the composite field for Equation 9. Assuming the roots exist, one root is r and the other is r+1.
Replace c1 in Equation 15 with r and r+1. One of the two resulting trace values, Tr(s0+r^2) and Tr(s0+(r+1)^2), is zero, and the other must be 1. Keep the root of c1 that satisfies the zero-trace condition. Given the selected root of c1, we can then find the two roots of c0 for Equation 12.
A depth-31 LUT could be used to determine r, and hence r+1, for Equations 14 and 15, but a simpler method is available for solving the quadratic equation x^2+x+s=0 in GF(2^5) (the method is based on Hilbert's Theorem 90).
If the equation x^2+x+s=0 has roots in GF(2^5), then Tr(s)=0. Proof: applying the trace mapping to both sides of the equation gives Tr(x^2)+Tr(x)+Tr(s)=0, and since Tr(x^2)=Tr(x), it follows that Tr(s)=0. Conversely, if Tr(s)=0 in GF(2^5), then the half-trace x1 = s + s^4 + s^16 is one root of the equation, and the other root is x1+1.
Using the groundwork above, the roots can be found using an arrangement of logic gates; we don't need any LUTs to solve x^2+x+s=0 in either GF(2^5) or GF((2^5)^2). Once these roots are found, the variable substitution for Equation 9 can be reversed to determine the roots of the quadratic error location polynomial. These computations can be expressed as a multiplication of the coefficient vector s with a 10×10 matrix whose entries are GF(2) elements.
Having chosen the irreducible polynomial f5(x)=x^5+x^2+1 to construct GF(2^5), the trace calculation to determine whether roots exist is easy. Let α be the root of f5(x). Then a field element a can be represented via the polynomial basis {1, α, α^2, α^3, α^4} as a = Σ_{i=0}^{4} a_i·α^i. As Tr(α)=Tr(α^2)=Tr(α^4)=0 and Tr(1)=Tr(α^3)=1, Tr(a) = a0 + a3.
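In hardware this existence test reduces to a single XOR; a minimal sketch, assuming the bit ordering s[0] . . . s[4] used above (signal names are illustrative):

  // Roots of x^2 + x + s exist in GF(2^5) iff Tr(s) = s[0] ^ s[3] = 0.
  assign quad_has_roots = ~(s[0] ^ s[3]);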
Turning now to factoring cubic equations in the composite field GF((2^5)^2), we only need to consider two types of cubic equations: x^3+d=0, where d≠0, and x^3+x+d=0. This simplification results from an ability to convert the general cubic equation, x^3+a·x^2+b·x+c=0, a≠0, a, b, c ∈ GF((2^5)^2), to these two types. The conversion is done by variable substitution, replacing x with t+a, yielding t^3+(a^2+b)·t+(a·b+c)=0.
If a^2+b=0, we have converted the general cubic equation to the first type. Otherwise, when a^2+b≠0, replace t with s·y, yielding s^3·y^3 + (a^2+b)·s·y + (a·b+c) = 0. Dividing this by s^3 gives y^3 + ((a^2+b)/s^2)·y + (a·b+c)/s^3 = 0. Choosing s such that s^2 = a^2+b and letting d = (a·b+c)/s^3, we obtain the second cubic equation type, y^3+y+d=0.
For the first type, either we can find 3 different roots or no roots in GF((2^5)^2). If we find a root r, the other two roots will be r·β and r·β^2, where β^2+β+1=0 (and hence β^3=1). Let r=r1·β+r0; then we have
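r·β = (r0+r1)·β + r1 and r·β^2 = r0·β + (r0+r1), obtained by expanding the products with β^2=β+1; these correspond to the Equations 16 and 17 referenced below.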
We can use a LUT that stores only one root, enabling the use of a LUT with a depth of only 341. The other two roots can be calculated via Equations 16 and 17 by an arrangement of logic gates. The logic gate arrangement can then reverse the variable substitution to provide the roots of the original cubic equation.
In GF((2^5)^2), for the second type of cubic equation, x^3+x+d=0:
Having covered techniques for efficiently finding roots of quadratic and cubic equations, we turn to a discussion of the decoder implementation, which begins with the syndrome calculator 602.
Though the composite field GF((2^5)^2) is isomorphic to GF(2^10), the H matrix defined in the ITU-T standard is not suitable for performing the syndrome calculation directly. Instead, we should use Hcomp for the syndrome calculation. For pre-computation of Hcomp, we find the roots of the primitive polynomial p(x)=x^10+x^3+1 defined in the ITU-T standard in the composite field GF((2^5)^2), which can be done by a brute-force search using the composite field multiplication defined above. There are 10 conjugate roots to be found. For simplicity, we take the root with the lowest Hamming weight to be the selected α, the root of p(x). In Verilog notation, it is 10'h0C1. With the composite field representation of α, we can compute α^i, α^3i, and α^5i for each column i of Hcomp. (The last two rows of Hcomp are the same as those of the ITU-T standard's H matrix.) Then we can calculate the syndrome of the received word via the following equation,
where the vector r = [r0, r1, . . . , r1021] denotes the binary bitstream of the received word. We can choose either a partial-parallel architecture or a full-parallel architecture to implement it. As an example, for a partial-parallel architecture, we can scan 256 bits each cycle (254 bits for the last cycle) over 4 cycles, accumulating the partial sum in each cycle into the syndrome register. For the full-parallel architecture, the syndrome calculator takes 1022-bit inputs. If it is difficult to complete such a large XOR summation within one cycle, we can insert 2 or 3 pipeline stage registers to shorten the critical path.
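A minimal sketch of the partial-parallel option, assuming the 256-bit slice width described above; h_slice is a placeholder for the 256 precomputed Hcomp columns associated with the current slice, and the module, port, and signal names are illustrative:

module syndrome_accum (
  input  logic         clk,
  input  logic         rst,
  input  logic         first_slice,     // asserted during the first 256-bit slice of a block
  input  logic [255:0] rx_bits,         // current slice of the received word
  input  logic [31:0]  h_slice [256],   // Hcomp columns corresponding to this slice
  output logic [31:0]  syndrome
);
  logic [31:0] partial;
  always_comb begin
    partial = '0;
    for (int i = 0; i < 256; i++)
      // AND-XOR accumulation of the columns selected by the received bits.
      partial ^= h_slice[i] & {32{rx_bits[i]}};
  end
  always_ff @(posedge clk) begin
    if (rst)              syndrome <= '0;
    else if (first_slice) syndrome <= partial;
    else                  syndrome <= syndrome ^ partial;
  end
endmodule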
As the (1022, 990) extended BCH code's error correction capability is only 3, the location finder 604 need not use the Berlekamp-Massey algorithm to find the error location polynomial and need not use the Chien search algorithm to factor the error location polynomial. Rather, the error location polynomial can be calculated directly from the syndrome. Once we find the error location polynomial, we can solve the linear, quadratic, or cubic equations as provided above to factor the error location polynomial and thereby determine the error locations.
The syndrome calculator 602 feeds a 32-bit syndrome vector to the location finder 604. The first 30 bits are the three GF((2^5)^2) evaluation results of the received polynomial at α, α^3, and α^5. The last 2 bits are for extra parity checking to lower the miscorrection rate. With an error correction capability of 3, there are four possibilities to consider for the number of errors in the received message: 0, 1, 2, or 3. The error corrector can determine which of the four possibilities applies based on the syndrome.
Refining the approach outlined by Truhachev et al., "Efficient Implementation of 400 Gbps Optical Communication FEC", IEEE Trans. Circuits & Systems I, vol. 68, no. 1, Jan. 2021, we first define the error location polynomial as follows, which is the reciprocal of the one defined in most textbooks. Let eloc(X)=Π(X+X_i), where X_i=α^(k_i) and the k_i denote the error positions.
is invertible.
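For reference, applying Newton's identities for a binary code (where S2 = S1^2 and S4 = S1^4) to this definition yields one consistent set of coefficient expressions, in line with the case distinctions below: for one error, eloc(X) = X + S1; for two errors, eloc(X) = X^2 + S1·X + (S3 + S1^3)/S1; and for three errors, eloc(X) = X^3 + S1·X^2 + σ2·X + σ3, where σ2 = (S5 + S1^2·S3)/(S3 + S1^3) and σ3 = (S3 + S1^3) + S1·σ2.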
To summarize, the conditions for the above 4 cases are as follows: there are no errors if S1, S3, and S5 are all zero; there is one error if S1 is nonzero while S3+S1^3 and S5+S1^5 are both zero; there are two errors if S3+S1^3 and S5+S1^5 are not both zero but (S5+S1^5)·S1 = (S3+S1^3)·S3; and otherwise there are three errors.
By now, we have shown how to calculate the error location polynomial from the three 10-bit syndromes. Using the factorization methods described above, if the number of roots found is less than the degree of the error location polynomial, we can conclude that the received word is not correctly decodable.
With the foregoing in mind,
Block 802 represents a determination of the syndrome values S1, S3, S5, P by syndrome calculator 602. This determination corresponds to a multiplication of the receive message vector by the Hcomp matrix, which is implementable by an array of logic gates (AND gates for multiplications and XOR gates for summing the products). Though shown as being contingent on later-described tests, blocks 812, 820, and 838 may be speculatively performed ahead of time by syndrome calculator 602, also via an array of logic gates that carry out the addition, multiplication, power, root, and inverse multiplication operations to determine the values of the relevant variables.
Block 804 represents a test by location finder 604 to determine if the syndrome values S1, S3, S5, are all zero, corresponding to the case where no correctable errors are present. Though shown as being contingent, blocks 806, 814, 822, and 832 may be speculatively performed by location finder 604 in parallel with block 804 to determine which error locating procedure should be performed. If syndrome values S1, S3, S5, are all zero, location finder 604 checks whether the parity value P is zero in block 806. If not, the receive message vector is flagged in block 808 as having uncorrectable errors. Regardless of whether there are no errors or the errors are uncorrectable, the receive message vector is produced as output data in block 810.
Block 812 represents the determination of the values D3=S3+S1^3 and D5=S5+S1^5. Block 814 represents a test of whether these values are both zero, corresponding to the case where the receive message contains a single error. If so, block 816 represents the location finder's determination of the single error location from the S1 syndrome value. As with the locations corresponding to roots of any of the polynomials, an indexing circuit can determine the bit index of an error in the receive message vector by comparing the first 10 bits of each column of Hcomp with the root. If the root equals the first 10 bits, the column index (which equals the bit index of the receive message vector) is an error location. A parity verification can be performed by accumulating the last two bits of the Hcomp columns at the error locations to calculate the parity delta syndrome. If the accumulated parity delta syndrome for all error locations is equal to the syndrome P value, then the errors are correctable.
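A minimal sketch of such an indexing comparison follows; h_top10 is a placeholder for the precomputed first ten bits of each Hcomp column (in a real design these are constants rather than ports), and the names are illustrative:

module error_indexer (
  input  logic [9:0]    root,
  input  logic [9:0]    h_top10 [1022],   // first 10 bits of each Hcomp column
  output logic [1021:0] err_mask          // asserted where the column matches the root
);
  always_comb begin
    for (int i = 0; i < 1022; i++)
      err_mask[i] = (h_top10[i] == root);
  end
endmodule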
Block 818 represents correction of the located error(s) by inverting the receive message bit at each error location. Though not expressly shown here, this correction may be made contingent on the determination that the located errors are correctable.
Block 820 represents the calculation of the product values D5S1 and D3S3. Block 822 represents a test of whether these values are equal, corresponding to the case where the receive message contains two errors. If so, block 824 represents the location finder's determination of coefficients for the quadratic error locator polynomial and use of variable substitution to obtain the trailing coefficient for Equation 9.
Using the principles described previously, an array of logic gates can be used to obtain a product of a syndrome value vector with a quadratic solution matrix to obtain the two roots corresponding to the two error locations. The solution matrix can incorporate a reversal of the variable substitution used to obtain Equation 9. Block 826 represents the indexing circuit using the roots to find the matching Hcomp column indices, which equal the bit indices of the receive message vector where the errors can be found and corrected in block 818 if the errors are determined to be correctable.
If blocks 804, 814, and 822 all yield negative tests, the receive message vector contains at least three errors. Block 832 represents a test by the location finder 604 to determine if the variable-substituted cubic polynomial is a cubic of the first type, in which case the trailing coefficient value is d=D3=S3+S1^3. In block 834 the location finder uses the depth-341 lookup table to obtain one of the three roots and uses it to calculate the other two roots. Supporting logic gates can be used to reverse the variable substitution, i.e., to add S1 to each root. Block 836 represents the indexing circuit using the roots to find the matching Hcomp column indices, which equal the bit indices of the receive message vector where the errors can be found and corrected in block 818 if the errors are determined to be correctable.
If the block 832 test is negative, the variable-substituted cubic is of the second type. Block 838 represents the calculation of the trailing coefficient value
In block 840 the location finder uses the depth-170 lookup table to obtain two of the three roots, combining them to calculate the third root. Supporting logic gates can be used to reverse the variable substitution, i.e., multiplying each of them by
Though operations are described sequentially, it should be understood that the operations can be reordered, parallelized, and/or combined. For example, block 824 can be implemented as an array of logic gates to obtain two polynomial roots as a product of a syndrome value vector and a quadratic solution matrix. The quadratic solution matrix may combine the operations corresponding to a determination of a quadratic equation's trailing coefficient value s, a determination of the quadratic equation's roots, and a reversal of a variable substitution. Any sequence of operations that can each be expressed as a linear transformation can be combined into a single combined linear transformation.
Each of the BCH decoder components (syndrome calculator, error location finder, and error correction logic) can be pipelined. The syndrome calculator operates to compute the product of a 1×1022 vector and a 1022×32 matrix. If all 1022 input bits are available, the calculation can be performed via an AND-XOR gate array. The XOR gates can be organized in a binary tree. One contemplated implementation uses 3 pipeline stage registers to shorten the critical path. One contemplated implementation of the location finder, including error location polynomial calculation and factorization of the error location polynomial, is pipelined with 4 stage registers. If the number of roots found in the underlying field is smaller than the degree of the error location polynomial, the locator designates the errors as uncorrectable. Otherwise, the error correction logic will perform a further check with the parity bits (the last two extended syndrome bits), accumulating the last two bits of the selected error location columns of the check matrix Hcomp to obtain delta syndromes. The error correction logic takes 4 cycles to complete this further check. If the decoder is parallel out, i.e., all 1022 output bits are sent out simultaneously, the error correction logic needs only one cycle to correct the error bits. Hence, the latency of the decoder is 12 cycles.
This BCH decoder design was synthesized using a commercially available technology library. The target frequency was selected as 937.5 MHz with 20% margin. The synthesis tool used was Synopsys Design Compiler. The total cell area of the BCH decoder came out to 9311 um^2. Standard voltage threshold (SVT) device area is 91%, low voltage threshold (LVT) device area is 7.54%, and ultra-low voltage threshold (ULVT) device area is 1.45%.
Numerous alternative forms, equivalents, and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the claims be interpreted to embrace all such alternative forms, equivalents, and modifications that are encompassed in the scope of the appended claims.