FAST EFFICIENT DECODER FOR LOW DISTANCE RS AND BCH CODES

Information

  • Patent Application
  • Publication Number: 20240388312
  • Date Filed: May 17, 2023
  • Date Published: November 21, 2024
Abstract
An illustrative decoder includes: a syndrome calculator, a location finder, and an error corrector. The syndrome calculator has an array of logic gates to obtain syndrome values as a product of a receive message vector and a parity check matrix, the syndrome values including at least three ten-bit syndrome values S1, S2, and S3. The location finder derives a number of errors from the syndrome values, and uses a second array of logic gates to obtain two polynomial roots as a product of a syndrome value vector and a quadratic solution matrix when the number of errors is two, the quadratic solution matrix corresponding to a determination of a quadratic equation's trailing coefficient value s, a determination of the quadratic equation's roots, and a reversal of a variable substitution. The location finder further determines a bit index for each of the polynomial roots.
Description
BACKGROUND

Reed-Solomon (RS) codes and Bose-Chaudhuri-Hocquenghem (BCH) codes are often employed for forward error correction (FEC) in modern communication channels, introducing sufficient redundancy to enable the receiver to correct noise-induced symbol detection errors. RS and BCH codes treat each block of data as a set of polynomial coefficients. This message polynomial is multiplied by a “generator” polynomial known to both the encoder and decoder to determine the “code word” polynomial corresponding to the message to be sent. The generator polynomial is derived based on the desired length of the code word and the desired Hamming distance between code words.
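The generator-polynomial encoding described above can be sketched in a few lines. The toy generator g(x)=x^4+x+1 and the 11-bit message below are hypothetical stand-ins for the standard's much larger code; polynomials are stored as Python integers with bit i holding the coefficient of x^i.

```python
# Sketch of systematic encoding over GF(2): parity bits are the remainder
# of m(x)*x^r divided by the generator polynomial g(x). The generator and
# message here are illustrative toys, not the standard's code.

def gf2_poly_mod(a: int, g: int) -> int:
    """Remainder of polynomial a modulo g, coefficients in GF(2)."""
    dg = g.bit_length() - 1
    while a.bit_length() > dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

def encode_systematic(msg: int, g: int) -> int:
    """Code word c(x) = m(x)*x^r + (m(x)*x^r mod g(x))."""
    r = g.bit_length() - 1          # number of parity bits
    shifted = msg << r
    return shifted ^ gf2_poly_mod(shifted, g)

g = 0b10011                          # toy generator: x^4 + x + 1
cw = encode_systematic(0b10110011101, g)
assert gf2_poly_mod(cw, g) == 0      # every code word is divisible by g(x)
```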


Note that RS or BCH codes (and hence the decoders) are often only one component of a sophisticated FEC strategy. For example, the International Telecommunication Union Standard ITU-T G.709.2/Y.1331.2 (07/2018) specifies an FEC strategy for the OTU4 long-reach interface (employed as part of the Optical Internetworking Forum's Implementation Agreement OIF-400ZR-01.0) that includes an error decorrelator function, a staircase FEC code, and a scrambler. The OIF-400ZR-01.0 further augments the ITU-T standard with the addition of a convolutional interleaver and an inner Hamming code. In both cases, the staircase code incorporates a BCH code as a component, necessitating the use of one or more BCH decoders at the receiving end.


RS and BCH code variations exist, such as shortening, puncturing, or extending the code messages to respectively reduce word length, to remove redundancy, or to add redundancy, while still enabling reuse of existing decoder designs. The ITU-T standard provides for the use of just such a shortened extended BCH code. This code takes a block of k=990 message bits and multiplies it with a generator or parity matrix to obtain 32 parity bits, yielding n=1022 code word bits with a minimum Hamming distance d=8. To summarize these parameters, the code may be referred to as a BCH(1022,990,8) code.


There exist many RS and BCH decoding techniques, most of which first derive error syndrome values Si from the received version of the code word polynomial; determine a number of symbol errors; use the error syndrome values to determine coefficients of the appropriate error locator polynomial; operate on the error locator polynomial to find its roots (which indicate the error locations); calculate the error values if needed; and then correct the errors. Existing decoder techniques employ iterative procedures that are not amenable to parallelization or implementation at ever-higher data rates.


SUMMARY

Accordingly, there are disclosed herein circuits and methods for correcting bit errors in a received version of a RS or BCH encoded bit stream. One illustrative circuit includes: a syndrome calculator, a location finder, and an error corrector. The syndrome calculator has a first array of logic gates to obtain syndrome values as a product of a receive message vector and a parity check matrix, the syndrome values including at least a first ten-bit syndrome value S1, a second ten-bit syndrome value S2, and a third ten-bit syndrome value S3. The location finder derives a number of errors from the syndrome values, and includes a second array of logic gates to obtain two polynomial roots as a product of a syndrome value vector and a quadratic solution matrix when the number of errors is two, the quadratic solution matrix corresponding to a determination of a quadratic equation's trailing coefficient value s, a determination of the quadratic equation's roots, and a reversal of a variable substitution. The location finder further includes an index circuit to determine a bit index for each of the polynomial roots. The error corrector receives for each receive message vector a set of zero or more bit indexes representing error locations in the receive message vector.


An illustrative error correction method includes: obtaining syndrome values corresponding to a product of a receive message vector and a parity check matrix, the syndrome values including at least a first ten-bit syndrome value S1, a second ten-bit syndrome value S2, and a third ten-bit syndrome value S3; converting the syndrome values into a set of zero or more polynomial roots representing error locations in the receive message vector; and determining a bit index for each polynomial root in the set. The converting operation includes: deriving a number of errors from the syndrome values; and when the number of errors is two, using an array of logic gates to obtain two polynomial roots corresponding to a product of a syndrome value vector and a quadratic solution matrix, the quadratic solution matrix corresponding to a determination of a quadratic equation's trailing coefficient value s, a determination of the quadratic equation's roots, and a reversal of a variable substitution.


Another illustrative error correction method includes: obtaining syndrome values corresponding to a product of a receive message vector and a parity check matrix, the syndrome values including at least a first ten-bit syndrome value S1, a second ten-bit syndrome value S2, and a third ten-bit syndrome value S3; converting the syndrome values into a set of zero or more polynomial roots corresponding to error locations in the receive message vector; and determining a bit index for each polynomial root in the set. The converting includes: deriving a number of errors from the syndrome values; and when the number of errors is three, using a first lookup table with supporting logic gates to obtain three polynomial roots when (S5+S1^5)=0, the three polynomial roots corresponding to roots of a cubic equation x^3+d with d representable as S3+S1^3, the first lookup table having a depth of no more than 341.


Each of the foregoing embodiments may be implemented individually or in combination, and may be implemented with one or more of the following features in any suitable combination: 1. the reversal of the variable substitution is representable as x=S1·t, and the trailing coefficient value is representable as

s = S3/S1^3 + 1.

2.
the first array of logic gates and the second array of logic gates are each formed by a set of logical AND gates to implement bitwise multiplications and a set of logical XOR gates to implement bitwise additions. 3. the syndrome values include an error parity value P. 4. the error corrector is configured to accumulate a parity delta of the parity check matrix based on the bit indexes and configured to invert receive message vector bits corresponding to the bit indexes if the parity delta indicates the bit errors are correctable. 5. the location finder further includes a first lookup table with supporting logic gates to obtain three polynomial roots when the number of errors is three and (S5+S1^5)=0, the three polynomial roots corresponding to roots of a cubic equation x^3+d with d representable as S3+S1^3, and the first lookup table having a depth of no more than 341. 6. the first lookup table contains a single polynomial root or error location and the supporting logic gates derive remaining polynomial roots or error locations from the single polynomial root or error location. 7. the location finder further includes a second lookup table with supporting logic gates to obtain three polynomial roots when the number of errors is three and (S5+S1^5)≠0, the three polynomial roots corresponding to roots of a cubic equation x^3+x+d with d representable as

d = {(S3+S1^3)^5 / (S5+S1^5)^3}^(1/2),

and the second lookup table having a depth of no more than 170. 8. the second lookup table contains two polynomial roots or error locations and the supporting logic gates for the second lookup table derive a remaining polynomial root or error location from the two polynomial roots or error locations contained in the second lookup table. 9. the location finder is configured to provide syndrome value S1 as a polynomial root when the number of errors is one. 10. the receive message vector contains exactly 1022 bits and the parity check matrix Hcomp has 32×1022 binary elements representing powers of a primitive polynomial root in a composite field GF((2^5)^2). 11. deriving the number of errors includes: determining that the number of errors is zero if the syndrome values are all zero; determining that the number of errors is one if syndrome value S1 is nonzero and remaining syndrome values are all zero; determining that the number of errors is two if (S5+S1^5)S1=(S3+S1^3)S3; and otherwise determining that the number of errors is three. 12. the method includes verifying that roots exist for the quadratic equation when the number of errors is two by determining that a binary representation of trailing coefficient s has a bit s7=0.
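The error-count derivation of feature 11 can be sketched as follows, assuming GF(2^10) arithmetic built from p(x)=x^10+x^3+1. Note the single-error test is written here via the standard binary-BCH syndrome identities S3=S1^3 and S5=S1^5 rather than the claim's phrasing, and all function names are illustrative, not the patent's.

```python
# Sketch of the error-count derivation (feature 11 above) over GF(2^10).
# The single-error test uses S3 = S1^3 and S5 = S1^5, the usual syndrome
# identities for a single bit error; names are ours, not the patent's.

P = (1 << 10) | (1 << 3) | 1         # p(x) = x^10 + x^3 + 1

def gf_mul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & (1 << 10):
            a ^= P
        b >>= 1
    return r

def gf_pow(a: int, e: int) -> int:
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def num_errors(S1: int, S3: int, S5: int) -> int:
    if S1 == 0 and S3 == 0 and S5 == 0:
        return 0
    if S3 == gf_pow(S1, 3) and S5 == gf_pow(S1, 5):
        return 1
    # two-error test from feature 11: (S5 + S1^5)S1 == (S3 + S1^3)S3
    if gf_mul(S5 ^ gf_pow(S1, 5), S1) == gf_mul(S3 ^ gf_pow(S1, 3), S3):
        return 2
    return 3

def syndromes(error_positions):
    """Syndromes of an error pattern: Sk = XOR of alpha^(k*i)."""
    alpha = 2                        # binary representation of alpha
    S1 = S3 = S5 = 0
    for i in error_positions:
        S1 ^= gf_pow(alpha, i % 1023)
        S3 ^= gf_pow(alpha, (3 * i) % 1023)
        S5 ^= gf_pow(alpha, (5 * i) % 1023)
    return S1, S3, S5

assert num_errors(*syndromes([])) == 0
assert num_errors(*syndromes([17])) == 1
assert num_errors(*syndromes([17, 900])) == 2
```

The two-error test follows because, for errors a and b, S3+S1^3 = ab·S1 and S5+S1^5 = ab·S3, so both sides of the feature-11 equality reduce to ab·S1·S3.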





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a perspective view of an illustrative active Ethernet cable.



FIG. 2 is a function-block diagram of an illustrative AEC.



FIG. 3 is an architectural diagram for a communications link including the illustrative cable.



FIG. 4 is a block diagram of an illustrative transmit chain and its corresponding receive chain.



FIG. 5 is a block diagram of an illustrative staircase code decoder.



FIG. 6 is a block diagram of an illustrative BCH decoder.



FIG. 7 shows the relationship of underlying fields to a composite field.



FIG. 8 is a flow diagram of an illustrative BCH decoding method.





DETAILED DESCRIPTION

While specific embodiments are given in the drawings and the following description, keep in mind that they do not limit the disclosure. On the contrary, they provide the foundation for one of ordinary skill to discern the alternative forms, equivalents, and modifications that are encompassed in the scope of the appended claims.


To provide an illustrative context for understanding the myriad applications of the disclosed decoders, FIG. 1 is a perspective view of an illustrative 400 Gbps cable that may be used to provide a high-bandwidth communications link between devices in a routing network such as that used for data centers, server farms, and interconnection exchanges. The routing network may be part of, or may include, for example, the Internet, a wide area network, or a local area network. The linked devices may be computers, switches, routers, and the like. The cable includes a first connector 100 and a second connector 101 that are connected via one or more optical fibers in a cord 106. The one or more optical fibers may each be configured for unidirectional or bidirectional communication of wave division multiplexed optical signals.


To enable robust performance over even extended cable lengths, each connector 100, 101 may include a powered transceiver that performs electrooptical signal conversion combined with clock and data recovery (CDR) and re-modulation of data streams in each direction. The powered transceivers are also known as data recovery and re-modulation (DRR) devices. In at least one contemplated embodiment, the cable connectors 100, 101 are quad small form-factor pluggable double density (QSFP-DD) transceiver modules that exchange 400GAUI-8 data streams with the host.



FIG. 2 is a function-block diagram of an illustrative cable of FIG. 1. Connector 100 includes a plug 200 adapted to fit a standard-compliant Ethernet port in a first host device 302 (see FIG. 3) to receive an electrical input signal carrying an outbound data stream from the host device and to provide an electrical output signal carrying an inbound data stream to the host device. Similarly, connector 101 includes a plug 201 that fits an Ethernet port of a second host device 304. Connector 100 includes a first DRR device 202 to perform CDR and re-modulation of the data streams entering and exiting the cable at connector 100, and connector 101 includes a second DRR device 204 to perform CDR and re-modulation of the data streams entering and exiting the cable at connector 101. Optical transceiver chips 240, 242 provide conversion between electrical and optical signals to convey the data streams across the optical fibers 107 coupled between the connectors 100, 101. The DRR devices 202, 204 may be integrated circuits mounted on a printed circuit board and connected to edge connector contacts and to optical transceivers 240, 242 via circuit board traces. Optical couplers attached to the transceivers 240, 242 may connect the transceivers to the optical fibers 107.


In at least some contemplated embodiments, the printed circuit boards each also support a micro-controller unit (MCU) 206. Each DRR device 202, 204 is coupled to a respective MCU device 206 which configures the operation of the DRR device via a first two-wire bus. At power-on, the MCU device 206 loads equalization parameters and/or other operating parameters from Flash memory 207 into the DRR device's configuration registers 208. The host device can access the MCU device 206 via a second two-wire bus that operates in accordance with the I2C bus protocol and/or the faster MDIO protocol. With this access to the MCU device 206, the host device can adjust the cable's operating parameters and monitor the cable's performance.


Each DRR device 202, 204, includes a set 220 of transmitters and receivers for communicating with the host device and a set 222 of transmitters and receivers for communications via the optical transceivers and intervening optical fibers. The illustrated host-facing transceivers 220 support eight lanes (400GAUI-8) for bidirectional communication with the host device. In other contemplated embodiments, the host-facing transceivers 220 support other data rates and lane configurations. The DRR devices include a memory 224 to provide first-in first-out (FIFO) buffering between the transmitter & receiver sets 220, 222. An embedded controller 228 coordinates the operation of the transmitters and receivers by, e.g., setting initial equalization parameters and ensuring the training phase is complete across all lanes and links before enabling the transmitters and receivers to enter the data transfer phase. The embedded controller 228 employs a set of registers 208 to receive commands and parameter values, and to provide responses potentially including status information and performance data.


The illustrative cable of FIG. 2 may be a part of a point-to-point communications link between two host devices 302, 304 as shown in the architectural diagram of FIG. 3. FIG. 3 shows the architecture using the ISO/IEC Model for Open Systems Interconnection (see ISO/IEC 7498-1:1994) for communications via channels 306 over a physical medium such as optical fibers. The interconnection reference model employs a hierarchy of layers with defined functions and interfaces to facilitate the design and implementation of compatible systems by different teams or vendors. While it is not a requirement, it is expected that the higher layers in the hierarchy will be implemented primarily by software or firmware operating on programmable processors while the lower layers may be implemented using micro-code and/or application-specific hardware.


The Application Layer 308 is the uppermost layer in the model, and it represents the user applications or other software operating on different systems that need a facility for communicating messages or data. The Presentation Layer 310 provides such applications with a set of application programming interfaces (APIs) that provide formal syntax along with services for data transformations (e.g., compression), establishing communication sessions, connectionless communication mode, and negotiation to enable the application software to identify the available service options and select therefrom. The Session Layer 312 provides services for coordinating data exchange including: session synchronization, token management, full-or half-duplex mode implementation, and establishing, managing, and releasing a session connection. In connectionless mode, the Session Layer may merely map between session addresses and transport addresses.


The Transport Layer 314 provides services for multiplexing, end-to-end sequence control, error detection, segmenting, blocking, concatenation, flow control on individual connections (including suspend/resume), and implementing end-to-end service quality specifications. The focus of the Transport Layer 314 is end-to-end performance/behavior. The Network Layer 316 provides a routing service, determining the links used to make the end-to-end connection and when necessary acting as a relay service to couple together such links. The Data Link Layer 318 serves as the interface to physical connections, providing delimiting, synchronization, and sequence and flow control across the physical connection. It may also detect and optionally correct errors that occur across the physical connection. The Physical Layer 322 provides the mechanical, electrical, functional, and procedural means to activate, maintain, and deactivate channels 306, and to use the channels 306 for transmission of bits across the physical media.


The Data Link Layer 318 and Physical Layer 322 are subdivided and modified slightly by IEEE Std 802.3-2015, which provides a Media Access Control (MAC) Sublayer 320 in the Data Link Layer 318 to define the interface with the Physical Layer 322, including a frame structure and transfer syntax. Within the Physical Layer 322, the standard provides a variety of possible subdivisions such as the one illustrated in FIG. 3, which includes an optional Reconciliation Sublayer 324, a Physical Coding Sublayer (PCS) 326, a Forward Error Correction (FEC) Sublayer 328, a Physical Media Attachment (PMA) Sublayer 330, and a Physical Medium Dependent (PMD) Sublayer 332.


The optional Reconciliation Sublayer 324 merely maps between interfaces defined for the MAC Sublayer 320 and the PCS Sublayer 326. The PCS Sublayer 326 provides alignment marker insertion/removal, FEC, and framing with synchronization and training sequences. The PMA Sublayer 330 provides symbol encoding/decoding, filtering, and conversion between analog and digital signal formats. The PMD Sublayer 332 specifies the optical transceiver conversions between transmitted/received channel signals and the corresponding electrical signals. A receptacle 336 is also shown as part of the PMD sublayer 332 to represent the physical network interface port.


The connectors 100, 101, have plugs 200, 201 representing edge connectors that mate with the receptacles 336 of the two host devices 302, 304. Within each connector, the DRR devices may implement a host-facing Physical Layer 322A, a center-facing Physical Layer 322B, and a Data Link Layer 340 that bridges together the two Physical Layers. In some embodiments, one or more of the internal sublayers within each connector (e.g., PCS, Reconciliation, MAC) are bypassed or omitted entirely to reduce areal requirements and/or to reduce power. More information regarding the operation of the sublayers, as well as the electrical and physical specifications of the connections between the nodes and the communications medium (e.g., pin layouts, line impedances, signal voltages & timing), and the physical specifications for the communications medium itself (e.g., limitations on attenuation, dispersion), can in many cases be found in the current standard, and any such details should be considered to be well within the knowledge of those having ordinary skill in the art.



FIG. 4 provides a more detailed block diagram of illustrative transmit and receive chains in the PCS sublayer. The PCS sublayer transmit chain in FIG. 4 accepts a 400GMII PCS data stream. Pursuant to the standard, the PCS data stream is already encoded with a transmission code that provides DC balance and enables timing recovery. A transcoder module 402 modifies the transmission code from a 64b/66b code to a 256b/257b code more appropriate for use with the FEC strategy. The receive chain includes a reverse transcoder module 403 to reverse this operation.


Generic mapping procedure (GMP) modules 404, 405 each provide a transition between the local system clock domain and the line clock domain, typically providing word padding to accommodate mismatches in clock rates. An alignment marker insertion module 406 provides alignment markers that enable alignment between different lanes of the data stream. Detector module 407 detects and removes the alignment markers. Note that these modules may cooperate with alignment marker insertion/removal modules in other sublayers to preserve alignment marker content across the transcoding process.


A Cyclic Redundancy Check calculation module 408 adds checksum information to the data stream, which is verified on the receiving side by a check module 409 to detect data corruption. An error decorrelation interleaver module 410 redistributes symbols of the data stream, an operation that is reversed by de-interleaver module 411 on the receive side. Staircase encoder module 412 implements a staircase code having BCH component codes. Staircase decoder module 413 reverses this operation as described further below.


Module 414 appends pad bits to align the staircase code words with 400ZR frame boundaries, and applies a predefined scrambling mask. Module 415 reverses these operations on the receive side. A convolutional interleaver 416 disperses bits from the code words to increase resilience to burst errors. De-interleaver module 417 reverses this operation on the receive side.


Hamming encoder module 418 applies a double-extended Hamming Code SD-FEC (128,119) to provide additional net coding gain. Module 419 performs the decoding operation on the receive side. Symbol mapper module 420 maps the coded data stream eight bits at a time to dual polarization 16-point quadrature amplitude modulation (DP-16QAM) symbols, which are then time interleaved by symbol interleaver module 422. The receive chain includes corresponding modules 421, 423 to reverse these operations.


To aid with timing synchronization and clock recovery, module 424 provides a frame alignment word (FAW) at the beginning of each super-frame, which includes 49 subframes each having a training sequence provided by module 424. Module 424 further inserts a pilot symbol every 32 symbols. In the receive chain, module 425 uses the FAW to detect the alignment of super-frames before removing the pilot symbols, training sequences, and FAWs.


Modules 426, 427 represent the PMA sublayer, in which digital signal processing may be performed for spectrum shaping and signal equalization and conversion between digital and analog signal domains. On the receive side, module 427 performs clock recovery as part of the analog-to-digital conversion. Modules 428, 429 represent the PMD sublayer, in which the optical transceivers convert electrical transmit signals to optical signals in the fiber and convert received optical signals to electrical receive signals on the receive side.


The operation of decoder module 413 is illustrated by the block diagram in FIG. 5. A memory 502 receives an encoded input stream conceptually organized as a stairstep arrangement of blocks Bi within a decoding window 504. Each block has a left side having data bits and a right side having parity bits for code words 508, 510 that extend across the data bits and across the preceding data block. A parallel set of BCH decoders 506 alternately operate on vertical code words 508 and horizontal code words 510, iterating forward and backward through the blocks in the decoding window 504. In theory the iteration may be repeated until no further error corrections are made. In practice, a predetermined number of iterations are performed before the decoding window 504 is shifted.


Memory 502 may be organized as bytes or words, causing the memory locations to be accessed in an easy-to-read fashion during initial processing of the block but to be accessed in a more distributed fashion during the subsequent processing of that block. To facilitate subsequent processing, the decoders 506 may form transpositions of each block during the initial processing, storing the transposed blocks in a set of transpose memory blocks 512. Further implementation details can be found in the literature, including, e.g., D. Truhachev et al., “Efficient Implementation of 400 Gbps Optical Communication FEC”, IEEE Trans. Circuits & Systems-I, V68n1, Jan. 2021.


It is desirable for each of the BCH decoders 506 to be implemented as efficiently as possible to facilitate high throughput with minimal power consumption.



FIG. 6 is a block diagram of an illustrative BCH decoder which is conceptually implemented in three parts. A syndrome calculator 602 operates on the received message R to obtain syndrome values S. From these syndrome values, a location finder 604 determines the number of errors and the location i of each error (or determines that the number of errors exceeds the correction capability of the code). Using the error locations, an error corrector 606 modifies the received message R to produce a corrected message R̂.
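The three-part structure can be illustrated with a toy single-error-correcting code over GF(2^4). The field, the length-15 code, and the names below are illustrative choices standing in for the much larger BCH(1022,990,8) machinery.

```python
# Toy analog of the three-part decoder: syndrome calculator, location
# finder, error corrector. Built over GF(2^4) with x^4 + x + 1; for a
# single error at bit index e, the syndrome S1 = r(alpha) equals alpha^e.

P16 = 0b10011                        # x^4 + x + 1
EXP, LOG = [0] * 15, [0] * 16
x = 1
for i in range(15):
    EXP[i] = x                       # EXP[i] = alpha^i
    LOG[x] = i
    x <<= 1
    if x & 0x10:
        x ^= P16

def syndrome(r):
    """Syndrome calculator: S1 = XOR of alpha^i over the set bits of r."""
    s = 0
    for i in range(15):
        if (r >> i) & 1:
            s ^= EXP[i]
    return s

def locate(s):
    """Location finder: S1 = alpha^e for one error, so e = log(S1)."""
    return None if s == 0 else LOG[s]

def correct(r):
    """Error corrector: flip the bit at the located index."""
    e = locate(syndrome(r))
    return r if e is None else r ^ (1 << e)

codeword = 0b10011                   # g(x) itself is a code word: g(alpha)=0
received = codeword ^ (1 << 9)       # inject a single bit error at index 9
assert syndrome(codeword) == 0
assert correct(received) == codeword
```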


Conventional implementations of the location finder 604 employ iterative procedures for calculating and factoring the error location polynomial, e.g., Berlekamp-Massey algorithm with a Chien search. It is challenging to accommodate such iteration when decoding high bandwidth data streams. The enhanced decoding method disclosed below may offer at least four potential advantages over existing decoder implementations: (1) the properties of the trace function are exploited for a fast, non-LUT based, determination whether the error polynomial has any roots; (2) the error polynomial factorization operations are simplified through the use of a composite field; (3) the basis conversion to the composite field is precomputed and thus incurs no overhead; and (4) the LUT for finding the cubic roots is made much smaller.


Before describing the enhanced decoding method and decoder of FIG. 8, we explain the BCH code provided in the standard and the algebraic techniques by which the enhanced decoding method is derived. These techniques are applicable to all BCH and RS codes, and are particularly suited to such codes having a low correction power, e.g., t<4 or d<9.


The underlying field of the (1022, 990) extended BCH code used in the ITU-T G.709.2/Y.1331.2 standard is GF(2^10), which can be constructed by the primitive polynomial p(x)=x^10+x^3+1. Let α be the root of the primitive polynomial p(x). The non-zero field elements of GF(2^10) can be represented as α^i, 0≤i≤1022, which we refer to as the “power” representation, where α^1023=α^0=1. A consequence of the field's construction is that its elements can be expressed as a weighted sum of the first ten powers, i.e., we can write α^i=b9·α^9+b8·α^8+ . . . +b0, 0≤i≤1022, where the coefficients are binary. For indexing convenience, we refer to the integer l=b9·2^9+b8·2^8+ . . . +b0 as the “binary” representation of α^i. We define log(l)=i.
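The power and binary representations can be tabulated directly; a minimal sketch assuming the standard's p(x)=x^10+x^3+1 (the table names are ours):

```python
# Build the power <-> binary representation tables for GF(2^10)
# constructed from p(x) = x^10 + x^3 + 1.

P = (1 << 10) | (1 << 3) | 1         # binary representation of p(x)

EXP = [0] * 1023                     # EXP[i] = binary representation of alpha^i
LOG = [0] * 1024                     # LOG[l] = i, i.e., log(l) = i
l = 1
for i in range(1023):
    EXP[i] = l
    LOG[l] = i
    l <<= 1                          # multiply by alpha
    if l & (1 << 10):
        l ^= P                       # reduce modulo p(x)

assert l == 1                        # alpha^1023 = alpha^0 = 1
assert len(set(EXP)) == 1023         # alpha is primitive: all non-zero elements
assert LOG[8] == 3                   # alpha^3 has binary representation 2^3 = 8
```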


BCH encoding is accomplished using a generator matrix defined in terms of these field elements. To enable systematic encoding (where the code words include the original data word concatenated with a set of parity symbols), the generator matrix columns may be permuted to make a portion of the generator matrix resemble an identity matrix. The standard achieves this result by defining a permutation function Πd on the integers i,0≤i≤509. In the following, Πd(M:M+N)=K:K+N is a shorthand for Πd(M)=K,Πd(M+1)=K+1, . . . ,Πd(M+N)=K+N. Values of Πd are specified via the following Table 1.









TABLE 1

Values of Πd

Πd(0:7) = 478:485      Πd(8) = 0              Πd(9:11) = 486:488
Πd(12) = 1             Πd(13) = 489           Πd(14:16) = 2:4
Πd(17:19) = 490:492    Πd(20) = 5             Πd(21) = 493
Πd(22:24) = 6:8        Πd(25) = 494           Πd(26:32) = 9:15
Πd(33:35) = 495:497    Πd(36) = 16            Πd(37) = 498
Πd(38:40) = 17:19      Πd(41) = 499           Πd(42:48) = 20:26
Πd(49) = 500           Πd(50:64) = 27:41      Πd(65:67) = 501:503
Πd(68) = 42            Πd(69) = 504           Πd(70:72) = 43:45
Πd(73) = 505           Πd(74:80) = 46:52      Πd(81) = 506
Πd(82:128) = 53:99     Πd(129) = 507          Πd(130) = 100
Πd(131) = 508          Πd(132:256) = 101:225  Πd(257) = 509
Πd(258:509) = 226:477
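Table 1's shorthand can be expanded and sanity-checked mechanically; a sketch with (M, K, count) triples transcribed from the table above, verifying that Πd is indeed a permutation of 0..509:

```python
# Expand Table 1: each triple (M, K, count) means Pi_d(M + j) = K + j
# for 0 <= j < count. Triples transcribed from the table above.

RANGES = [
    (0, 478, 8), (8, 0, 1), (9, 486, 3), (12, 1, 1), (13, 489, 1),
    (14, 2, 3), (17, 490, 3), (20, 5, 1), (21, 493, 1), (22, 6, 3),
    (25, 494, 1), (26, 9, 7), (33, 495, 3), (36, 16, 1), (37, 498, 1),
    (38, 17, 3), (41, 499, 1), (42, 20, 7), (49, 500, 1), (50, 27, 15),
    (65, 501, 3), (68, 42, 1), (69, 504, 1), (70, 43, 3), (73, 505, 1),
    (74, 46, 7), (81, 506, 1), (82, 53, 47), (129, 507, 1), (130, 100, 1),
    (131, 508, 1), (132, 101, 125), (257, 509, 1), (258, 226, 252),
]

pi_d = {}
for m, k, count in RANGES:
    for j in range(count):
        pi_d[m + j] = k + j

# Pi_d is a permutation on the integers 0..509
assert sorted(pi_d) == list(range(510))
assert sorted(pi_d.values()) == list(range(510))
assert pi_d[8] == 0 and pi_d[257] == 509
```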
Consider the function f(i) which maps an integer i, 1≤i≤1023, to the column vector

f(i) = [ β_i   β_i^3   β_i^5   F(β_i)   F̄(β_i) ]^T
where β_i=α^log(i), and F(β_i) is a Boolean combination (an OR of ANDs) of the bits b2, b1, b0 of l and their complements, for l the binary representation of β_i, with x̄ denoting the complementation of x. Then the generator matrix can be obtained in the following way. First assemble the column vectors as:






H = [ f(1021)  f(1022)  f(1)  …  f(510)  f(511+Π^(-1)(0))  …  f(511+Π^(-1)(509)) ]
Then replace each element in the first three rows of H by its corresponding 10-bit binary representation l and perform elementary row operations (over GF(2)) on H to obtain its row-reduced echelon form with the identity matrix on the right, HENC=[P^T; I]. The generator matrix is then G=[I; P], where the resulting 990×32 matrix P provides the encoder's parity-generating matrix.


The encoder may be implemented as a multiplication of a 1×990 vector (the 990-bit message) by a 990×1022 matrix (the generator matrix). The matrix element multiplications and additions are performed over GF(2), where AND gates and XOR gates respectively correspond to multipliers and adders. The encoder can thus be implemented as an AND-XOR gate array.
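The AND-XOR gate array maps naturally to software: each output bit is the XOR of the matrix-row bits selected (ANDed) by the corresponding message bit. A sketch with a hypothetical 3×5 toy matrix in place of the 990×1022 generator matrix:

```python
# GF(2) vector-matrix product: AND gates gate each matrix row by the
# message bit, XOR gates accumulate the sums. The 3x5 matrix is a toy.

def gf2_vec_mat(v, M):
    """Multiply a 1xk bit vector v by a kxn bit matrix M over GF(2)."""
    out = [0] * len(M[0])
    for bit, row in zip(v, M):
        if bit:                                     # AND: row contributes only if bit = 1
            out = [o ^ r for o, r in zip(out, row)]  # XOR: additions over GF(2)
    return out

M = [[1, 0, 1, 1, 0],
     [0, 1, 1, 0, 1],
     [1, 1, 0, 1, 1]]
assert gf2_vec_mat([1, 0, 1], M) == [0, 1, 1, 0, 1]
```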


The multiplicative inversion of GF(2^10) via a polynomial basis constructed via p(x)=x^10+x^3+1 is complicated. If it is implemented via a look-up table (LUT), the depth of the LUT will be 1023, which can degrade the critical path. However, GF(2^10) is isomorphic to the composite field GF((2^5)^2), enabling us to exploit the composite field to simplify the multiplicative inversion. Additionally, the composite field simplifies the solving of quadratic equations, since operations in the composite field GF((2^5)^2) can be decomposed into several operations in the sub-field GF(2^5).


To use the composite field in the (1022, 990) extended BCH encoder/decoder of the ITU-T standard, we must consider basis conversion. We note that the generator polynomial of the (1022, 990) BCH code can be the same for both GF(2^10) in the polynomial basis constructed via p(x)=x^10+x^3+1 and the composite field, because the minimal polynomials of α, α^3, and α^5 are the same, i.e., they do not depend on which underlying field construction is chosen.


Hence, (1022, 990) BCH encoding with the composite field representation is the same as the one with the polynomial representation defined in the ITU-T standard. For the decoder, however, we do not use the H matrix defined in the standard directly. Rather, we convert the H matrix to Hcom in the composite field. The conversion can be performed as follows:

    • The last two rows of Hcom are the same as the ones in H.
    • Each column in H excluding the last two rows can be treated as 3 field elements in GF(210), namely βi, βi3, and βi5 in polynomial basis representation. Convert βi, βi3, and βi5 to composite field representation for Hcom.


The conversion between H and Hcom is pre-computed, i.e., no extra hardware cost is needed. The decoding, including syndrome calculation, error location polynomial finding, and error location polynomial factorization can be done over the composite field GF((25)2). Next, we will introduce the composite field construction.


Let's first review the algebraic structure of a composite field GF(2mn). GF(2mn) can be constructed via an irreducible polynomial of degree m over GF(2n), where GF(2n) is called the ground field which in turn can be constructed via an irreducible polynomial of degree of n over GF(2). When gcd(m,n)=1, we can use two irreducible polynomials over GF(2) to construct the composite field. For GF((25)2), we can choose f5(x)=x5+x2+1 to construct the sub-field GF(25) over GF(2) and f2(x)=x2+x+1 to construct GF((25)2) over GF(25). FIG. 7 shows the relationships between the various fields.


Though it is not the case here, we note that if n is 2 and m is an even number, we can't use the irreducible polynomial f2(x)=x2+x+1 to construct GF((2m)2) over GF(2m). Instead, we could find an irreducible polynomial in the form f2(x)=x2+x+θi, where θ is the primitive element in GF(2m), by computing the trace value of Tr(θi) for various values of i. If Tr(θi)=1, f2(x)=x2+x+θi is irreducible over GF(2m), otherwise it is reducible. We can choose the irreducible polynomial with the lowest Hamming weight value of θi to construct GF((2m)2) over GF(2m).


Because we are contemplating performing our operations over GF(25), we note the existence of efficient solutions for performing generic multiplication of low Hamming weight polynomials, such as trinomials or pentanomials, over GF(2n). The SystemVerilog function is:














// N: polynomial degree
// k1,k2,k3: For pentanomial f(x) = x^N + x^k1 + x^k2 + x^k3 + 1
//           For trinomial f(x) = x^N + x^k1 + 1, k2 and k3 are zeros
function logic [N-1:0] gf2n_mul;
 input logic [N-1:0] a;
 input logic [N-1:0] b;
 logic [2*N-2:0] c;
 c = '0;
 // polynomial multiplication
 for (int ix=0; ix<N; ix++)
  for (int jx=0; jx<N; jx++)
   c[ix+jx] = c[ix+jx] ^ (a[ix] & b[jx]);
 // reduction
 for (int ix=2*N-2; ix>=N; ix--) begin
  c[ix-N] = c[ix-N] ^ c[ix];
  c[ix-N+k1] = c[ix-N+k1] ^ c[ix]; // for both trinomial and pentanomial
  if (k2!=0 && k3!=0) begin        // for pentanomial only
   c[ix-N+k2] = c[ix-N+k2] ^ c[ix];
   c[ix-N+k3] = c[ix-N+k3] ^ c[ix];
  end
 end
 return c[N-1:0];
endfunction









Accordingly, multiplication of such polynomials in GF(25) can be implemented via an arrangement of logical AND ("&") and exclusive-OR ("^") gates pursuant to the above method with N=5.
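By way of illustration only (this Python model is ours and is not part of the claimed hardware), the same multiply-and-reduce method can be exercised in software, with bit i of each operand holding the coefficient of x^i:

```python
def gf2n_mul(a: int, b: int, n: int, k1: int, k2: int = 0, k3: int = 0) -> int:
    """Multiply a and b in GF(2^n) with f(x) = x^n + x^k1 (+ x^k2 + x^k3) + 1.

    Mirrors the SystemVerilog gf2n_mul function above: carry-less
    multiplication followed by reduction against the low-weight modulus.
    """
    # Carry-less polynomial multiplication over GF(2).
    c = 0
    for i in range(n):
        if (a >> i) & 1:
            c ^= b << i
    # Reduction: fold each bit above x^(n-1) back using
    # x^n = x^k1 (+ x^k2 + x^k3) + 1.
    for ix in range(2 * n - 2, n - 1, -1):
        if (c >> ix) & 1:
            c ^= 1 << (ix - n)           # the "+1" term
            c ^= 1 << (ix - n + k1)      # trinomial/pentanomial term
            if k2 and k3:                # pentanomial-only terms
                c ^= 1 << (ix - n + k2)
                c ^= 1 << (ix - n + k3)
    return c & ((1 << n) - 1)

# GF(2^5) with the trinomial f5(x) = x^5 + x^2 + 1 used in this design:
gf25_mul = lambda a, b: gf2n_mul(a, b, 5, 2)
```

The same routine also handles pentanomials, e.g., the AES field GF(2^8) with f(x)=x^8+x^4+x^3+x+1 (k1=4, k2=3, k3=1).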


Now we introduce multiplication in the composite field. Let β be the root of the irreducible polynomial f2(x)=x2+x+1, i.e., β2=β+1. Let a,b,c ∈GF((25)2), c=a·b, where a=a1β+a0, b=b1β+b0, c=c1β+c0, and a0, a1, b0, b1, c0, and c1∈GF(25). We can derive the formula for computing coefficients c0 and c1:









c = c1β + c0 = (a1β + a0)(b1β + b0) = a1b1β2 + (a0b1 + a1b0)β + a0b0  (1)
c = (a1b1 + a0b1 + a1b0)β + (a1b1 + a0b0)






By Karatsuba's method,










c = ((a0 + a1)(b0 + b1) + a0b0)β + (a1b1 + a0b0),  (2)
i.e., c1 = (a0 + a1)(b0 + b1) + a0b0, and c0 = a1b1 + a0b0.
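As an illustrative software check (ours; the packing of a1 into the upper five bits of a 10-bit word is our assumption, not mandated by the standard), Equation 2's three-multiply Karatsuba form can be exercised as follows:

```python
def gf25_mul(a, b):
    # GF(2^5) multiply with f5(x) = x^5 + x^2 + 1
    c = 0
    for i in range(5):
        if (a >> i) & 1:
            c ^= b << i
    for ix in range(8, 4, -1):
        if (c >> ix) & 1:
            c ^= (1 << ix) ^ (1 << (ix - 5)) ^ (1 << (ix - 3))
    return c

def gfc_mul(a, b):
    """GF((2^5)^2) product via Equation 2 (three GF(2^5) multiplies).

    Packing assumption (ours): a = a1*beta + a0 is stored as (a1 << 5) | a0,
    with beta^2 = beta + 1.
    """
    a1, a0 = a >> 5, a & 31
    b1, b0 = b >> 5, b & 31
    cross = gf25_mul(a0 ^ a1, b0 ^ b1)   # (a0+a1)(b0+b1)
    p00 = gf25_mul(a0, b0)               # a0*b0
    p11 = gf25_mul(a1, b1)               # a1*b1
    c1 = cross ^ p00                     # coefficient of beta
    c0 = p11 ^ p00                       # constant coefficient
    return (c1 << 5) | c0
```

In this packing beta is the value 32, and one can observe beta^2 = beta + 1 = 33 and beta^3 = 1 directly.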







For GF(2n) with a polynomial basis representation, we can use the following formula to compute the square of a field element. Let a∈GF(2n), and a=Σi=0n−1 aiαi, where ai∈GF(2), and α is the polynomial basis generator of GF(2n).






c = a2 = Σi=0…n−1 ai·α^(2i)









Next, we use the field generating polynomial to perform reduction. Consider GF(25) constructed via f5(x)=x5+x2+1. Let a=a4α4+a3α3+a2α2+a1α+a0, and c=a2=a4α8+a3α6+a2α4+a1α2+a0. By α8=α5·α3=(α2+1)·α3=α5+α3=α3+α2+1 and α6=α3+α, we have,






c = a2α4 + (a4 + a3)α3 + (a4 + a1)α2 + a3α + (a4 + a0)
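This closed form shows squaring costs only rewiring plus three XOR gates. A software sketch (ours, for checking only) that compares the formula against a generic GF(2^5) multiply:

```python
def gf25_mul(a, b):
    # GF(2^5) multiply with f5(x) = x^5 + x^2 + 1
    c = 0
    for i in range(5):
        if (a >> i) & 1:
            c ^= b << i
    for ix in range(8, 4, -1):
        if (c >> ix) & 1:
            c ^= (1 << ix) ^ (1 << (ix - 5)) ^ (1 << (ix - 3))
    return c

def gf25_square(a):
    """Square in GF(2^5) via the closed form above: wiring plus 3 XORs."""
    a0, a1, a2, a3, a4 = ((a >> i) & 1 for i in range(5))
    return ((a4 ^ a0)            # constant term
            | (a3 << 1)          # alpha
            | ((a4 ^ a1) << 2)   # alpha^2
            | ((a4 ^ a3) << 3)   # alpha^3
            | (a2 << 4))         # alpha^4
```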






For the square of a composite field element,









c = c1β + c0 = a2 = (a1β + a0)2 = a12β + (a02 + a12)  (3)
where c, a ∈ GF((25)2), and c0, c1, a0, and a1 ∈ GF(25).





Turning now to square root operations in the composite field, we first review the algorithm for computing the square root of a non-zero element in GF(25). Let a,c∈GF(25) and c=a2. We use the vector forms [a0, a1, a2, a3, a4] and [c0, c1, c2, c3, c4] to represent a and c, where a0~a4 and c0~c4∈GF(2). The square in GF(25) can be treated as a 1×5 vector multiplied by a 5×5 matrix, i.e., [c0, c1, c2, c3, c4]=[a0, a1, a2, a3, a4]·A, where






A =
[ 1 0 0 0 0
  0 0 1 0 0
  0 0 0 0 1
  0 1 0 1 0
  1 0 1 1 0 ].





Any element has only one square root, i.e., matrix A is invertible to








A−1 =
[ 1 0 0 0 0
  1 1 0 1 1
  0 1 0 0 0
  1 1 0 0 1
  0 0 1 0 0 ],




by which we have







[a0, a1, a2, a3, a4] = [c0, c1, c2, c3, c4]·A−1
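The vector-matrix product over GF(2) amounts to a small XOR network. A software sketch (ours; bit i of the operand holds coefficient ai) that checks A−1 against a generic squaring:

```python
def gf25_mul(a, b):
    # GF(2^5) multiply with f5(x) = x^5 + x^2 + 1
    c = 0
    for i in range(5):
        if (a >> i) & 1:
            c ^= b << i
    for ix in range(8, 4, -1):
        if (c >> ix) & 1:
            c ^= (1 << ix) ^ (1 << (ix - 5)) ^ (1 << (ix - 3))
    return c

# Rows of A^-1 from the text; entry [i][j] multiplies c_i into a_j.
A_INV = [
    [1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
]

def gf25_sqrt(c):
    """Square root in GF(2^5) as the GF(2) product [c0..c4] * A^-1."""
    a = 0
    for j in range(5):
        bit = 0
        for i in range(5):
            bit ^= ((c >> i) & 1) & A_INV[i][j]
        a |= bit << j
    return a
```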







Now let's derive the formula for computing the square root of a non-zero element in GF((25)2). By Equation 3, we have:











a1 = √c1, a0 = √(c0 + c1)  (4)
where c0, c1, a0, and a1 ∈ GF(25).





We turn now to showing how to perform multiplicative inversion in the composite field. In GF(2n), any non-zero field element can be represented as a power of the primitive element, namely a = αi. By α^(2^n−1) = 1, a−1 = α^(2^n−1−i). Hence a 31-depth LUT can be used to implement the multiplicative inversion in GF(25). For multiplicative inversion in the composite field GF((25)2), let a, b∈GF((25)2), a=a1β+a0, b=b1β+b0, where a0, a1, b0, and b1∈GF(25); a0 and a1 cannot both be zero. By a·b=1, we get two linear equations for b0 and b1:












a1b0 + (a0 + a1)b1 = 0  (5)
a0b0 + a1b1 = 1  (6)








By solving Equations 5 & 6, we have:










b0 = (a02 + a0a1 + a12)−1·(a0 + a1)  (7)
b1 = (a02 + a0a1 + a12)−1·a1  (8)







Thus, composite field GF((25)2) inversion can be implemented via a set of GF(25) multipliers, squarers, and one GF(25) inverter (which can be implemented via 31-depth LUT).
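Equations 7 and 8 can be checked in software (our sketch; the GF(2^5) inverter below is a brute-force stand-in for the depth-31 LUT, and the bit packing of a1 into the upper five bits is our assumption):

```python
def gf25_mul(a, b):
    # GF(2^5) multiply with f5(x) = x^5 + x^2 + 1
    c = 0
    for i in range(5):
        if (a >> i) & 1:
            c ^= b << i
    for ix in range(8, 4, -1):
        if (c >> ix) & 1:
            c ^= (1 << ix) ^ (1 << (ix - 5)) ^ (1 << (ix - 3))
    return c

def gf25_inv(a):
    """GF(2^5) inverse; models the depth-31 LUT by exhaustive search."""
    for b in range(1, 32):
        if gf25_mul(a, b) == 1:
            return b

def gfc_mul(a, b):
    # GF((2^5)^2) multiply per Equation 2, a = (a1 << 5) | a0
    a1, a0 = a >> 5, a & 31
    b1, b0 = b >> 5, b & 31
    cross = gf25_mul(a0 ^ a1, b0 ^ b1)
    p00 = gf25_mul(a0, b0)
    p11 = gf25_mul(a1, b1)
    return ((cross ^ p00) << 5) | (p11 ^ p00)

def gfc_inv(a):
    """GF((2^5)^2) inverse per Equations 7 and 8."""
    a1, a0 = a >> 5, a & 31
    d = gf25_mul(a0, a0) ^ gf25_mul(a0, a1) ^ gf25_mul(a1, a1)
    di = gf25_inv(d)                  # one sub-field inversion
    b0 = gf25_mul(di, a0 ^ a1)        # Equation 7
    b1 = gf25_mul(di, a1)             # Equation 8
    return (b1 << 5) | b0
```

Note that d = a02+a0a1+a12 is nonzero whenever a ≠ 0, since x2+x+1 has no root in GF(25).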


With the foregoing groundwork done, we can efficiently determine the error location polynomial. And rather than relying on the Chien search algorithm to factor the polynomial, which would impose a latency proportional to the code size, we note that the correction power of the (1022, 990) extended BCH code is 3, limiting the degree of the error location polynomial to three. Fast factorization can be achieved by solving the cubic, quadratic, or linear polynomial equations in the composite field.


We start with solving quadratic equations. The equation x2=a, where a∈GF(2n), always has a unique (double) root in GF(2n). For a receive message having two errors, the error locator polynomial will take the form x2+ax+b=0, where a,b∈GF(2n), a≠0. This form can be converted to the following equation by replacing x=a·t











t2 + t + b/a2 = 0  (9)







Let s = b/a2.





Equation 9 has two roots, r and r+1, if and only if TrGF(2n)|GF(2)(s)=0, where TrGF(2n)|GF(2)(s) is the trace function mapping s from GF(2n) to GF(2) as follows:











TrGF(2n)|GF(2)(s) = Σi=0…n−1 s^(2^i)  (10)







Before showing how to solve Equation 9 in the composite field, we examine the solution in GF(210), which is constructed via p(x)=x10+x3+1. The GF(210) element s can be represented via the polynomial basis {1, α, α2, . . . , α9} as s=Σi=09 siαi, where α is the root of p(x) and si∈GF(2). It can be shown that







TrGF(210)|GF(2)(αi)




is zero for all i≠7, meaning that











TrGF(210)|GF(2)(s) = s7.  (11)







If there exist roots in GF(210) for the equation x2+x+s=0, then TrGF(210)|GF(2)(s)=0. By applying the trace mapping on both sides of the equation (abbreviating TrGF(210)|GF(2) as Tr): Tr(x2+x+s)=Tr(x2)+Tr(x)+Tr(s)=s7=0, since Tr(x2)=Tr(x). For a GF(210) element in standard polynomial basis representation, we need only check the coefficient s7 to determine whether the quadratic equation has roots in GF(210). (As an aside, we note that this technique is applicable to all binary fields GF(2n)).


When the coefficient is zero, let x1 be one root. Then the other root is x2=x1+1. By the Hilbert constructive method:










x1 = s·θ^2 + (s + s^2)·θ^(2^2) + (s + s^2 + s^(2^2))·θ^(2^3) + … + (s + s^2 + … + s^(2^8))·θ^(2^9)  (12)









where θ ∈ GF(210) and Tr(θ) = 1. By choosing θ = α7:
x1 = s·α^(7·2) + (s + s^2)·α^(7·2^2) + (s + s^2 + s^(2^2))·α^(7·2^3) + … + (s + s^2 + … + s^(2^8))·α^(7·2^9)  (13)









The α^(7·2^i) terms are constants that can be precomputed, the additions are bitwise XOR operations, and the squaring of s can be performed by an XOR array in a fashion similar to the polynomial multiplication Verilog code above. These computations can be expressed as a multiplication of the coefficient vector s with a 10×10 matrix whose entries are GF(2) elements.
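To illustrate Equation 13 (a Python sketch of ours; the hardware realizes the same map as a fixed XOR array), one can iterate the partial sums s + s2 + … alongside the precomputed constants θ^(2^i):

```python
def gf1024_mul(a, b):
    # GF(2^10) in the polynomial basis, p(x) = x^10 + x^3 + 1
    c = 0
    for i in range(10):
        if (a >> i) & 1:
            c ^= b << i
    for ix in range(18, 9, -1):
        if (c >> ix) & 1:
            c ^= (1 << ix) ^ (1 << (ix - 10)) ^ (1 << (ix - 7))
    return c

def quad_root(s):
    """Root x1 of x^2 + x + s = 0 per Equation 13 (theta = alpha^7).

    Valid when the trace coefficient s7 is zero; the other root is x1 + 1.
    """
    theta = 1 << 7                       # alpha^7
    partial, x1 = s, 0                   # partial = s + s^2 + ... + s^(2^(i-1))
    t = theta
    for _ in range(1, 10):
        t = gf1024_mul(t, t)             # theta^(2^i)
        x1 ^= gf1024_mul(partial, t)
        partial = gf1024_mul(partial, partial) ^ s
    return x1
```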


In the case where the underlying field is the composite field GF((25)2), s∈GF((25)2) and it can be represented as s=s1β+s0, where s0,s1∈GF(25). Representing one root as c=c1β+c0, we have the equation, c12β2+c02+c1β+c0=s1β+s0. By replacing β2 with β+1,











c12 + c1 + s1 = 0  (14)
c02 + c0 + c12 + s0 = 0  (15)







Equations 14 and 15 are quadratic equations in the subfield GF(25), making them simpler to solve. Equation 14 either has two different roots in GF(25) or it has no roots in GF(25). If Equation 14 has no roots in GF(25), there will be no roots in the composite field for Equation 9. Assuming the roots exist, one root is r and the other is r+1.


Replace c1 in Equation 15 with r and r+1. One of







TrGF(25)|GF(2)((r+1)2 + s0) and TrGF(25)|GF(2)(r2 + s0) is zero, and the other must be 1. Keep the root satisfying TrGF(25)|GF(2)(c12 + s0) = 0.




Given the selected root c1, we can then find the two roots c0 of Equation 15.


A depth-31 LUT could be used to determine r, and hence r+1, for Equations 14 and 15, but a simpler method is available for solving the quadratic equation x2+x+s=0 in GF(25) (the method is based on Hilbert's Theorem 90).


If the equation x2+x+s=0 has roots in GF(25), then Tr(s)=0. Proof: By applying trace mapping on both sides of the equation,







Tr(x2 + x + s) = Tr(x2) + Tr(x) + Tr(s) = 0





By Tr(x2)=Tr(x), Tr(s)=0. Since Tr(s)=0 and






Tr(s) = s + s^2 + s^(2^2) + s^(2^3) + s^(2^4)







in GF(25), then







r1 = s^2 + s^(2^3)







is one root of the equation, and the other root is







r2 = s^2 + s^(2^3) + 1.





Using the groundwork above, the roots can be found using an arrangement of logic gates. We don't need any LUTs to solve x2+x+s=0 in either GF(25) or GF((25)2). Once these roots are found, the variable substitution for Equation 9 can be reversed to determine the roots of the quadratic error location polynomial. These computations can be expressed as a multiplication of the coefficient vector s with a 10×10 matrix whose entries are GF(2) elements.


Having chosen the irreducible polynomial f5(x)=x5+x2+1 to construct GF(25), the trace calculation to determine whether roots exist is easy. Let α be the root of f5(x). Then the field element a can be represented via the polynomial basis {1, α, α2, α3, α4} as a=Σi=04 aiαi. As Tr(α)=Tr(α2)=Tr(α4)=0 and Tr(1)=Tr(α3)=1, Tr(a)=a0+a3.
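The complete quadratic solution in the composite field — Equation 14, the trace-based root selection, Equation 15, and the GF(25) solver r = s^2 + s^(2^3) — can be sketched as follows (our Python model; the packing s = (s1<<5)|s0 is an assumption):

```python
def gf25_mul(a, b):
    # GF(2^5) multiply with f5(x) = x^5 + x^2 + 1
    c = 0
    for i in range(5):
        if (a >> i) & 1:
            c ^= b << i
    for ix in range(8, 4, -1):
        if (c >> ix) & 1:
            c ^= (1 << ix) ^ (1 << (ix - 5)) ^ (1 << (ix - 3))
    return c

def tr5(a):
    """Trace GF(2^5) -> GF(2); equals a0 + a3 for f5(x) = x^5 + x^2 + 1."""
    return (a ^ (a >> 3)) & 1

def solve5(s):
    """A root r of x^2 + x + s = 0 in GF(2^5) (exists iff tr5(s) == 0)."""
    if tr5(s):
        return None
    s2 = gf25_mul(s, s)
    s4 = gf25_mul(s2, s2)
    return s2 ^ gf25_mul(s4, s4)          # r = s^2 + s^(2^3)

def gfc_mul(a, b):
    # GF((2^5)^2) multiply per Equation 2, a = (a1 << 5) | a0
    a1, a0 = a >> 5, a & 31
    b1, b0 = b >> 5, b & 31
    cross = gf25_mul(a0 ^ a1, b0 ^ b1)
    p00 = gf25_mul(a0, b0)
    p11 = gf25_mul(a1, b1)
    return ((cross ^ p00) << 5) | (p11 ^ p00)

def solve_composite(s):
    """A root c of x^2 + x + s = 0 in GF((2^5)^2), or None if no root exists."""
    s1, s0 = s >> 5, s & 31
    r = solve5(s1)                        # Equation 14: c1^2 + c1 + s1 = 0
    if r is None:
        return None
    # Of {r, r+1}, keep the c1 with tr5(c1^2 + s0) == 0 (exactly one qualifies).
    c1 = r if tr5(gf25_mul(r, r) ^ s0) == 0 else r ^ 1
    c0 = solve5(gf25_mul(c1, c1) ^ s0)    # Equation 15
    return (c1 << 5) | c0
```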


Turning now to factoring cubic equations in the composite field GF((25)2), we only need to consider two types of cubic equations: x3+d=0, where d≠0, and x3+x+d=0. This simplification results from an ability to convert the general cubic equation, x3+ax2+bx+c=0, a≠0, a, b, c∈GF((25)2), to these two types. The conversion is done by variable substitution, replacing x with t+a, yielding t3+(a2+b)t+(ba+c)=0.


If a2+b=0, we have converted the general cubic equation to the first type. Otherwise, when a2+b≠0, replace t with sy, yielding s3y3+(a2+b)sy+(ba+c)=0. Divide this by s3 to get








y3 + ((a2 + b)/s2)·y + (ba + c)/s3 = 0.
Let s = (a2 + b)1/2, such that d = (ba + c)/(a2 + b)3/2.





Then we have the second cubic equation type with y3+y+d=0.


For the first type, either we can find 3 different roots or no roots in GF((25)2). If we find a root r, the other two roots will be rβ and rβ2, where β2+β+1=0 (and hence β3=1). Let r=r1β+r0; then we have










rβ = r1β2 + r0β = (r0 + r1)β + r1  (16)
rβ2 = (r0 + r1)β2 + r1β = r0β + (r0 + r1)  (17)







We need a LUT to store only one root for each value of d, enabling the use of a LUT with a depth of only 341. The other two roots can be calculated via Equations 16 and 17 by an arrangement of logic gates. The logic gate arrangement can then reverse the variable substitution to provide the roots of the original cubic equation.


In GF((25)2), for the second type of cubic equation, x3+x+d=0:

    • There are 512 values of d where the equation has only one root in GF((25)2),
    • 170 values of d where the equation has 3 different roots,
    • 341 values of d where the equation has no roots in GF((25)2), and
    • for d=0, 1 is a double root and 0 is the third root.


      In our case, the error locator polynomial is cubic only where errors exist in three different locations. Accordingly, we only care about the 170 values of d where the cubic equation has 3 different roots. The LUT would need to store only two roots for each of these values, as the third root is the sum of the first two roots. Once the roots are obtained, the variable substitutions are reversed with an arrangement of logic gates to obtain the roots of the original equation.
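The stated root counts can be verified by brute force (our Python sketch, not part of the decoder; the composite-field bit packing is our assumption):

```python
from collections import Counter

def gf25_mul(a, b):
    # GF(2^5) multiply with f5(x) = x^5 + x^2 + 1
    c = 0
    for i in range(5):
        if (a >> i) & 1:
            c ^= b << i
    for ix in range(8, 4, -1):
        if (c >> ix) & 1:
            c ^= (1 << ix) ^ (1 << (ix - 5)) ^ (1 << (ix - 3))
    return c

def gfc_mul(a, b):
    # GF((2^5)^2) multiply per Equation 2, a = (a1 << 5) | a0
    a1, a0 = a >> 5, a & 31
    b1, b0 = b >> 5, b & 31
    cross = gf25_mul(a0 ^ a1, b0 ^ b1)
    p00 = gf25_mul(a0, b0)
    p11 = gf25_mul(a1, b1)
    return ((cross ^ p00) << 5) | (p11 ^ p00)

def cubic_root_profile():
    """Histogram over nonzero d of the number of roots of x^3 + x + d = 0.

    Every x maps to d = x^3 + x, so counting preimages gives the root counts.
    """
    preimages = Counter(gfc_mul(gfc_mul(x, x), x) ^ x for x in range(1024))
    return Counter(preimages.get(d, 0) for d in range(1, 1024))
```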


Having covered techniques for efficiently finding roots of quadratic and cubic equations, we turn to a discussion of the decoder implementation, which begins with the syndrome calculator 602.


Though the composite field GF((25)2) is isomorphic to GF(210), the H matrix defined in the ITU-T standard is not suitable for performing the syndrome calculation directly. Instead, we should use Hcomp for syndrome calculation. To pre-compute Hcomp, we find the roots of the primitive polynomial p(x)=x10+x3+1 defined in the ITU-T standard in the composite field GF((25)2), which can be done by a brute-force search using the composite field multiplication defined above. There are 10 conjugate roots. For simplicity, we take the root with the lowest Hamming weight to be the selected α, the root of p(x). In Verilog notation, it is 10'h0C1. With the composite field representation of α, we can compute αi, α3i, and α5i for each column of Hcomp. (The last two rows of Hcomp are the same as those of the ITU-T standard's H matrix.) Then we can calculate the syndrome of the received word via the following equation,









s = Hcomp·rT  (18)







where the vector r=[r0, r1, . . . , r1021] denotes the binary bitstream of the received word. We can choose either a partial parallel architecture or a full parallel architecture to implement it. As an example of a partial parallel architecture, we can scan 256 bits each cycle (254 bits for the last cycle) over 4 cycles, accumulating the partial sum in each cycle into the syndrome register. For the full parallel architecture, the syndrome calculator takes 1022-bit inputs. If it is difficult to complete such a big XOR summation within one cycle, we can insert 2 or 3 pipeline stage registers to shorten the critical path.


As (1022, 990) extended BCH code's error correction capability is only 3, the location finder 604 need not use the Berlekamp-Massey algorithm to find the error location polynomial and need not use the Chien search algorithm to factor the error location polynomial. Rather, the error location polynomial can be calculated based on the syndrome. Once we find the error location polynomial, we can solve the linear, quadratic, or cubic equations as provided above to factor the error location polynomial and thereby determine the error locations.


The syndrome calculator 602 feeds a 32-bit syndrome vector to the location finder 604. The first 30 bits are three GF((25)2) evaluation results of the received polynomial at αi, α3i, and α5i. The last 2 bits are for extra parity checking to lower the miscorrection rate. With an error correction capability of 3, there are four possibilities to consider for the number of errors in the received message: 0, 1, 2, or 3. The error corrector can determine which of the four possibilities applies based on the syndrome.


Refining the approach outlined by Truhachev et al., "Efficient Implementation of 400 Gbps Optical Communication FEC", IEEE Trans. Circuits & Systems, v68n1, Jan 2021, we first define the error location polynomial as follows, which is the reciprocal of the one defined in most textbooks. Let eloc(X)=Π(X+Xi), where Xi=αki, i=1, 2, or 3.

    • Case 1: If the first 30 syndrome bits are zero and either of the last 2 syndrome bits is nonzero, the receive message has more errors than can be corrected. If all 32 syndrome bits are zero, the receive message contains no errors and the output data is the same as the input.
    • Case 2: If the receive message contains a single error, eloc(X)=X+X1, with the first ten syndrome bits S1=X1, the next ten syndrome bits S3=X13, and the third ten syndrome bits S5=X15. As X1≠0, S1≠0. Add the cube of S1 to S3, D3=S3+S13=X13+X13=0, and the fifth power of S1 to S5, D5=S5+S15=X15+X15=0. If S1 is nonzero while D3 and D5 are zero, there is a single error at k1. k1 can be found by taking log(X1), or by finding the column of Hcomp where the first ten bits match X1.
    • Case 3: If the receive message contains two errors,







eloc(X) = (X + X1)(X + X2) = X2 + (X1 + X2)X + X1X2 = X2 + S1X + (S3/S1 + S12) = X2 + S1X + D3/S1.
Because X1 ≠ X2, S1 = X1 + X2 ≠ 0. Because X1X2 ≠ 0, D3 = X1X2·S1 ≠ 0. Further,
S5S1 = (X15 + X25)(X1 + X2) = S32 + X1X2S14 = S32 + (S3/S1 + S12)S14 = S32 + S3S13 + S16,
so that D5S1 = S5S1 + S16 = S32 + S3S13 = D3S3.












    • Thus, if S1 is nonzero and D3 is nonzero while D5S1=D3S3 (nonzero), there are two errors. The coefficients of the quadratic error polynomial are 1, S1, and D3/S1, and it can be factored as described previously.

    • Case 4: If the receive message contains three errors, X1, X2, X3≠0 and X1, X2, X3 are pairwise distinct. The error location polynomial is











eloc(X) = (X + X1)(X + X2)(X + X3) = X3 + (X1 + X2 + X3)X2 + (X1X2 + X2X3 + X1X3)X + X1X2X3.
Replacing X with X1, X2, or X3:
X13 + (X1 + X2 + X3)X12 + (X1X2 + X2X3 + X1X3)X1 + X1X2X3 = 0  (19)
X23 + (X1 + X2 + X3)X22 + (X1X2 + X2X3 + X1X3)X2 + X1X2X3 = 0  (20)
X33 + (X1 + X2 + X3)X32 + (X1X2 + X2X3 + X1X3)X3 + X1X2X3 = 0  (21)









    • By summing the above 3 equations:














S3 + S13 + (X1X2 + X2X3 + X1X3)S1 + X1X2X3 = 0  (22)









    • From the weighted sum of equations 19-21










(19)·X12 + (20)·X22 + (21)·X32,









    • we get














S5 + S15 + (X1X2 + X2X3 + X1X3)S3 + X1X2X3S12 = 0  (23)
In matrix form (semicolons separating matrix rows):
[S1, 1; S3, S12]·[X1X2 + X2X3 + X1X3; X1X2X3] = [D3; D5]







    • Next, we will show the matrix,










[S1, 1; S3, S12]




is invertible.








det([S1, 1; S3, S12]) = S3 + S13 = D3,






    • i.e., we need to show that D3≠0. First, we show








that D3 = (X1 + X2)(X2 + X3)(X1 + X3):
(X1 + X2)(X2 + X3)(X1 + X3) = (S1 + X3)(S1 + X1)(S1 + X2)
= S13 + (X1 + X2 + X3)S12 + (X1X2 + X2X3 + X1X3)S1 + X1X2X3
= (X1X2 + X2X3 + X1X3)S1 + X1X2X3  (24)
By Equations (22) and (24), D3 = (X1 + X2)(X2 + X3)(X1 + X3) ≠ 0, so the matrix is invertible and
[σ2; σ3] = [X1X2 + X2X3 + X1X3; X1X2X3] = D3−1·[S12, 1; S3, S1]·[D3; D5] = [D5/D3 + S12; S3 + S1D5/D3]
Hence
eloc(X) = X3 + S1X2 + (D5/D3 + S12)X + S3 + S1D5/D3 = X3 + S1X2 + (D5/D3 + S12)X + (S3D3 + S1D5)/D3  (25)









    • Thus, if S1 is nonzero, D3 is nonzero, and S3D3+S1D5 is nonzero, there are three errors. The coefficients of the cubic error location polynomial are









1, S1, (D5/D3 + S12), and (S3D3 + S1D5)/D3,






    • and it can be factored as described previously.





To summarize, the conditions for the above 4 cases are:












cond1: S1 = 0 && S3 = 0 && S5 = 0;
cond2: S1 ≠ 0 && D3 = 0 && D5 = 0;
cond3: S1 ≠ 0 && D3 ≠ 0 && D5S1 = D3S3;
cond4: D3 ≠ 0 && S3D3 + S1D5 ≠ 0;







By now, we have shown how to calculate error location polynomials from the three 10-bit syndromes. By the method proposed in Section 2, if we find that the number of roots is less than the degree of the error location polynomial, we can conclude that the received word is not decodable.
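The four conditions can be modeled and checked in software (our sketch over GF(210) in the polynomial basis; syndrome values appear as field elements rather than the hardware's bit vectors):

```python
def gf1024_mul(a, b):
    # GF(2^10) in the polynomial basis, p(x) = x^10 + x^3 + 1
    c = 0
    for i in range(10):
        if (a >> i) & 1:
            c ^= b << i
    for ix in range(18, 9, -1):
        if (c >> ix) & 1:
            c ^= (1 << ix) ^ (1 << (ix - 10)) ^ (1 << (ix - 7))
    return c

def gf_pow(a, n):
    r = 1
    while n:
        if n & 1:
            r = gf1024_mul(r, a)
        a = gf1024_mul(a, a)
        n >>= 1
    return r

def classify(S1, S3, S5):
    """Infer the error count (0, 1, 2, or 3) from syndromes; -1 = uncorrectable."""
    D3 = S3 ^ gf_pow(S1, 3)                      # D3 = S3 + S1^3
    D5 = S5 ^ gf_pow(S1, 5)                      # D5 = S5 + S1^5
    if S1 == 0 and S3 == 0 and S5 == 0:
        return 0                                 # cond1
    if S1 != 0 and D3 == 0 and D5 == 0:
        return 1                                 # cond2
    t = gf1024_mul(D5, S1)
    if S1 != 0 and D3 != 0 and t != 0 and t == gf1024_mul(D3, S3):
        return 2                                 # cond3
    if D3 != 0 and (gf1024_mul(S3, D3) ^ gf1024_mul(S1, D5)) != 0:
        return 3                                 # cond4
    return -1
```

Feeding it syndromes computed from 0, 1, 2, or 3 distinct error locations Xi reproduces the corresponding case.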


With the foregoing in mind, FIG. 8 is a flow diagram of an illustrative decoding method enabling a BCH decoder to be implemented with an application-specific arrangement of logic gates and two relatively small look up tables. Alternative implementations may include programmable logic devices, field-programmable gate arrays, embedded controllers or processors executing suitable firmware, and general purpose processors executing suitable software from memory.


Block 802 represents a determination of the syndrome values S1, S3, S5, P by syndrome calculator 602. This determination corresponds to a multiplication of the receive message vector by the Hcomp matrix, which is implementable by an array of logic gates (AND gates for multiplications and XOR gates for summing the products). Though shown as being contingent on later-described tests, blocks 812, 820, and 838 may be speculatively performed ahead of time by syndrome calculator 602, also via an array of logic gates that carry out the addition, multiplication, power, root, and inverse multiplication operations to determine the values of the relevant variables.


Block 804 represents a test by location finder 604 to determine if the syndrome values S1, S3, S5, are all zero, corresponding to the case where no correctable errors are present. Though shown as being contingent, blocks 806, 814, 822, and 832 may be speculatively performed by location finder 604 in parallel with block 804 to determine which error locating procedure should be performed. If syndrome values S1, S3, S5, are all zero, location finder 604 checks whether the parity value P is zero in block 806. If not, the receive message vector is flagged in block 808 as having uncorrectable errors. Regardless of whether there are no errors or the errors are uncorrectable, the receive message vector is produced as output data in block 810.


Block 812 represents the determination of values D3=S3+S13 and D5=S5+S15. Block 814 represents a test of whether these values are both zero, corresponding to the case where the receive message contains a single error. If so, block 816 represents the location finder's determination of the single error location from the S1 syndrome value. As with the locations corresponding to roots of any of the polynomials, an indexing circuit can determine the bit index of an error in the receive message vector by comparing the first 10 bits of each column of Hcom with the root. If the root equals the first 10 bits, the column index (which equals the bit index of the receive message vector) is an error location. A parity verification can be performed by accumulating the last two bits of the Hcom columns at the error locations to calculate the parity delta syndrome. If the accumulated parity delta syndrome over all error locations is equal to the syndrome P value, then the errors are correctable.


Block 818 represents correction of the located error(s) by inverting the receive message bit at each error location. Though not expressly shown here, this correction may be made contingent on the determination that the located errors are correctable.


Block 820 represents the calculation of the product values D5S1 and D3S3. Block 822 represents a test of whether these values are equal, corresponding to the case where the receive message contains two errors. If so, block 824 represents the location finder's determination of coefficients for the quadratic error locator polynomial and use of variable substitution to obtain the trailing coefficient for Equation 9.






s = D3/S13 = S3/S13 + 1






Using the principles described previously, an array of logic gates can be used to obtain a product of a syndrome value vector with a quadratic solution matrix to obtain the two roots corresponding to the two error locations. The solution matrix can incorporate a reversal of the variable substitution used to obtain Equation 9. Block 826 represents the indexing circuit using the roots to find the matching Hcom column indices, which equal the bit indices of the receive message vector where the errors can be found and corrected in block 818 if the errors are determined to be correctable.


If blocks 804, 814, and 822, all yield negative tests, the receive message vector contains at least three errors. Block 832 represents a test by the location finder 604 to determine if the variable-substituted cubic polynomial is a cubic of the first type, in which case the trailing coefficient value is d=D3=S3+S13. In block 834 the location finder uses the depth-341 lookup table to obtain one of the three roots and uses it to calculate the other two roots. Supporting logic gates can be used to reverse the variable substitution, i.e., to add S1 to each root. Block 836 represents the indexing circuit using the roots to find the matching Hcom column indices, which equal the bit indices of the receive message vector where the errors can be found and corrected in block 818 if the errors are determined to be correctable.


If the block 832 test is negative, the variable-substituted cubic is of the second type. Block 838 represents the calculation of the trailing coefficient value






d = {D35/D53}1/2 = {(S3 + S13)5/(S5 + S15)3}1/2.






In block 840 the location finder uses the depth-170 lookup table to obtain two of the three roots, combining them to calculate the third root. Supporting logic gates can be used to reverse the variable substitution, i.e., multiplying each of them by







{(S5 + S15)/(S3 + S13)}1/2







    • and adding S1. Block 842 represents the use of the roots to find the matching Hcom column indices, which equal the bit indices of the receive message vector where the errors can be found and corrected in block 818 if the errors are determined to be correctable.





Though operations are described sequentially, it should be understood that the operations can be reordered, parallelized, and/or combined. For example, block 824 can be implemented as an array of logic gates to obtain two polynomial roots as a product of a syndrome value vector and a quadratic solution matrix. The quadratic solution matrix may combine the operations corresponding to a determination of a quadratic equation's trailing coefficient value s, a determination of the quadratic equation's roots, and a reversal of a variable substitution. Any sequence of operations that can each be expressed as a linear transformation can be combined into a single combined linear transformation.


Each of the BCH decoder components (syndrome calculator, error location finder, and error correction logic) can be pipelined. The syndrome calculator operates to compute the product of 1×1022 vector and 1022×32 matrix. If all 1022 input bits are available, the calculation can be performed via an AND-XOR gate array. The XOR gates can be organized in a binary tree. One contemplated implementation uses 3 pipeline stage registers to shorten the critical path. One contemplated implementation of the location finder, including error location polynomial calculation and factorization of the error location polynomial, is pipelined with 4 stage registers. If the number of roots found in the underlying field is smaller than the degree of the error location polynomial, the locator designates the errors as uncorrectable. Otherwise, the error correction logic will perform further check with the parity bits (the last two extended syndrome bits), accumulating the last two bits of the selected error location columns of the check matrix Hcom to obtain delta syndromes. The error correction logic takes 4 cycles to complete this further check. If the decoder is parallel out, i.e., all 1022 output bits are sent out simultaneously, the error correction logic needs only one cycle to correct the error bits. Hence, the latency of the decoder is 12 cycles.


This BCH decoder design was synthesized using a commercially available technology library. The target frequency was selected as 937.5 MHz with a 20% margin. The synthesis tool used was Synopsys Design Compiler. The total cell area of the BCH decoder came out to 9311 µm². Standard voltage threshold (SVT) devices account for 91% of the area, low voltage threshold (LVT) devices for 7.54%, and ultra-low voltage threshold (ULVT) devices for 1.45%.


Numerous alternative forms, equivalents, and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the claims be interpreted to embrace all such alternative forms, equivalents, and modifications that are encompassed in the scope of the appended claims.

Claims
  • 1. A circuit for correcting bit errors in a received version of an RS or BCH encoded bit stream, the circuit comprising: a syndrome calculator having a first array of logic gates to obtain syndrome values as a product of a receive message vector and a parity check matrix, the syndrome values including at least a first ten-bit syndrome value S1, a second ten-bit syndrome value S2, and a third ten-bit syndrome value S3; a location finder to derive a number of errors from the syndrome values, the location finder having: a second array of logic gates to obtain two polynomial roots as a product of a syndrome value vector and a quadratic solution matrix when the number of errors is two, the quadratic solution matrix corresponding to a determination of a quadratic equation's trailing coefficient value s, a determination of the quadratic equation's roots, and a reversal of a variable substitution; and an index circuit to determine a bit index for each of the polynomial roots; and an error corrector to receive for each receive message vector a set of zero or more bit indexes representing error locations in the receive message vector.
  • 2. The circuit of claim 1, wherein the reversal of the variable substitution is representable as x=S1·t, and the trailing coefficient value is representable as s=(S3+S1³)/S1³.
  • 3. The circuit of claim 1, wherein the first array of logic gates and the second array of logic gates each comprises a set of logical AND gates to implement bitwise multiplications and a set of logical XOR gates to implement bitwise additions.
  • 4. The circuit of claim 1, wherein the syndrome values include an error parity value P, wherein the error corrector is configured to accumulate a parity delta of the parity check matrix based on the set of bit indexes and configured to invert receive message vector bits corresponding to the bit indexes if the parity delta indicates the bit errors are correctable.
  • 5. The circuit of claim 1, wherein the location finder further includes: a first lookup table with supporting logic gates to obtain three polynomial roots when the number of errors is three and (S5+S1⁵)=0, the three polynomial roots corresponding to roots of a cubic equation x³+d with d representable as S3+S1³, and the first lookup table having a depth of no more than 341.
  • 6. The circuit of claim 5, wherein the first lookup table contains a single polynomial root or error location and the supporting logic gates derive remaining polynomial roots or error locations from the single polynomial root or error location.
  • 7. The circuit of claim 5, wherein the location finder further includes: a second lookup table with supporting logic gates to obtain three polynomial roots when the number of errors is three and (S5+S1⁵)≠0, the three polynomial roots corresponding to roots of a cubic equation x³+x+d with d representable as
  • 8. The circuit of claim 7, wherein the second lookup table contains two polynomial roots or error locations and the supporting logic gates for the second lookup table derive a remaining polynomial root or error location from the two polynomial roots or error locations contained in the second lookup table.
  • 9. The circuit of claim 1, wherein the location finder is configured to provide syndrome value S1 as a polynomial root when the number of errors is one.
  • 10. The circuit of claim 1, wherein the receive message vector contains exactly 1022 bits and the parity check matrix, Hcomp, has 32×1022 binary elements representing powers of a primitive polynomial root in a composite field GF((2⁵)²).
  • 11. An error correction method that comprises: obtaining syndrome values corresponding to a product of a receive message vector and a parity check matrix, the syndrome values including at least a first ten-bit syndrome value S1, a second ten-bit syndrome value S2, and a third ten-bit syndrome value S3; converting the syndrome values into a set of zero or more polynomial roots corresponding to error locations in the receive message vector, said converting including: deriving a number of errors from the syndrome values; and when the number of errors is two, using an array of logic gates to obtain two polynomial roots corresponding to a product of a syndrome value vector and a quadratic solution matrix, the quadratic solution matrix corresponding to a determination of a quadratic equation's trailing coefficient value s, a determination of the quadratic equation's roots, and a reversal of a variable substitution; and determining a bit index for each polynomial root in the set.
  • 12. The error correction method of claim 11, wherein the reversal of the variable substitution is representable as x=S1·t, and the trailing coefficient value is representable as s=(S3+S1³)/S1³.
  • 13. The error correction method of claim 11, wherein the array of logic gates comprises a set of logical AND gates to implement bitwise multiplications and a set of logical XOR gates to implement bitwise additions.
  • 14. The error correction method of claim 11, wherein the syndrome values include an error parity value P, wherein the method further comprises: accumulating a parity delta from the parity check matrix based on the bit indexes; and inverting each receive message vector bit corresponding to one of the bit indexes if the parity delta indicates the bit errors are correctable.
  • 15. The error correction method of claim 11, wherein the converting further includes: using a first lookup table with supporting logic gates to obtain three polynomial roots when the number of errors is three and (S5+S1⁵)=0, the three polynomial roots corresponding to roots of a cubic equation x³+d with d representable as S3+S1³, the first lookup table having a depth of no more than 341.
  • 16. The error correction method of claim 15, wherein the converting further includes: using a second lookup table with supporting logic gates to obtain three polynomial roots when the number of errors is three and (S5+S1⁵)≠0, the three polynomial roots corresponding to roots of a cubic equation x³+x+d with d representable as
  • 17. The error correction method of claim 11, further comprising using the syndrome value S1 as a polynomial root when the number of errors is one.
  • 18. The error correction method of claim 11, wherein the receive message vector contains exactly 1022 bits and the parity check matrix, Hcomp, has 32×1022 binary elements representing powers of a primitive polynomial root in a composite field GF((2⁵)²).
  • 19. The error correction method of claim 11, wherein said deriving the number of errors includes: determining that the number of errors is zero if the syndrome values are all zero; determining that the number of errors is one if syndrome value S1 is nonzero and remaining syndrome values are all zero; determining that the number of errors is two if (S5+S1⁵)S1=(S3+S1³)S3; and otherwise determining that the number of errors is three.
  • 20. The error correction method of claim 19, further comprising: verifying that roots exist for the quadratic equation if the number of errors is two, said verifying including determining that a binary representation of trailing coefficient s has a bit s7=0.
  • 21. An error correction method that comprises: obtaining syndrome values corresponding to a product of a receive message vector and a parity check matrix, the syndrome values including at least a first ten-bit syndrome value S1, a second ten-bit syndrome value S2, and a third ten-bit syndrome value S3; converting the syndrome values into a set of zero or more polynomial roots corresponding to error locations in the receive message vector, said converting including: deriving a number of errors from the syndrome values; and when the number of errors is three, using a first lookup table with supporting logic gates to obtain three polynomial roots when (S5+S1⁵)=0, the three polynomial roots corresponding to roots of a cubic equation x³+d with d representable as S3+S1³, the first lookup table having a depth of no more than 341; and determining a bit index for each polynomial root in the set.
  • 22. The error correction method of claim 21, wherein the converting further includes: using a second lookup table with supporting logic gates to obtain three polynomial roots when the number of errors is three and (S5+S1⁵)≠0, the three polynomial roots corresponding to roots of a cubic equation x³+x+d with d representable as
  • 23. The error correction method of claim 21, wherein the syndrome values include an error parity value P, wherein the method further comprises: accumulating a parity delta from the parity check matrix based on the bit indexes; and inverting each receive message vector bit corresponding to one of the bit indexes if the parity delta indicates these receive message vector bits are correctable.
  • 24. The error correction method of claim 21, wherein the receive message vector contains exactly 1022 bits and the parity check matrix, Hcomp, has 32×1022 binary elements representing powers of a primitive polynomial root in a composite field GF((2⁵)²).
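The error-count decision recited in claim 19 can be sketched behaviorally as below. This is a literal transcription of the claim's decision tree, not the patent's circuit: the field representation (GF(2¹⁰) with primitive polynomial x¹⁰+x³+1 rather than the patent's composite field GF((2⁵)²)) and the reading of "remaining syndrome values" as the raw S3 and S5 are assumptions made for illustration.

```python
# Behavioral sketch of claim 19's error-count decision. The field
# arithmetic below uses a plain GF(2^10) representation with the
# primitive polynomial x^10 + x^3 + 1; the patent's decoder instead
# works in the composite field GF((2^5)^2), so element encodings here
# are illustrative only.

POLY = (1 << 10) | (1 << 3) | 1  # x^10 + x^3 + 1

def gf_mul(a: int, b: int) -> int:
    """Carryless multiply of two GF(2^10) elements mod POLY."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        b >>= 1
        a <<= 1
        if a & (1 << 10):
            a ^= POLY
    return acc

def num_errors(s1: int, s3: int, s5: int) -> int:
    """Literal transcription of the decision tree in claim 19."""
    if s1 == 0 and s3 == 0 and s5 == 0:
        return 0                              # all syndromes zero
    if s1 != 0 and s3 == 0 and s5 == 0:
        return 1                              # only S1 nonzero
    s1_3 = gf_mul(gf_mul(s1, s1), s1)         # S1^3
    s1_5 = gf_mul(s1_3, gf_mul(s1, s1))       # S1^5
    # Two errors iff (S5 + S1^5)*S1 == (S3 + S1^3)*S3 (GF addition is XOR).
    if gf_mul(s5 ^ s1_5, s1) == gf_mul(s3 ^ s1_3, s3):
        return 2
    return 3
```

For example, num_errors(0, 0, 0) returns 0 and num_errors(7, 0, 0) returns 1; the two- versus three-error branches exercise the GF(2¹⁰) products.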