Computer systems are often designed based on the paradigm that the CPU-memory pair is fast, while network and storage are slow. Over the years, CPU-memory and storage components developed their own languages and interfaces, which require layers of software to translate CPU-memory commands into network and storage commands and vice versa. The speed of the CPU-memory pair relative to network and storage I/O was such that these software layers had minimal impact on system performance. However, network and storage technologies are quickly catching up with CPU-memory speeds, and the burden of generations of software layers has become significant. To address these concerns, physical layer (PHY) technology, namely Gen-Z technology, has emerged as a solution to eliminate existing system bottlenecks and significantly improve system efficiency and performance by unifying communication paths and simplifying software.
In the past, with lower bandwidths and less sophisticated PHY technology (e.g., prior to Gen-Z), link reliability was fairly good relative to performance. Forward error correction (FEC) was generally not required, because error detection with retry was an efficient strategy with good performance. In cases where FEC was used, it involved relatively simple encoding and formatting, permitting correction of trivial errors (e.g., isolated bit errors, or short single burst errors). As technology advances have regularly increased achievable bandwidth, stronger and more sophisticated FEC schemes have been introduced. However, low bandwidth overhead has generally been a priority in these existing FEC schemes, resulting in very large codeword sizes and large, complex decoders (e.g., Ethernet). This results in high latency, which is generally acceptable for storage and networking, but not where processor-to-memory accesses with load/store semantics are involved. For example, in many Gen-Z based systems, achieving the lowest possible latency is desirable.
The present disclosure, in accordance with one or more various embodiments, is described in detail with reference to the following figures. The figures are provided for purposes of illustration only and merely depict typical or example embodiments.
The figures are not exhaustive and do not limit the present disclosure to the precise form disclosed.
Various embodiments described herein are directed to forward error correction (FEC) techniques that can be used in PHY links that support advanced PHY technology, such as Gen-Z technology. The disclosed FEC techniques are distinctly designed to minimize latency in the PHY links in a manner that is optimal for Gen-Z systems. Implementing low latency FEC techniques, as disclosed herein, provides improvements over many existing FEC schemes that typically employ very large code word sizes and result in high latency, as alluded to above.
As a general description, the disclosed FEC techniques are designed to minimize two contributors to FEC-based latency, namely codeword length and correction latency. That is, the FEC techniques use pre-coding, binary FEC encoding that uses a physical unit formatting (referred to herein as phits), and a control bit sanity check. According to the embodiments, these aspects of the FEC techniques operate interdependently such that latency can be reduced in PHY links, particularly Gen-Z links.
The demand for high-bandwidth, low-latency fabrics for high complexity applications, such as High Performance Computing (HPC) and memory-semantic applications, continues to grow. However, as interconnect bandwidth has increased, PHY designers have struggled to keep link error rates “invisible” to application performance. In particular, the transition from two-level signaling to four-level Pulse-Amplitude Modulation (PAM4) signaling (where two bits of data are transmitted per symbol) has increased error rates to the point that simply retrying packets to recover from errors would result in significant performance degradation. Therefore, some link protocols have added FEC schemes to correct link errors at the receiver without requiring retry. Nonetheless, many of these conventional FEC schemes add latency for every link hop. Traditional fabrics like InfiniBand and Ethernet have high-latency FEC, which contributes to reduced performance for latency sensitive applications such as fabric load/store and node-to-node messages. For example, in a 3D HyperX topology, inter-processor messages traverse three switch hops in one direction for a request, and another three hops in the return direction for the response. The switch chip can have an 80 ns latency, resulting in 480 ns end-to-end switch latency. As an example, FEC for InfiniBand adds 37.8 ns per switch, for an additional 226.8 ns end-to-end switch latency. On the other hand, a low-latency FEC scheme able to achieve 2 ns FEC latency, for example, would only add 12 ns end-to-end through the switch fabric, resulting in a significant performance gain and competitive advantage. Accordingly, the disclosed FEC techniques can be applied to links that characteristically have a high probability of burst errors, like PAM4 links, and to Gen-Z links (having a lower BER), and can realize the advantages associated with low overall latency and low correction latency in HPC and memory-semantic applications.
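The end-to-end figures quoted above follow from multiplying each per-hop latency by the six switch traversals of a request/response round trip. A minimal Python sketch of that arithmetic follows; the function name and the `hops_each_way` parameter are illustrative and are not part of the disclosure.

```python
def end_to_end_latency_ns(per_hop_ns: float, hops_each_way: int = 3) -> float:
    """Latency accumulated over a request plus its response.

    Assumes the same number of switch hops in each direction, as in the
    3D HyperX example in the text.
    """
    return per_hop_ns * hops_each_way * 2


# Per-hop values taken from the text.
switch_ns = end_to_end_latency_ns(80.0)           # switch traversal alone
infiniband_fec_ns = end_to_end_latency_ns(37.8)   # FEC latency quoted for InfiniBand
low_latency_fec_ns = end_to_end_latency_ns(2.0)   # hypothetical 2 ns FEC

print(f"switch only:        {switch_ns:.1f} ns")
print(f"InfiniBand FEC adds {infiniband_fec_ns:.1f} ns")
print(f"2 ns FEC adds       {low_latency_fec_ns:.1f} ns")
```

Under these numbers, the FEC share of the round-trip latency drops from roughly a third of the total to a few percent.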
As previously described, latency with respect to FEC schemes often originates from two sources, codeword length and correction latency. A FEC codeword can be thought of as an “FEC packet” that is composed of a data payload and FEC redundancy. For example, for the device 100 to transfer a packet from its PHY 114a to the PHY 121 of device 120, the entire codeword must be received before the data payload can be consumed by the receiver, namely device 120. Thus, the larger the codeword, the higher the FEC latency, generally. After the entire codeword is received by the PHY 121 of device 120, errors can be detected and corrected at the receiver, which typically involves a pipelined FEC decoder. The greater the correction capability of the FEC, the greater the number of pipeline stages of the FEC decoder and the greater the correction latency. Therefore, a “lowest-latency” FEC scheme would theoretically consist of an extremely small codeword and light-weight error correction (e.g., a small number of pipeline stages). The FEC techniques, according to the embodiments, have been designed to move in the direction of “lowest-latency” by reducing the FEC codeword size and correction latency using: pre-coding; Bose-Chaudhuri-Hocquenghem (BCH) encoding; correction bypass; and minimized channel noise (e.g., low BER channel). With each of these aspects working together, the disclosed FEC techniques can be used by systems having advanced PHY technology, such as a Gen-Z system 100, to achieve low latency and error correction (e.g., link reliability) in operation. For instance, in the case of a PAM4 link, the disclosed FEC techniques can achieve a latency as low as 2 ns per-link hop for a four-lane link.
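Because the whole codeword must arrive before its payload can be consumed, the codeword length sets a floor on FEC latency. The sketch below estimates that accumulation time under stated assumptions: the 212.5 Gbps aggregate link rate is borrowed from the four-lane PAM4 example elsewhere in this description, the 288-bit size is the small phit codeword of the embodiments, and the 5440-bit size (that of an RS(544,514) codeword with 10-bit symbols) is included only as a point of comparison. Decoder pipeline latency is not modeled.

```python
def accumulation_time_ns(codeword_bits: int, link_gbps: float) -> float:
    """Time to receive one full codeword: bits / (Gbit/s) gives nanoseconds."""
    return codeword_bits / link_gbps


LINK_GBPS = 212.5  # assumed aggregate rate across all four lanes

small_ns = accumulation_time_ns(288, LINK_GBPS)    # small phit codeword
large_ns = accumulation_time_ns(5440, LINK_GBPS)   # large codeword, for comparison only

print(f"288-bit codeword:  {small_ns:.2f} ns")
print(f"5440-bit codeword: {large_ns:.2f} ns")
```

The small codeword accumulates in well under 2 ns at this rate, which leaves room for a short correction stage within the per-hop figure mentioned above.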
The PHY requirements for Gen-Z technology call for a sufficiently low raw Bit Error Ratio (BER). Accordingly, the disclosed FEC techniques can further leverage this feature of Gen-Z technology to realize low latency. A low BER is critical to reducing the number of errors that need to be corrected, which allows a smaller codeword to be used by the FEC techniques and thereby minimizes correction latency. As previously described, Gen-Z links can have a raw BER of 10^-9 pre-correction. This can be achieved for both fabric channels (20 dB insertion loss) and local channels (10 dB insertion loss), provided certain channel optimizations (e.g., minimization of the channel noise and reflectivity) are made. As an example, a PAM4 212.5 Gbps 4-lane link with a 10^-9 BER can result in one error every 4.7 ms on average, which means the FEC would only need to correct a single error event per codeword in order to achieve a 10^-15 BER post-correction. By contrast, Ethernet supports a raw BER of up to 3.13E-4 (one error every 147.3 ns), which requires its FEC to support correction of several error events per codeword, leading to increased codeword size and correction latency.
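The 4.7 ms figure can be reproduced by treating bit errors as independent events occurring at the raw BER. This is a minimal sketch, assuming that 212.5 Gbps is the aggregate rate over all four lanes; the function names are illustrative.

```python
def mean_time_between_errors_s(link_bps: float, raw_ber: float) -> float:
    """Average spacing between independent bit errors on the link."""
    return 1.0 / (link_bps * raw_ber)


def expected_errors_per_codeword(codeword_bits: int, raw_ber: float) -> float:
    """Mean number of raw bit errors landing in one codeword."""
    return codeword_bits * raw_ber


GENZ_LINK_BPS = 212.5e9   # assumed aggregate rate of the 4-lane PAM4 link
GENZ_RAW_BER = 1e-9

mtbe_ms = mean_time_between_errors_s(GENZ_LINK_BPS, GENZ_RAW_BER) * 1e3
per_codeword = expected_errors_per_codeword(288, GENZ_RAW_BER)

print(f"mean time between errors: {mtbe_ms:.1f} ms")
print(f"expected errors per 288-bit codeword: {per_codeword:.2e}")
```

With fewer than one expected error per million codewords, two independent error events in the same 288-bit codeword are rare, which is why correcting a single error event per codeword can be sufficient.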
In detail, the disclosed FEC techniques utilize a codeword which is small (to reduce overhead), and limit error correction while providing strong error detection to minimize the chance of silent data corruption. Even though the low raw BER implemented in Gen-Z technology generally reduces the likelihood of error occurrences, burst errors can still exist on the link, which makes continued use of a FEC scheme desirable. For example, the Decision Feedback Equalizer (DFE), a receiver circuit designed to remove inter-symbol interference (ISI), can cause single error events to propagate into a burst of errors due to its feedback loop. In the worst case, a PAM4 DFE will cause an error to propagate 75% of the time, which produces a new error that also has a 75% chance of propagating. This cycle may repeat until the error stops propagating. Many link protocols deal with this error propagation using Reed-Solomon (RS) FEC that corrects any combination of bit flips per FEC symbol. FEC symbols are usually 8 or 10 bits, and a DFE-induced burst error caused by a single-error event can easily span 3 or more symbols. The effects of burst errors on conventional FEC schemes, such as RS FEC, can lead to increased FEC decoder complexity, area, and latency in order to achieve the necessary correction capability. Many link protocols allow the use of pre-coding as a mechanism for reducing the number of bit flips caused by a DFE burst error.
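The repeated 75% propagation chance described above behaves like a geometric process. The following sketch, assuming each errored PAM4 symbol independently propagates to the next with the same worst-case probability, gives the resulting burst-length statistics; the function names are illustrative.

```python
def prob_burst_at_least(length: int, p_propagate: float = 0.75) -> float:
    """Probability that a burst spans at least `length` PAM4 symbols.

    The first errored symbol is always present; each further symbol
    requires one more successful propagation.
    """
    return p_propagate ** (length - 1)


def expected_burst_length(p_propagate: float = 0.75) -> float:
    """Mean burst length (in PAM4 symbols) of the geometric process."""
    return 1.0 / (1.0 - p_propagate)


print(f"expected burst length: {expected_burst_length():.1f} symbols")
for n in (2, 5, 10):
    print(f"P(burst >= {n:2d} symbols) = {prob_burst_at_least(n):.3f}")
```

In this model the average worst-case burst is four PAM4 symbols long, and bursts of ten or more symbols still occur several percent of the time, which is consistent with bursts crossing FEC symbol boundaries.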
The disclosed FEC techniques are based on the use of a FEC codeword. As alluded to above, FEC codewords of the embodiments are generally small in size, for instance 288 bits, in order to achieve a low latency. Additionally, the FEC techniques can adjust for higher BER use cases by employing larger codeword sizes, for instance carrying 512 or 768 packet stream bits. An example of a small codeword for achieving low latency, comprising a 288-bit phit (physical unit), is shown in
In detail,
The BCH encoding applied to the phit 200 can be a binary CRC encoding, and the codeword is bit-reversed for transfer on the link, as is common for CRC codes. The “lowest” bit of the packet stream 205 will travel first on the link and the “highest” bit of the FEC parity 215 will travel last. Furthermore, interspersing the ctl bits 210 within the phit 200 can reduce the probability that all of the multiple copies of the ctl bits 211-214 will be affected by a single error event, particularly when the copies of the ctl bits 211-214 travel on separate lanes of the link. In transmission across the fabric, the 288-bit phit 200 can be striped across lanes of the link. For example, the phit 200 is transferred on a four-lane link, where the bits indicated at the top of
Now referring to
By employing phits, such as phit 300 (and phit shown in
Generally speaking, there are two different Phit FECs that can be employed by the FEC techniques. For instance, which Phit FEC is utilized by the FEC techniques can be based on the raw BER of the link. Links having a raw BER of 10^-9 or better can use the Phit FEC 288, and links with a raw BER of 10^-7 or better can use the Phit FEC 320. As alluded to above, a Phit FEC 320 may be more optimal in links with a higher BER because of its added redundancy. Both Phit FECs can result in a corrected BER of 10^-15 or better. If the raw bit errors are not sufficiently random to reach a corrected BER of 10^-15 or better, e.g., due to high crosstalk between lanes, the raw BER should be reduced until a corrected BER of 10^-15 or better can be reached.
Additionally, the disclosed FEC techniques can further leverage the use of larger phits, such as FEC Phit 320. In other words, although the phits may be larger, contributing to a greater latency, the phits are also formatted for improved redundancy (e.g., a 60 bit codeword). Accordingly, the FEC techniques also employ phits of multiple variable sizes. In the embodiments, variable size phits can include: phits including 256 bits of Gen-Z packet stream (shown in
Referring now to
In
Referring now to
At operation 605, data can be encoded in accordance with Phit Forward Error Correction (FEC). Transmit data can be sent from a Gen-Z core over the PLA Interface to the PHY of the device transmitting the data, also referred to as the transmitting PHY, where it is encoded. That is, the data is formatted into phit “packets”, which allows protection of the data as it is transmitted on the physical link. As disclosed throughout, the use of phits in the FEC techniques allows burst errors to be detected and corrected in a manner that enhances link reliability and reduces the number of transaction re-tries. The encoding of operation 605 can be BCH encoding. Encoding of operation 605 can particularly involve formatting the data into any of the phits described in reference to
Next, at operation 610, the data is pre-coded. Generally, pre-coding the data in operation 610 can help turn burst errors, caused by the receiver DFE, into two single bit errors and can be useful when remapping Gen-Z to a higher transfer rate than the default. In other words, pre-coding helps to reduce bit flips caused by DFE-induced burst errors. As previously described, one of the aspects of the disclosed FEC techniques is pre-coding. Pre-coding is a method of encoding data at the transmitting PHY that, in most cases, causes DFE-induced burst errors to turn into two bit flips (a single-bit flip at the start of the burst and a single-bit flip at the end of the burst). As background, when Serializer/Deserializer (SERDES) link errors are analyzed, there are usually two main factors to consider: (1) Bit Error Rate (BER), which determines the frequency of independent errors; and (2) Decision Feedback Equalizer (DFE) error propagation, whereby there is a certain probability that an independent error will propagate to subsequent symbols. These factors are interdependent, as aggressive DFE tap weights (intended to minimize BER) generally result in a higher probability of error propagation (longer burst errors). In other words, strategies that reduce the frequency of errors tend to result in more severe (less easily corrected) errors.
When considering the burst errors resulting from DFE error propagation, pre-coding (for NRZ or PAM4) is a useful method for converting long burst errors to isolated bit errors. In general, a single burst can be converted to 2 bit errors (e.g., entry and exit bit errors). Pre-coding has been proposed and enabled in various standards (e.g., 802.3) but has not been widely used. With a large variability in channel quality (resulting from a variety of routing topologies, cables, connectors, etc.), there tend to be pathological scenarios where pre-coding is not optimal (e.g., does not convert a burst error to 2 bit errors with high probability). This has resulted in a perception of risk, such that FEC schemes have been designed with no dependence on pre-coding (i.e., requiring correction of long burst errors). However, with Gen-Z technology, the PHY specification is much more restrictive with respect to supported channel topologies. As a result, pre-coding is more effective and can work well in expected real-world scenarios. Therefore, the FEC techniques for use in Gen-Z leverage pre-coding and benefit from it in various ways.
However, no link protocol fully relies on pre-coding for mitigating DFE burst errors. Instead, pre-coding is an optional feature. Fully relying on pre-coding is risky because pre-coding effectiveness varies depending on the channel. Also, pre-coding can cause more bit flips in cases where the DFE does not propagate bit errors. For example, if an error event produces a single-bit error that is not propagated by the DFE, pre-coding will turn that single-bit flip into a double-bit flip. That is, pre-coding is performed in operation 610 to enhance the burst error detection and correction of the process 600, but it is not the only mechanism employed.
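The two behaviors described above (a long burst collapsing to an entry error and an exit error, and an isolated error being doubled) can be reproduced with a small symbol-level model. This is a minimal sketch under two assumptions that this description does not spell out: the pre-coder is the 1/(1+D) modulo-4 form commonly used for PAM4 links, and a DFE burst is idealized as consecutive symbol errors of alternating sign. All function names are illustrative, and errors are counted in PAM4 symbols rather than bits.

```python
def precode(symbols, init=0):
    """Transmit side: p[n] = (d[n] - p[n-1]) mod 4."""
    out, prev = [], init
    for d in symbols:
        prev = (d - prev) % 4
        out.append(prev)
    return out


def reverse_precode(received, init=0):
    """Receive side: d[n] = (r[n] + r[n-1]) mod 4."""
    out, prev = [], init
    for r in received:
        out.append((r + prev) % 4)
        prev = r
    return out


def inject_burst(symbols, start, length):
    """Idealized DFE burst: +1, -1, +1, ... level errors on consecutive symbols."""
    out = list(symbols)
    for i in range(length):
        err = 1 if i % 2 == 0 else -1
        out[start + i] = (out[start + i] + err) % 4
    return out


def error_positions(sent, received):
    return [i for i, (a, b) in enumerate(zip(sent, received)) if a != b]


data = [0, 1, 2, 3, 3, 2, 1, 0, 1, 3, 0, 2, 2, 1, 3, 0]
tx = precode(data)

# Six-symbol burst, with and without pre-coding.
burst_plain = error_positions(data, inject_burst(data, 5, 6))
burst_coded = error_positions(data, reverse_precode(inject_burst(tx, 5, 6)))

# Isolated single-symbol error that the DFE does not propagate.
single_coded = error_positions(data, reverse_precode(inject_burst(tx, 5, 1)))

print("burst, no pre-coding:  ", burst_plain)
print("burst, with pre-coding:", burst_coded)
print("single, with pre-coding:", single_coded)
```

In this model the interior errors of the burst cancel in the receiver's modulo-4 sum, leaving only the entry and exit positions, while an isolated error is spread over two adjacent symbols.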
It can be assumed that the process 600 has converted DFE produced burst errors to only two errors, entry and exit errors on 20 contiguous UI patterns, after the data is pre-coded in operation 610. Then, the process 600 can proceed to operation 615, where the data is transmitted in the phit format to the receiving PHY. The data can be transmitted in operation 615 serially by the lane drivers over the electrical or optical physical medium. Further, the phit can be striped across lanes when it is transmitted, which protects the control bits of the phit in a reliable way. For example, transmitting in operation 615 can include transferring phits on a four-lane link, where the phit is striped across lanes with byte granularity. Even further, the codeword used by process 600 is small (e.g., due to pre-coding and enhanced BER), in a manner that allows the FEC techniques to achieve low latency.
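Byte-granular striping across a four-lane link can be sketched as follows. This is only an illustration: it assumes a simple round-robin assignment in which byte i of the phit travels on lane i mod 4, whereas the actual lane mapping of the embodiments is defined by the figures. The function names are hypothetical.

```python
from typing import List


def stripe(phit: bytes, lanes: int = 4) -> List[bytes]:
    """Distribute phit bytes round-robin: byte i goes to lane i % lanes."""
    return [phit[lane::lanes] for lane in range(lanes)]


def unstripe(lane_data: List[bytes]) -> bytes:
    """Reassemble the phit from equally sized per-lane byte streams."""
    lanes = len(lane_data)
    out = bytearray(sum(len(d) for d in lane_data))
    for lane, data in enumerate(lane_data):
        out[lane::lanes] = data
    return bytes(out)


phit = bytes(range(36))          # a 288-bit phit is 36 bytes
per_lane = stripe(phit)

for lane, data in enumerate(per_lane):
    print(f"lane {lane}: {list(data)}")
```

Under this mapping each lane carries 9 of the 36 bytes, so an error event confined to one lane touches at most a quarter of the phit.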
When a phit successfully arrives at the receiving PHY, the receiving PHY can begin decoding the FEC codeword at operation 620. Operation 620 can involve a reverse pre-coding of the received data. As a general description, reverse pre-coding is a process that reverses the effect of pre-coding the data, as performed in previous operation 610 (at the transmitting PHY). The receiving PHY may perform reverse pre-coding before the FEC decode.
Decoding of operation 620 can involve reversing the final step of phit construction. So, for example, the four copies of the “ctl” bit and the 256 bits of the packet stream are once again a contiguous payload in the FEC codeword. The codeword can be bit-reversed if necessary (depending on the decoder implementation). Then, FEC decoding can be performed on the codeword. As alluded to above, because the pre-coding has been performed at the transmitting PHY, decoding at the receiving PHY can be light-weight by restricting correction to 2 bit errors. In addition to the FEC decode/correction, decoding at operation 620 can include a sanity check. The sanity check is performed to make sure the corrected copies of the control bit all agree. For example, the process 600 can check for identical (corrected) copies of the “ctl” bit in the received phit. If the sanity check is successful, then any errors in the phit can be considered corrected by the FEC techniques. However, if the sanity check fails, then the phit is treated as FEC uncorrectable (resulting in retry of affected packets). For instance, if the four copies of the “ctl” bit are not all identical, then link resynchronization can be initiated for a transaction re-try.
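The control bit sanity check amounts to a simple agreement test applied after FEC correction. A minimal sketch follows, in which the `PhitStatus` enum, the `check_phit` function, and the boolean standing in for the FEC decoder result are all hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Sequence


class PhitStatus(Enum):
    ACCEPTED = "accepted"             # payload can be consumed
    UNCORRECTABLE = "uncorrectable"   # triggers retry of affected packets


def check_phit(ctl_copies: Sequence[int], fec_decode_ok: bool) -> PhitStatus:
    """Apply the control bit sanity check after FEC decode/correction.

    ctl_copies holds the (corrected) copies of the ctl bit recovered
    from the phit; fec_decode_ok is the decoder's own verdict.
    """
    if not fec_decode_ok:
        return PhitStatus.UNCORRECTABLE
    if len(set(ctl_copies)) != 1:
        # The copies disagree, so the correction cannot be trusted.
        return PhitStatus.UNCORRECTABLE
    return PhitStatus.ACCEPTED


print(check_phit([1, 1, 1, 1], fec_decode_ok=True))
print(check_phit([1, 0, 1, 1], fec_decode_ok=True))
```

Because the copies are interspersed through the phit, a miscorrection that flips payload bits is likely to disturb at least one copy, which lets this check catch some errors that the decoder alone would pass.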
Referring now to
A pattern generator 730 can create transmitter data which is pre-coded by the pre-coder 706 at the transmitter 705. Then, the pre-coded data can be modified by the channel 710. The channel 710 can inject a single-bit error that is propagated by the DFE 721 at the receiver 720, and then reverse pre-coded by the reverse pre-coder 722. The resulting receiver data is received by a pattern checker 750 in order to be checked for errors. The error pattern and burst length can be recorded by the pattern checker to determine pre-coding effectiveness. For example, the physical channel analysis tool 700 can emulate a wide range of channels and transmit data. According to some results of the physical channel analysis tool 700, pre-coding was 99.9% effective in transforming DFE-induced burst errors into two bit errors. This allows the FEC techniques to utilize binary BCH encoding with error correction constrained to two bit flips to correct an entire error event including DFE error propagation, thereby requiring a less complex decoder and performing stronger error detection.
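The idea of a binary BCH code whose correction is constrained to two bit flips, with remaining syndromes reserved for detection, can be illustrated with a toy code. The sketch below uses the classic BCH(15,7) code with generator polynomial x^8 + x^7 + x^6 + x^4 + 1, not the code parameters of the phit FEC, and it corrects by syndrome table lookup rather than with a pipelined hardware decoder. All names are illustrative.

```python
from itertools import combinations

N, K = 15, 7                 # toy code length and payload size
R = N - K                    # number of parity bits
GEN = 0b111010001            # x^8 + x^7 + x^6 + x^4 + 1


def poly_mod(value: int, gen: int = GEN) -> int:
    """Remainder of value(x) divided by gen(x) over GF(2)."""
    gen_deg = gen.bit_length() - 1
    while value.bit_length() - 1 >= gen_deg:
        value ^= gen << (value.bit_length() - 1 - gen_deg)
    return value


def encode(msg: int) -> int:
    """Systematic CRC-style encoding: payload followed by parity."""
    shifted = msg << R
    return shifted | poly_mod(shifted)


# Syndrome of every 1-bit and 2-bit error pattern (120 entries).
SYNDROMES = {}
for weight in (1, 2):
    for positions in combinations(range(N), weight):
        pattern = sum(1 << p for p in positions)
        SYNDROMES[poly_mod(pattern)] = pattern


def decode(word: int):
    """Return (payload, True) or (None, False) if uncorrectable."""
    syndrome = poly_mod(word)
    if syndrome == 0:
        return word >> R, True
    pattern = SYNDROMES.get(syndrome)
    if pattern is None:
        return None, False       # detected but not corrected
    return (word ^ pattern) >> R, True


msg = 0b1011001
codeword = encode(msg)
corrupted = codeword ^ (1 << 3) ^ (1 << 11)   # two bit flips
print(decode(corrupted))
```

Only 120 of the 255 nonzero syndromes map to a correction in this toy code; any received word producing one of the others is flagged rather than corrected, which is the trade between correction strength and detection strength that the text describes.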
The computer system 800 also includes a main memory 806, such as a random-access memory (RAM), cache and/or other dynamic storage devices, coupled to fabric 802 for storing information and instructions to be executed by processor 804. Main memory 806 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 804. Such instructions, when stored in storage media accessible to processor 804, render computer system 800 into a special-purpose machine that is customized to perform the operations specified in the instructions.
The computer system 800 further includes storage devices 810 such as a read only memory (ROM) 808 or other static storage device coupled to fabric 802 for storing static information and instructions for processor 804. A storage device 810, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to fabric 802 for storing information and instructions.
The computer system 800 may be coupled via fabric 802 to a display 812, such as a liquid crystal display (LCD) (or touch screen), for displaying information to a computer user. An input device 814, including alphanumeric and other keys, is coupled to fabric 802 for communicating information and command selections to processor 804. Another type of user input device is cursor control 816, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 804 and for controlling cursor movement on display 812. In some embodiments, the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.
The computing system 800 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s). This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
In general, the words “component,” “engine,” “system,” “database,” “data store,” and the like, as used herein, can refer to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++. A software component may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software components may be callable from other components or from themselves, and/or may be invoked in response to detected events or interrupts. Software components configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution). Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware components may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors.
The computer system 800 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 800 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 800 in response to processor(s) 804 executing one or more sequences of one or more instructions contained in main memory 806. Such instructions may be read into main memory 806 from another storage medium, such as storage device 810. Execution of the sequences of instructions contained in main memory 806 causes processor(s) 804 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
As used herein, a circuit might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a circuit. In implementation, the various circuits described herein might be implemented as discrete circuits or the functions and features described can be shared in part or in total among one or more circuits. Even though various features or elements of functionality may be individually described or claimed as separate circuits, these features and functionality can be shared among one or more common circuits, and such description shall not require or imply that separate circuits are required to implement such features or functionality. Where a circuit is implemented in whole or in part using software, such software can be implemented to operate with a computing or processing system capable of carrying out the functionality described with respect thereto, such as computer system 800.
As used herein, the term “or” may be construed in either an inclusive or exclusive sense. Moreover, the description of resources, operations, or structures in the singular shall not be read to exclude the plural. Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps.
Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. Adjectives such as “conventional,” “traditional,” “normal,” “standard,” “known,” and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. The presence of broadening words and phrases such as “one or more,” “at least,” “but not limited to” or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent.
Number | Date | Country | |
---|---|---|---|
20220123860 A1 | Apr 2022 | US |