This invention relates to methods and systems for Hybrid Automatic Repeat Request (HARQ), and more specifically to methods and systems for offloading and managing processing on a co-processor to improve HARQ performance and latency.
The Hybrid Automatic Repeat Request (HARQ) management module is located in the decoding chain. Its main functions are: buffering the Log-Likelihood Ratio (LLR) information of each unsuccessfully decoded transport block (TB) for each User Equipment (UE); combining the LLRs of each retransmitted code block (CB) before sending them to a decoder, e.g., a Low Density Parity Check (LDPC) decoder; and performing Cyclic Redundancy Check (CRC) checking for each CB and each TB after the decoder.
An embodiment of the interior block diagram of HARQ management of this invention is shown in
An embodiment of this invention is an enhanced HARQ combining algorithm, whose principle of operation is shown below.
The data interface of an embodiment comprises an input interface for the LLRs of each HARQ process (HP) or TB, and an output interface for the decoded information bits. The control interface of an embodiment comprises an input interface for NDI (new data indicator), rv (HARQ redundancy version), HPN (HARQ process number), UE_ID (UE identity), and TBSize (size of the transport block), and an output interface for the ACK/NACK of each TB/HARQ process.
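The control signals listed above suggest how a soft buffer might be indexed and updated. The following is a minimal sketch of generic LLR soft combining, not the enhanced algorithm of this invention. It assumes signed 8-bit LLRs, assumes the incoming LLRs have already been mapped to their positions in the code block (so rv is not modeled), and treats NDI as a simple "new data" flag. The names SoftBuffer, combine_llrs, update, and release are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumption: LLRs are signed 8-bit values, so sums must saturate.
LLR_MAX = 127
LLR_MIN = -128


def combine_llrs(stored: List[int], received: List[int]) -> List[int]:
    """Element-wise saturating addition of stored and retransmitted LLRs."""
    if len(stored) != len(received):
        raise ValueError("LLR vectors must have the same length")
    return [max(LLR_MIN, min(LLR_MAX, s + r)) for s, r in zip(stored, received)]


@dataclass
class SoftBuffer:
    """Hypothetical soft buffer keyed by (UE_ID, HPN)."""
    buffers: Dict[Tuple[int, int], List[int]] = field(default_factory=dict)

    def update(self, ue_id: int, hpn: int, new_data: bool,
               llrs: List[int]) -> List[int]:
        """Return the LLRs to pass to the decoder and store them."""
        key = (ue_id, hpn)
        if new_data or key not in self.buffers:
            # New transmission: overwrite whatever was stored for this process.
            combined = list(llrs)
        else:
            # Retransmission: combine with the previously stored LLRs.
            combined = combine_llrs(self.buffers[key], llrs)
        self.buffers[key] = combined
        return combined

    def release(self, ue_id: int, hpn: int) -> None:
        """Free the entry, e.g., after the TB passes its CRC check (ACK)."""
        self.buffers.pop((ue_id, hpn), None)
```

In this sketch a buffer entry is kept only while a TB remains undecoded, which matches the buffering role described for the HARQ management module; a real implementation would also track TBSize and per-CB CRC status.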
According to 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; NR; Physical layer procedures for data (Release 15), the maximum number of HARQ processes for each UE is Hpn=16, so the Base Station (BS) needs to store at most Hpn slots of LLRs for HARQ combining. Take a 100 MHz system with 30 kHz subcarrier spacing as an example, where the number of usable subcarriers per OFDM symbol is Nsc=273*12=3276, the number of symbols in each slot is L=14, and the maximum number of raw information bits per subcarrier is Nb=7.4063 as defined in Table 5.1.3.1-2 of 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; NR; Physical layer procedures for data (Release 15). Considering the rate-1/3 LDPC encoder defined in 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; NR; Multiplexing and channel coding (Release 15), the maximum buffer size can be estimated as
Bsize=3*Nb*Nsc*L*Hpn*S*Bllr,
where S is the maximum number of data streams multiplexed via MIMO/beamforming and Bllr is the bit width of an LLR. Assuming Bllr=8 as defined in the LDPC Encoder/Decoder v2.0, LogiCORE IP Product Guide provided by Xilinx, the memory sizes Bsize of the soft buffer are listed in Table 1 for various values of S.
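As a worked illustration of the buffer-size estimate, the formula can be evaluated directly with the example parameters given in the text. This is a minimal sketch; the function name soft_buffer_bits and the chosen values of S are illustrative assumptions and are not taken from Table 1.

```python
# Example parameters from the text (100 MHz, 30 kHz subcarrier spacing).
N_B = 7.4063        # max raw information bits per subcarrier
N_SC = 273 * 12     # usable subcarriers per OFDM symbol
L_SYMBOLS = 14      # OFDM symbols per slot
H_PN = 16           # max HARQ processes per UE
B_LLR = 8           # LLR bit width
CODE_EXPANSION = 3  # coded bits per information bit for the rate-1/3 code


def soft_buffer_bits(streams: int) -> float:
    """Bsize = 3 * Nb * Nsc * L * Hpn * S * Bllr, in bits."""
    return CODE_EXPANSION * N_B * N_SC * L_SYMBOLS * H_PN * streams * B_LLR


if __name__ == "__main__":
    # Assumption: these stream counts are illustrative only.
    for s in (1, 2, 4, 8):
        mib = soft_buffer_bits(s) / 8 / 2**20
        print(f"S={s}: {mib:.1f} MiB")
```

The estimate scales linearly with S, so doubling the number of multiplexed streams doubles the soft buffer memory.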
Considering the large memory size required, the soft buffer can be implemented in DDR4 memory. The peak DDR bandwidth required to keep up with HARQ combining and the LDPC decoder can then be estimated as
BWddr=2*3*Nb*Nsc*L*S*Bllr/Tslot,
where Tslot is the time duration of a slot, e.g., 0.5 ms for the 100 MHz system with 30 kHz subcarrier spacing. Under the same assumptions as in section 3, the required bandwidths are listed in Table 1 for various values of S.
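The bandwidth estimate can be checked numerically in the same way. The sketch below evaluates the formula with the example parameters from the text; the function name ddr_peak_bandwidth_bps and the chosen values of S are illustrative assumptions and are not taken from Table 1.

```python
# Example parameters from the text (100 MHz, 30 kHz subcarrier spacing).
N_B = 7.4063        # max raw information bits per subcarrier
N_SC = 273 * 12     # usable subcarriers per OFDM symbol
L_SYMBOLS = 14      # OFDM symbols per slot
B_LLR = 8           # LLR bit width
T_SLOT = 0.5e-3     # slot duration in seconds


def ddr_peak_bandwidth_bps(streams: int, slot_s: float = T_SLOT) -> float:
    """BWddr = 2 * 3 * Nb * Nsc * L * S * Bllr / Tslot, in bits per second."""
    return 2 * 3 * N_B * N_SC * L_SYMBOLS * streams * B_LLR / slot_s


if __name__ == "__main__":
    # Assumption: these stream counts are illustrative only.
    for s in (1, 2, 4, 8):
        gbps = ddr_peak_bandwidth_bps(s) / 1e9
        print(f"S={s}: {gbps:.1f} Gbit/s")
```

Unlike the buffer size, the bandwidth does not depend on Hpn, because only one slot of LLRs per stream is processed in each slot interval.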
In one embodiment of this invention, to speed up the computation-intensive decoding, the decoding chain of a receiver is implemented on a processing board, referred to as an offload board, that is separate from the processor performing the other physical layer functions. The enhanced HARQ management described above is implemented on the offload board together with the decoding chain.
The embodiments of HARQ management of this invention achieve top performance compared with the reference simulation results provided in 3GPP Release 15, reduce the required buffer size, and dramatically reduce the number of memory accesses, resulting in a smaller pipeline depth and higher throughput.
An embodiment of this invention offloads an encoder to an offload processor, e.g., a co-processor board connected through the PCIe bus, and processes the full transmit (Tx) path, in addition to the encoder, in the offload processor. This frees up more CPU resources to process the receiving functions and saves transfer bandwidth, e.g., the PCIe bandwidth, between the CPU and the offload processor, because it eliminates the need for the offload processor to send the encoded bits back to the CPU for the rest of the Tx path processing. Because the Tx path is a fixed function, there is little advantage in locating it in the CPU, as is done in the prior art.
In prior art, Tx processing requires a latency budget of 4 slots, that is, processing for slot N starts at slot N-4 as shown in
Although the foregoing descriptions of the preferred embodiments of the present inventions have shown, described, or illustrated the fundamental novel features or principles of the inventions, it is understood that various omissions, substitutions, and changes in the form of the detail of the methods, elements or apparatuses as illustrated, as well as the uses thereof, may be made by those skilled in the art without departing from the spirit of the present inventions. Hence, the scope of the present inventions should not be limited to the foregoing descriptions. Rather, the principles of the inventions may be applied to a wide range of methods, systems, and apparatuses, to achieve the advantages described herein and to achieve other advantages or to satisfy other objectives as well.
This application claims the benefit of U.S. Provisional Application No. 62/939,637 filed on Nov. 24, 2019.