Not applicable.
Not applicable.
Embodiments of this invention are in the field of digital logic, and are more specifically directed to programmable logic suitable for use in computationally intensive applications such as low density parity check (LDPC) decoding.
High-speed data communication services, for example in providing high-speed Internet access, have become a widespread utility for many businesses, schools, and homes. In its current stage of development, this access is provided by an array of technologies. Recent advances in wireless communications technology have enabled localized wireless network connectivity according to the IEEE 802.11 standard to become popular for connecting computer workstations and portable computers to a local area network (LAN), and typically through the LAN to the Internet. Broadband wireless data communication technologies, for example those technologies referred to as “WiMAX” and “WiBro”, and those technologies according to the IEEE 802.16d/e standards, have also been developed to provide wireless DSL-like connectivity in the Metro Area Network (MAN) and Wide Area Network (WAN) context.
A problem that is common to all data communications technologies is the corruption of data by noise. As is fundamental in the art, the signal-to-noise ratio for a communications channel is a measure of the quality of the communications carried out over that channel, as it conveys the strength of the signal that carries the data (as attenuated over distance and time) relative to the noise present on that channel. These factors relate directly to the likelihood that a data bit or symbol as received differs from the data bit or symbol as transmitted. This likelihood of a data error is reflected by the error probability for the communications over the channel, commonly expressed as the Bit Error Rate (BER), the ratio of errored bits to total bits transmitted. In short, the likelihood of error in data communications must be considered in developing a communications technology. Techniques for detecting and correcting errors in the communicated data must be incorporated for the communications technology to be useful.
Error detection and correction techniques are typically implemented by the technique of redundant coding. In general, redundant coding inserts data bits into the transmitted data stream that do not add any additional information, but that indicate, on decoding, whether an error is present in the received data stream. More complex codes provide the ability to deduce the true transmitted data from a received data stream even if errors are present.
Many types of redundant codes that provide error correction have been developed. One type of code simply repeats the transmission, for example by sending the payload followed by two repetitions of the payload, so that the receiver deduces the transmitted data by applying a decoder that determines the majority vote of the three transmissions for each bit. Of course, this simple redundant approach does not necessarily correct every error, and it greatly reduces the payload data rate. In this example, a predictable likelihood exists that two of three bits are in error, resulting in an erroneous majority vote despite the useful data rate having been reduced to one-third. More efficient approaches, such as Hamming codes, have been developed toward the goal of reducing the error rate while maximizing the data rate.
The well-known Shannon limit provides a theoretical bound on the optimization of decoder error as a function of data rate. The Shannon limit provides a metric against which codes can be compared, both in the absolute sense and also in comparison with one another. Since the time of the Shannon proof, modern data correction codes have been developed to more closely approach the theoretical limit, and thus maximize the data rate for a given tolerable error rate. An important class of these conventional codes is referred to as the Low Density Parity Check (LDPC) codes. The fundamental paper describing these codes is Gallager, Low-Density Parity-Check Codes, (MIT Press, 1963), monograph available at http://www.inference.phy.cam.ac.uk/mackay/gallager/papers/. In these codes, a sparse matrix H defines the code, with the encodings c of the payload data satisfying:
Hc=0 (1)
over Galois field GF(2). Each encoding c consists of the source message ci combined with the corresponding parity check bits cp for that source message ci. The encodings c are transmitted, with the receiving network element receiving a signal vector r=c+n, n being the noise added by the channel. Because the decoder at the receiver also knows matrix H, it can compute a vector z=Hr. However, because r=c+n, and because Hc=0:
z=Hr=Hc+Hn=Hn (2)
The decoding process thus involves finding the most sparse vector x that satisfies:
Hx=z (3)
over GF(2). This vector x becomes the best guess for noise vector n, which can be subtracted from the received signal vector r to recover encodings c, from which the original source message ci is recoverable.
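The syndrome computation z=Hr over GF(2) reduces to exclusive-OR arithmetic; the following is a minimal sketch, using a small hypothetical parity check matrix for illustration rather than any code described herein:

```c
#include <stdint.h>

/* Toy illustration of the syndrome computation z = Hr over GF(2):
   each syndrome bit is the modulo-2 sum (XOR) of the received bits
   selected by the corresponding row of H. This 3x7 matrix is a
   hypothetical example. */
enum { M_ROWS = 3, N_COLS = 7 };

static const uint8_t H[M_ROWS][N_COLS] = {
    {1, 1, 0, 1, 1, 0, 0},
    {1, 0, 1, 1, 0, 1, 0},
    {0, 1, 1, 1, 0, 0, 1},
};

void syndrome(const uint8_t r[N_COLS], uint8_t z[M_ROWS])
{
    for (int m = 0; m < M_ROWS; m++) {
        uint8_t s = 0;
        for (int n = 0; n < N_COLS; n++)
            s ^= (uint8_t)(H[m][n] & r[n]);  /* GF(2) multiply-accumulate */
        z[m] = s;
    }
}
```

A valid codeword produces the all-zero syndrome; a single corrupted bit produces a syndrome equal to the corresponding column of H.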
As shown in
These modulated signals are converted into a serial sequence, filtered and converted to analog levels, and then transmitted over transmission channel C to receiving transceiver 20. The transmission channel C will of course depend upon the type of communications being carried out. In the wireless communications context, the channel will be the particular environment through which the wireless transmission takes place. Alternatively, in a DSL context, the transmission channel is physically realized by conventional twisted-pair wire. In any case, transmission channel C adds significant distortion and noise to the transmitted analog signal, which can be characterized in the form of a channel impulse response.
This transmitted signal is received by receiving transceiver 20, which, in general, reverses the processes of transmitting transceiver 10 to recover the information of the input bitstream. As shown contextually in
There are many known implementations of LDPC codes. Some of these LDPC codes have been described as providing code performance that approaches the Shannon limit, as described in MacKay et al., “Comparison of Constructions of Irregular Gallager Codes”, Trans. Comm., Vol. 47, No. 10 (IEEE, October 1999), pp. 1449-54, and in Tanner et al., “A Class of Group-Structured LDPC Codes”, ISTCA-2001 Proc. (Ambleside, England, 2001).
In theory, the encoding of data words according to an LDPC code is straightforward. Given sufficient memory or sufficiently small data words, one can store all possible code words in a lookup table, and look up the code word in the table corresponding to the data word to be transmitted. But modern data words to be encoded are on the order of 1 kbits and larger, rendering lookup tables prohibitively large and cumbersome. Accordingly, algorithms have been developed that derive codewords, in real time, from the data words to be transmitted. A straightforward approach for generating a codeword is to consider the n-bit codeword vector c in its systematic form, having a data or information portion ci and an m-bit parity portion cp such that the resulting codeword vector c=(ci|cp). Similarly, parity matrix H is placed into a systematic form Hsys, preferably in a lower triangular form for the m parity bits. In this conventional encoder, the information portion ci is filled with n-m information bits, and the m parity bits are derived by back-substitution with the systematic parity matrix Hsys. This approach is described in Richardson and Urbanke, “Efficient Encoding of Low-Density Parity-Check Codes”, IEEE Trans. on Information Theory, Vol. 47, No. 2 (February 2001), pp. 638-656. This article indicates that, through matrix manipulation, the encoding of LDPC codewords can be accomplished in a number of operations that approaches a linear relationship with the size n of the codewords.
More efficient LDPC encoders have been developed in recent years. An example of such an improved encoder architecture is described in U.S. Pat. No. 7,162,684, commonly assigned herewith and incorporated herein by this reference. The selecting of a particular codeword arrangement according to modern techniques is described in U.S. Patent Application Publication No. US 2006/0123277 A1, commonly assigned herewith and incorporated herein by this reference.
On the decoding side, it has been observed that high-performance LDPC code decoders are difficult to implement into hardware. While Shannon's adage holds that random codes are good codes, it is regularity that allows efficient hardware implementation. To address this difficult tradeoff between code irregularity and hardware efficiency, the well-known belief propagation technique provides an iterative implementation of LDPC decoding that can be made somewhat efficient, as described in Richardson, et al., “Design of Capacity-Approaching Irregular Low-Density Parity Check Codes,” IEEE Trans. on Information Theory, Vol. 47, No. 2 (February 2001), pp. 619-637; and in Zhang et al., “VLSI Implementation-Oriented (3,k)-Regular Low-Density Parity-Check Codes”, IEEE Workshop on Signal Processing Systems (September 2001), pp. 25-36. Belief propagation decoding algorithms are also referred to in the art as probability propagation algorithms, message passing algorithms, and as sum-product algorithms.
In summary, belief propagation algorithms are based on the binary parity check property of LDPC codes. As mentioned above and as known in the art, each check vertex in the LDPC code constrains its neighboring variables to form a word of even parity. In other words, the product of the correct LDPC code word vector with each row of the parity check matrix sums to zero. According to the belief propagation approach, the received data are used to represent the input probabilities at each input node (also referred to as a “bit node”) of a bipartite graph having input nodes and check nodes.
a illustrates an example of such a bipartite graph of the conventional belief propagation algorithm. In
Within each iteration of the belief propagation method, bit probability messages are passed from the input nodes V to the check nodes S, updated according to the parity check constraint, with the updated values sent back to and summed at the input nodes V. The summed inputs are formed into log likelihood ratios (LLRs) defined as:
L(c)=log[P(c=0)/P(c=1)]
where c is a coded bit received over the channel. The value of any given LLR L(c) can of course take negative and positive values, corresponding to 1 and 0 being more likely, respectively. The index c of the LLR L(c) indicates the variable node Vc to which the value corresponds, such that the value of LLR L(c) is a “soft” estimate of the correct bit value for that node. In its conventional implementation, the belief propagation algorithm uses two value arrays, a first array L storing the LLRs for j input nodes V, and a second array R storing the results of m parity check node updates, with m being the parity check row index and j being the column (or input node) index of the parity check matrix H. The general operation of this conventional approach determines, in a first step, the R values by estimating, for each check sum S (each row of the parity check matrix), the probability of the input node value from the other inputs used in that checksum. The second step of this algorithm determines the LLR probability values of array L by combining, for each column, the R values for that input node from parity check matrix rows in which that input node participated. A “hard” decision is then made from the resulting probability values, and is applied to the parity check matrix. This two-step iterative approach is repeated until the parity check matrix is satisfied (all parity check rows equal zero), or until another convergence criterion is reached, or until a terminal number of iterations has been executed.
In other words, the LDPC decoding process involves the iterative two-step process of:
In practice, the process begins with an initialized estimate for the LLRs L(rj), ∀j, using the received soft data. Typically, for AWGN channels, this initial estimate is
L(qj)=2rj2
as known in the art, where rj is the received soft symbol value for variable node Vj and σ2 is the variance of the channel noise. The values of check nodes S (i.e., the matrix rows) are also each initialized to zero (Rmj=0, for all m and all j), corresponding to the result for a correct codeword. The per-row (or extrinsic) LLR probabilities are then derived:
L(qmj)=L(qj)−Rmj (1)
for each column j of each row m of the checksum subset. As shown in
Amj=Σn≠jΨ(L(qmn)) (2)
for each input node Vj contributing to a given checksum row m. In effect, the amplitude Amj for a column j based on row m, is the sum of the values of a function of those estimates L(qmj) that contribute to the checksum for that row m, other than the estimate for column j itself. An example of a suitable function Ψ is:
Ψ(x)≡log(|tanh(x/2)|) (3)
A sign value smj is determined from:
smj=Πn≠jsgn[L(qmn)] (4)
which is simply an odd/even determination of the number of negative probabilities for a checksum m, excluding column j's own contribution to that checksum m. The updated estimate of each value Rmj then becomes:
Rmj=−smjΨ(Amj) (5)
The negative sign of value Rmj contemplates that the function Ψ is its own negative inverse. The value Rmj thus corresponds to an estimate of the LLR for input node Vj as derived from the other input nodes V that contributed to the mth row of the parity check matrix (check node Sm), not using the value for input node j itself. As shown in
Therefore, in the second step of each decoding iteration, the LLR estimates for each input node are updated over each matrix column (i.e., each input node V) as follows:
where the estimated value Rmj is the most recent update, from equation (5) in this derivation, summed over the other variable nodes V contributing to the checksum for row m, minus the original estimate of the value at variable node Vj. This column estimate L(qj) can then be used to make a “hard” decision check, as mentioned above, to determine whether the iterative belief propagation algorithm can be terminated.
In conventional communications systems, the function of LDPC decoding, specifically by way of the belief propagation algorithm, is typically implemented in a sequence of program instructions, as executed by programmable digital logic. For example, the implementation of LDPC decoding in a communications receiver by way of a programmable digital signal processor (DSP) device, such as a member of the C64x family of digital signal processors available from Texas Instruments Incorporated, is commonplace in the art. Following the above description of the belief propagation algorithm, the instructions involved in the updating of the check node values Rmj include the evaluation of equations (3) through (5). It is contemplated that the evaluation of the function Ψ will typically involve a look-up table access, or alternatively a straightforward arithmetic calculation of an estimate.
Each update also involves the evaluation of the sign value smj as indicated in equation (4); alternatively, this evaluation of the sign value smj may derive the negative sign value −smj, since this negative value is applied in equation (5) in each case. For the example of
s2,1=−sgn[L(q2,3)]*sgn[L(q2,6)]*sgn[L(q2,7)] (7a)
s2,3=−sgn[L(q2,1)]*sgn[L(q2,6)]*sgn[L(q2,7)] (7b)
s2,6=−sgn[L(q2,1)]*sgn[L(q2,3)]*sgn[L(q2,7)] (7c)
s2,7=−sgn[L(q2,1)]*sgn[L(q2,3)]*sgn[L(q2,6)] (7d)
where sgn is the “sign” function, returning the polarity of its respective argument. As evident from equations (7a) through (7d), each instance of sgn[L(qmj)] is used three times in these four equations. Accordingly, the set of four equations can be simplified, in the number of multiplications required, by evaluating a product P of all four sgn values:
P=−1*sgn[L(q2,1)]*sgn[L(q2,3)]*sgn[L(q2,6)]*sgn[L(q2,7)] (8)
and then calculating each sign value smj as the product of this product value P with the sign value of its own extrinsic LLR value L(qmj):
s2,1=P*sgn[L(q2,1)] (9a)
s2,3=P*sgn[L(q2,3)] (9b)
s2,6=P*sgn[L(q2,6)] (9c)
s2,7=P*sgn[L(q2,7)] (9d)
These sign values smj can then be multiplied by their respective amplitude function values Ψ(Amj) to derive the updated row values Rmj:
R2,1=s2,1*Ψ(A2,1) (10a)
R2,3=s2,3*Ψ(A2,3) (10b)
R2,6=s2,6*Ψ(A2,6) (10c)
R2,7=s2,7*Ψ(A2,7) (10d)
In general, for any row m and column j, the updated row value Rmj can thus be derived as:
Rmj=smj*Ψ(Amj) (10e)
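The shared-product optimization of equations (8) and (9a)-(9d) can be sketched as follows, for an assumed row weight of four; note that a zero-valued LLR is treated as positive here, a convention an actual implementation may refine:

```c
/* The shared-product optimization of equations (8)-(9): compute the sign
   of every extrinsic LLR once, form one global product P (including the
   leading -1 factor of equation (8)), then recover each per-column sign
   with a single extra multiply. The row weight of four is illustrative. */
#define W 4

/* sign convention assumed here: -1 for negative, +1 otherwise */
static int sgn_i(double x) { return (x < 0.0) ? -1 : 1; }

void row_signs(const double Lqm[W], int s[W])
{
    int P = -1;                    /* equation (8): leading -1 factor */
    for (int j = 0; j < W; j++)
        P *= sgn_i(Lqm[j]);
    for (int j = 0; j < W; j++)
        s[j] = P * sgn_i(Lqm[j]);  /* equations (9a)-(9d) */
}
```

Because sgn squared is unity, multiplying P by a column's own sign removes that column's contribution, reproducing the per-column products of equations (7a)-(7d) with far fewer multiplications.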
As mentioned above, these calculations are typically done via software, executed by a DSP device, in conventional receiving equipment that is carrying out LDPC decoding. As known in the art, most instruction sets (including those of the C64x DSP devices available from Texas Instruments Incorporated) include a “SGN” function, implementing the evaluation z=SGN(x). This z=SGN(x) function can be defined arithmetically as follows:
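The arithmetic definition itself is not reproduced above; a common three-valued convention, which may differ from the C64x instruction's exact treatment of zero, can be sketched in C as:

```c
/* A common arithmetic definition of z = SGN(x): +1 for positive x,
   -1 for negative x, and 0 for x == 0. The branch-free form below is
   an illustrative sketch, not the DSP instruction's definition. */
int sgn_fn(int x)
{
    return (x > 0) - (x < 0);
}
```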
As mentioned above, this LDPC decoding operation is conventionally executed by DSP devices, such as a member of the C64x family of DSPs available from Texas Instruments Incorporated. This conventional operation can be coded in C64x assembly code as follows:
As evident from this assembly code, nine C64x DSP assembly instructions are required to carry out the operation of equation 10(e) to update the row value Rmj for a single row m and column j in the decoding process. The latency of each of the non-conditional instructions in this sequence is one machine cycle each; any of the conditional instructions, if executed, have a latency of six cycles according to the C64x DSP architecture. The maximum machine cycle latency for this sequence is therefore eighteen machine cycles, for the case in which B2 is set (i.e., SGN(X) is negative and the attribute value Y is at its maximum negative value).
Machine cycle latency is an important issue, of course, especially in time-sensitive operations such as LDPC decoding, for example such decoding of real-time communications (e.g., VoIP telephony). Another important issue in considering the efficiency and performance of the LDPC decoding process is the number of calculations required to carry out this operation for a typical LDPC code word. For example, under the IEEE 802.16e WiMAX communications standard, a typical code has a ¾ code rate, with a codeword size of 2304 bits and 576 checksum nodes; in this case, as many as fifteen input nodes V may contribute to a given checksum node S (i.e., the maximum row weighting is fifteen). For this example, assuming a modest number of fifty LDPC decoding iterations, the evaluation of equation (10e) for a single code word requires 3,888,000 machine cycles. This level of computational effort is, of course, substantial for time-critical applications such as LDPC decoding.
By way of further background, the LDPC decoding process above involves another costly process, as measured by machine cycles. Specifically, it is known in the art to evaluate the amplitude Amj by evaluating equations (2) and (3) as:
Amj(x,y)=sgn(x)sgn(y)min(|x|,|y|)+log(1+e−|x+y|)−log(1+e−|x−y|) (11)
with the sgn(x) function defined as above.
The sign-product portion of equation (11), namely the function:
ƒ(x,y)=sgn(x)sgn(y) (12)
requires the calling and executing of several functions. For example, a conventional C code sequence for this function ƒ(x,y)=z=sgn(x)sgn(y) in equation (12) can be written:
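The conventional C sequence is not reproduced above; a plausible reconstruction, exhibiting the two data-dependent branches that make the sequence costly on a pipelined DSP, is:

```c
/* A plausible reconstruction of a conventional branching C sequence
   for z = sgn(x)*sgn(y); the original listing is not reproduced in
   this text. Zero operands are treated as positive here. The two
   conditional branches are what incur the latency discussed below. */
int sign_product(int x, int y)
{
    int z = 1;
    if (x < 0)
        z = -z;
    if (y < 0)
        z = -z;
    return z;
}
```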
This sequence can be written in C64x assembly code as follows:
The evaluation of the function ƒ(x,y)=z=sgn(x)sgn(y), as part of the evaluation of equation (11), thus requires the execution of six instructions, and involves a latency of eleven machine cycles, considering that the conditional MVK instruction itself has a latency of six machine cycles. But this sequence must be repeated many times in the LDPC decoding of each code word, specifically in each row update iteration. For the example used above for the IEEE 802.16e WiMAX communications standard, at a ¾ code rate, with a codeword size of 2304 bits and 576 checksum nodes, and a maximum row weighting of fifteen, the number of machine cycles required for the function of equation (12) amounts to about 2,592,000 machine cycles (50×576×15×6).
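The cycle totals quoted in this and the preceding discussion follow from a simple product of the stated assumptions (fifty iterations, 576 checksum rows, row weight fifteen, and the per-update instruction count), as the following check illustrates:

```c
/* Worked check of the cycle counts quoted in the text:
   iterations x checksum rows x row weight x instructions per update. */
long cycles(long iterations, long rows, long row_weight, long instr_per_update)
{
    return iterations * rows * row_weight * instr_per_update;
}
```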
Embodiments of this invention provide a method and circuitry that improve the efficiency of redundant code decoding in modern digital circuitry, particularly such decoding as performed iteratively.
Embodiments of this invention provide such a method and circuitry that can reduce the number of machine cycles required to perform a calculation useful in such decoding.
Embodiments of this invention provide such a method and circuitry that can reduce the machine cycle latency for such decoding calculations.
Embodiments of this invention provide such a method and circuitry that can be used in place of calculations in general arithmetic and logic instructions.
Embodiments of this invention provide such a method and circuitry that can be efficiently implemented into programmable digital logic, by way of instructions and dedicated logic for executing those instructions.
Embodiments of the invention may be implemented into an instruction executed by programmable digital logic circuitry, and into a circuit within such digital logic circuitry. The instruction has two arguments, one argument being a signed value, the sign of which determines whether to invert the sign of a second argument, which is also a signed value. The instruction returns a value that has a magnitude equal to that of the second argument, and that has a sign based on the sign of the second argument, inverted if the sign of the first argument is negative.
Embodiments of the invention may also be implemented in circuitry for executing this instruction, in the form of a first multiplexer for selecting between the second argument and a positive maximum value, depending on a comparison of the second argument value relative to a negative maximum value, and a second multiplexer for selecting between the second argument value itself and the output of the first multiplexer, depending on the sign of the first argument.
Embodiments of the invention may also be implemented into another instruction executed by programmable digital logic circuitry, and into a circuit within such digital logic circuitry. This instruction has two arguments, both signed values. An exclusive-OR of the sign bits of the two arguments controls a multiplexer to select between a 2's-complement “1” value for the desired level of precision (e.g., 0b00000001) or a 2's-complement “−1” value (e.g., 0b11111111). Circuitry can be constructed to perform this operation in a single machine cycle, by way of a single bit XOR and a multiplexer. This circuitry can be easily parallelized for wide data path processors.
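The sign-product operation described above can be modeled behaviorally as follows; the sketch assumes 8-bit 2's-complement operands and treats a zero operand as positive, which may differ from the conventions of an actual realization:

```c
#include <stdint.h>

/* Behavioral model of the sign-product operation described above:
   the XOR of the two sign bits selects between the 2's-complement
   -1 (0xFF) and +1 (0x01) constants. The 8-bit width is illustrative;
   the selection parallelizes trivially for wider data paths. */
int8_t sign_mul8(int8_t x, int8_t y)
{
    uint8_t sx = (uint8_t)x >> 7;   /* sign bit of x */
    uint8_t sy = (uint8_t)y >> 7;   /* sign bit of y */
    return (int8_t)((sx ^ sy) ? 0xFF : 0x01);
}
```

Because the entire operation is a single-bit XOR driving a constant multiplexer, it corresponds to a single machine cycle in hardware, as stated above.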
a is a diagram, in Tanner diagram form, of a conventional LDPC decoder according to a belief propagation algorithm.
b is a plot of the evaluation of a log function, and an estimate for the log function, in conventional LDPC decoding.
a and 6b are register-level diagrams illustrating the arrangement of logic blocks within the DSP co-processor of
c is a register-level diagram illustrating the arrangement of logic blocks within the DSP co-processor of
The invention will be described in connection with its preferred embodiment, namely as implemented into programmable digital signal processing circuitry in a communications receiver. However, it is contemplated that this invention will also be beneficial when implemented into other devices and systems, and when used in other applications that utilize the types of calculations performed by this invention. Accordingly, it is to be understood that the following description is provided by way of example only, and is not intended to limit the true scope of this invention as claimed.
Wireless network adapter 25 in this example includes digital signal processor (DSP) subsystem 35, coupled to host interface 32. The construction of DSP subsystem 35 in connection with this preferred embodiment of the invention will be described in further detail below. In this embodiment of the invention, DSP subsystem 35 carries out functions involved in baseband processing of the data signals to be transmitted over the wireless network link, and data signals received over that link. In that regard, this baseband processing includes encoding and decoding of the data according to a low density parity check (LDPC) code, and also digital modulation and demodulation for transmission of the encoded data, in the well-known manner for orthogonal frequency division multiplexing (OFDM) or other modulation schemes, according to the particular protocol of the communications being carried out. In addition, DSP subsystem 35 also preferably performs Medium Access Controller (MAC) functions, to control the communications between network adapter 25 and various applications, in the conventional manner.
Transceiver functions are realized by network adapter 25 by the communication of digital data between DSP subsystem 35 and digital up/down conversion function 34. Digital up/down conversion functions 34 perform conventional digital up-conversion of data to be transmitted from baseband to an intermediate frequency, and digital down-conversion of received data from the intermediate frequency to baseband, in the conventional manner. An example of a suitable integrated circuit for digital up/down conversion function 34 is the GC5016 digital up-converter and down-converter integrated circuit available from Texas Instruments Incorporated. Up-converted data to be transmitted is converted from a digital form to the analog domain by digital-to-analog converters 33D, and applied to intermediate frequency transceiver 36; conversely, intermediate frequency analog signals corresponding to those received over the network link are converted into the digital domain by analog-to-digital converters 33A, and applied to digital up/down conversion function 34 for conversion into the baseband. Intermediate frequency transceiver 36 may be realized, for example, by the TRF2432 dual-band intermediate frequency transceiver integrated circuit available from Texas Instruments Incorporated.
Radio frequency (RF) “front end” circuitry 38 is also provided within wireless network adapter 25, in this implementation of the preferred embodiments of the invention. As known in the art, RF front end 38 includes such analog functions as analog filters, additional up-conversion and down-conversion functions to convert intermediate frequency signals into and out of the high frequency RF signals (e.g., at Gigahertz frequencies, for WiMAX communications) in the conventional manner, and power amplifiers for transmission and receipt of RF signals via antenna A. An example of RF front end 38 suitable for use in connection with this preferred embodiment of the invention is the TRF2436 dual-band RF front end integrated circuit, available from Texas Instruments Incorporated.
Referring now to
DSP subsystem 35 includes DSP core 40, which is a full performance digital signal processor (DSP), such as a member of the C64x family of digital signal processors available from Texas Instruments Incorporated. As known in the art, this family of DSPs is of the Very Long Instruction Word (VLIW) type, capable, for example, of pipelining eight simple, general purpose instructions in parallel. This architecture has been observed to be particularly well suited for operations involved in the modulation and demodulation of large data block sizes, as involved in digital communications. In this example, DSP core 40 is in communication with local bus LBUS, to which data memory resource 42 and program memory resource 44 are connected in the example of
According to this preferred embodiment of the invention, DSP co-processor 48 is also provided within DSP subsystem 35, and is also coupled to local bus LBUS. DSP co-processor 48 is realized by programmable logic for carrying out the iterative, repetitive, and preferably parallelized, operations involved in LDPC decoding (and, to the extent applicable for transceiver 20, LDPC encoding of data to be transmitted). As such, DSP co-processor 48 appears to DSP core 40 as a traditional co-processor, which DSP core 40 accesses by forwarding to DSP co-processor 48 a higher-level instruction (e.g., DECODE) for execution, along with a pointer to data memory 42 for the data upon which that instruction is to be executed, and a pointer to data memory 42 to the destination location for the results of the decoding.
According to this preferred embodiment of the invention, DSP co-processor 48 includes its own LDPC program memory 54, which stores instruction sequences for carrying out LDPC decoding operations to execute the higher-level instructions forwarded to DSP co-processor 48 from DSP core 40. DSP co-processor 48 also includes register bank 56, or another memory resource or data store, for storing data and results of its operations. In addition, DSP co-processor 48 includes logic circuitry for fetching, decoding, and executing instructions and data involved in its LDPC operations, in response to the higher-level instructions from DSP core 40. For example, as shown in
According to the preferred embodiment of the invention, DSP co-processor 48 includes SGNFLIP logic circuitry 50, which is specific logic circuitry for executing a SGNFLIP instruction useful in the LDPC decoding of a data word. And, according to this preferred embodiment of the invention, SGNFLIP logic circuitry 50 is arranged so the SGNFLIP instruction is executed with minimum latency, and with minimum machine cycles, greatly improving the efficiency of the overall LDPC decoding operation.
According to the preferred embodiment of this invention, the SGNFLIP instruction is an instruction, executable by DSP co-processor 48 or by other programmable digital logic, that performs the function:
SGNFLIP (x, y)=sgn(x)*y
where x and y are n-bit operands, for example as stored in a location of register bank 56 of DSP co-processor 48 (or a register in such other programmable digital logic executing the SGNFLIP instruction). Also according to this preferred embodiment of the invention, an absolute value function (e.g., an ABS(x) instruction) can be evaluated by executing the SGNFLIP instruction using the same operand x as both arguments in the function:
SGNFLIP (x, x)=sgn(x)*x=|x|
In this case, if x is a negative value, multiplying x by its negative sign will return a result equal to the positive magnitude of x; of course, if x is positive, the result will also be the positive magnitude of x.
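A behavioral model of the SGNFLIP instruction can be sketched in C for 16-bit 2's-complement operands, including saturation of the one non-invertible value −2(n−1); the operand width and the treatment of x=0 as positive are assumptions for illustration, not a definition of the instruction:

```c
#include <stdint.h>

/* Behavioral sketch of SGNFLIP(x, y) = sgn(x)*y for 16-bit
   2's-complement operands. A non-negative x (modeled here as a clear
   sign bit) passes y through unchanged; a negative x negates y, with
   -(-2^15) saturating to 2^15 - 1 since it is not representable. */
int16_t sgnflip16(int16_t x, int16_t y)
{
    if (x >= 0)
        return y;             /* sign of y unchanged */
    if (y == INT16_MIN)
        return INT16_MAX;     /* saturate the non-invertible value */
    return (int16_t)-y;       /* 2's-complement negation */
}
```

As the description above notes, the absolute value follows from the same operation with a repeated operand: sgnflip16(x, x) returns |x| (saturated for the maximum negative value).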
According to this invention, SGNFLIP logic circuitry 50 is arranged to execute this SGNFLIP instruction in an especially efficient manner.
Logic block 55 receives an n-bit digital word (e.g., n=16) corresponding to operand y at one input, and receives the most significant bit of operand x at another input. In this realization, as will become evident from this description, logic block 55 carries out its operations using 2's-complement integer arithmetic. The digital word corresponding to operand y is applied to bit inversion function 60, which inverts the state of each bit of operand y, bit-by-bit. This bit inverted operand y is applied to incrementer 61, which effectively adds a binary “1” value, producing an n-bit value corresponding to the 2's-complement arithmetic inverse of operand y. This inverse value is applied to one input of multiplexer 62, specifically to the input that is selected by multiplexer 62 in response to a “0” value at its control input. The second input of multiplexer 62, specifically the input selected in response to a “1” value at the control input of multiplexer 62, is the maximum positive value for an n-bit 2's-complement word, namely 2(n−1)−1.
The digital word corresponding to operand y is also applied to comparator 64, which compares its value against the maximum negative value for an n-bit 2's-complement digital word, namely −2(n−1). The output of comparator 64 is applied to the control input of multiplexer 62. If operand y represents this maximum negative value, comparator 64 presents a “1” value (i.e., TRUE) to the control input of multiplexer 62; if operand y represents a value other than the maximum negative value, it presents a “0” value (i.e., FALSE) to that input.
The output of multiplexer 62 is applied to one input of multiplexer 65, specifically the input selected by a “1” value at the control input of multiplexer 65. The digital word representing operand y itself is presented to another input of multiplexer 65, specifically the input selected by a “0” value at the control input of multiplexer 65. The sign bit (i.e., the MSB of the n-bit 2's-complement word) of operand x is applied to the control input of multiplexer 65. The output of multiplexer 65 presents the output of logic block 55, as a digital word representing the value of SGNFLIP(x, y).
In operation, operand y itself is presented at one input of multiplexer 65, and multiplexer 62 presents the 2's-complement arithmetic inverse of operand y (as produced by bit inversion 60 and incrementer 61) to a second input of multiplexer 65. The special case in which operand y equals the 2's-complement maximum negative value is handled by comparator 64, which instructs multiplexer 62 to select the hard-wired 2's-complement maximum positive value in that event. As such, multiplexer 65 is presented with the value of operand y and its arithmetic inverse, and selects between these inputs in response to the sign bit of operand x.
Considering the construction of logic block 55 as shown in
The SGNFLIP(x, y) function can be expressed in conventional assembly language format by way of an instruction with register locations as its arguments:
a illustrates the operation of the SGNFLIP instruction according to this preferred embodiment of the invention, as a register-level diagram. As shown in
As discussed above in the Background of the Invention, LDPC decoding involves the evaluation of Rmj=smj*Ψ(Amj) in the row update process, in which the values Rmj are recalculated for each updated column estimate for the input nodes, or variable nodes, contributing to that row of the parity check matrix. As such, the SGNFLIP instruction evaluates this function applying Ψ(Amj) for a given row and column as the y operand, and the sign value smj as the x operand. As also discussed above, conventional assembly code requires nine C64x DSP assembly instructions and thus nine machine cycles to carry out that function, for a single row m and column j. In IEEE 802.16e WiMAX communications, this conventional approach to evaluation of the function z=SGN(x)*Ψ(Amj) requires 3,888,000 machine cycles for each code word, in the case of a ¾ code rate with a codeword size of 2304 bits and 576 checksum nodes, and in which the maximum row weighting is fifteen, assuming fifty iterations to convergence.
On the other hand, according to this embodiment of the invention, only a single machine cycle is required for execution of the SGNFLIP instruction by DSP co-processor 48. In LDPC decoding of the same 802.16e codeword of 2304 bits, with 576 checksum nodes, a ¾ code rate, and maximum row weighting of fifteen, only 432,000 machine cycles are required, over the same fifty iterations. In addition, the total latency for this operation is reduced from a maximum of eighteen machine cycles for the conventional case, to a single machine cycle. Other code rates, codeword sizes, etc. will also see a reduction in the computational time by a factor of nine, according to this embodiment of the invention.
As mentioned above, logic block 55 is described as operating on sixteen-bit digital words, one at a time. However, many modern DSP integrated circuits and other programmable logic have much wider datapaths than sixteen bits. For example, it is contemplated that some modern processors, including DSPs, have realized, or will realize, data paths as wide as 128 bits for each data word, covering eight sixteen-bit data words.
It has been discovered, according to this preferred embodiment of the invention, that LDPC decoding row update operations, including the SGNFLIP function, can be readily parallelized, in that each data value used in each row update operation is independent and not affected by other data values. In other words, the column updates for an iteration are performed and are complete prior to initiating the next row update operation using those column updates. Accordingly, SGNFLIP logic circuitry 50 of DSP co-processor 48 can be realized by way of eight parallel logic blocks 55, each operating independently on their own individual sixteen-bit data words.
It is also contemplated that this parallelism can be easily generalized for other data word widths fitting within the ultra-wide data path. For example, if the data word (i.e., operand precision) is thirty-two bits in width, each pair of logic blocks 55 can be combined into a single thirty-two bit logic block, providing four thirty-two bit SGNFLIP operations in parallel within SGNFLIP logic circuitry 50. It is contemplated that the logic involved in selectably combining pairs of logic blocks 55 can be readily derived by those skilled in the art having reference to this specification, for a given desired data path width, operand precision, and number of operations to be performed in parallel.
According to another preferred embodiment of the invention, DSP co-processor 48 includes SGNPROD logic circuitry 51, which is specific logic circuitry for executing a SGNPROD instruction that is also useful in the LDPC decoding of a data word. As will be described in further detail below, according to this preferred embodiment of the invention, this SGNPROD instruction can be executed with minimum latency, and with minimum machine cycles. The efficiency of the LDPC decoding process can also be improved by way of this SGNPROD logic circuitry 51.
In addition, those skilled in the art having reference to this specification will readily recognize that SGNPROD logic circuitry 51 can be realized in combination with SGNFLIP logic circuitry 50 described above. Alternatively, either of SGNPROD logic circuitry 51 and SGNFLIP logic circuitry 50 may be implemented individually, without the presence of the other, if the LDPC or other DSP operations to be performed by DSP co-processor 48 warrant; furthermore, either or both of these logic circuitry functions may be realized within DSP core 40, or in some other arrangement as desired for the particular application.
According to the preferred embodiment of this invention, the SGNPROD instruction is an instruction that is executable by DSP co-processor 48, or alternatively by other programmable digital logic, to evaluate the function:
SGNPROD(x, y)=sgn(x)*sgn(y)
where x and y are n-bit operands, for example as stored in a location of register bank 56 of DSP co-processor 48 (or a register in such other programmable digital logic executing the SGNPROD instruction). This SGNPROD function returns a value of +1, if the signs of operands x, y are both positive or both negative, or a value of −1, if the signs of operands x, y are opposite from one another; this result is preferably communicated as a 2's-complement value (i.e., 0b00000001 for +1, and 0b11111111 for −1).
Logic block 65 receives n-bit digital words (e.g., n=8) corresponding to operands x and y at its inputs. As suggested in
In operation, therefore, logic block 65 produces either the 2's-complement word for the value +1 or the 2's-complement word for the value −1 in response to the exclusive-OR of the sign bits of operands x and y, which corresponds to the product of these two signs. Considering that the construction of logic block 65 involves only a single logic function (exclusive-OR function 67) and a single multiplexer (multiplexer 68) with hard-wired inputs, the time required for evaluation of SGNPROD(x, y) amounts only to the propagation delays of the signals through these two circuits. The execution of the SGNPROD instruction can therefore be accomplished well within a single machine cycle, with a latency of only a single machine cycle.
The SGNPROD(x, y) function can be expressed in conventional assembly language format by way of an instruction with register locations as its arguments:
It is contemplated that the register-level representation of the SGNPROD function executed by logic block 65 will correspond to that shown for the SGNFLIP instruction in
As mentioned above, logic block 65 is described as operating on two digital words at a time. However, as discussed above, many modern DSP integrated circuits and other programmable logic have very wide datapaths. Therefore, as in the case of SGNFLIP logic circuitry 50 described above relative to
Referring now to
The architecture of DSP co-processor 48, as shown in
Referring to cluster 700 by way of example (it being understood that cluster 701 is similarly constructed), six sub-clusters 72L0, 74L0, 76L0, 72R0, 74R0, 76R0 are present within cluster 700. According to this implementation, each sub-cluster 72L0, 74L0, 76L0, 72R0, 74R0, 76R0 is constructed to execute certain generalized arithmetic or logic instructions in common with the other sub-clusters 72L0, 74L0, 76L0, 72R0, 74R0, 76R0, and is also constructed to perform certain instructions with particular efficiency. For example, as suggested by
According to this implementation, each sub-cluster 72L0, 74L0, 76L0, 72R0, 74R0, 76R0 is itself realized by multiple execution units. By way of example,
According to the preferred embodiments of the invention, SGNFLIP logic circuitry 50 and SGNPROD logic circuitry 51 can be implemented in each of main execution unit 90 and secondary execution unit 94, in each of sub-clusters 72L0, 74L0, 76L0, 72R0, 74R0, 76R0 in cluster 700; by extension, each of sub-clusters 72L1, 74L1, 76L1, 72R1, 74R1, 76R1 of cluster 701 can also have two instances of each of SGNFLIP logic circuitry 50 and SGNPROD logic circuitry 51. Alternatively, SGNFLIP logic circuitry 50 and SGNPROD logic circuitry 51 can be realized in only one type of sub-clusters 72L0, 74L0, 76L0, 72R0, 74R0, 76R0, for example only in arithmetic sub-clusters 74L0, 74R0, if desired. Furthermore, as described above relative to
Referring back to
Each sub-cluster 72, 74, 76 in cluster 700 is bidirectionally connected to crossbar switch 760. Crossbar switch 760 manages the communication of data into, out of, and within cluster 700, by coupling individual ones of the sub-clusters 72, 74, 76 to another sub-cluster within cluster 700, or to a memory resource. As discussed above, these memory resources include global memory (left) 82L and global memory (right) 82R. As evident in
According to this architecture, global register files 80 provide faster data communication among clusters 70. As shown in
It is contemplated that the architecture of DSP co-processor 48 described above relative to
While the invention has been described according to its preferred embodiments, it is of course contemplated that modifications of, and alternatives to, these embodiments, such modifications and alternatives obtaining the advantages and benefits of this invention, will be apparent to those of ordinary skill in the art having reference to this specification and its drawings. It is contemplated that such modifications and alternatives are within the scope of this invention as subsequently claimed herein.