The present application relates generally to a wireless communication device and, more specifically, to performing a multiply-accumulate process using input data received by an efficient multiply-accumulate processor for software defined radio.
Wireless communications utilize digital filters for signal processing. In signal processing, implementing a digital filter with a general purpose (GP) central processing unit (CPU) or Digital Signal Processor (DSP) whose power consumption is too high is a low efficiency solution for Finite Impulse Response (FIR) filtering and Fast Fourier Transform (FFT) processing. The WiXLE BB system is OFDM-based and requires the data symbols (post modulation) to be converted to the time domain by performing an Inverse Discrete Fourier Transform (IDFT). The OFDM numerology of the WiXLE system consists of $2^{9}=512$ sub-channels, which means that for every 512 symbols there are 512 corresponding sub-carriers. Since the number of data symbols is a power of 2, a 512-point Inverse Fast Fourier Transform (IFFT) algorithm can be applied on the transmitter side instead of the IDFT, and the corresponding 512-point FFT is used on the receiver side instead of the DFT. The reason for using the FFT instead of the DFT in this case is the reduced implementation complexity: the FFT algorithm complexity is $O(N\log N)$, where N is the number of FFT points (i.e., 512), while the DFT complexity is $O(N^{2})$. However, in the case of WiXLE, where the expected data rate is on the order of tens of Gbps, which dictates an extremely short OFDM symbol time, the FFT implementation is required to be extremely power efficient while still providing the highest BER performance. Further, there are several critical parameters, which are not independent of each other, that impact the FFT power efficiency.
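The complexity comparison above can be made concrete with a small operation count. The following is a minimal sketch, assuming a direct DFT that uses N complex multiplications per output and a radix-2 FFT organized as log2(N) stages of N/2 butterflies; the function names are illustrative and do not appear in this disclosure.

```python
def dft_complex_mults(n: int) -> int:
    """Direct DFT: each of the n outputs needs n complex multiplications."""
    return n * n


def fft_radix2_butterflies(n: int) -> int:
    """Radix-2 FFT: log2(n) stages, each with n/2 butterflies (n a power of 2)."""
    stages = n.bit_length() - 1  # exact log2 for a power of two
    return (n // 2) * stages


n = 512
print(dft_complex_mults(n))       # 262144
print(fft_radix2_butterflies(n))  # 2304
```

Under these assumptions the 512-point case needs roughly two orders of magnitude fewer butterfly operations than the direct transform needs multiplications.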
A Multiply-Accumulate (MAC) processor machine is provided. The MAC processor machine includes an input interface configured to receive a number N of data symbols, the number N of data symbols being a power of 2. The MAC processor machine includes a number of multiply-accumulate (MAC) blocks. Each MAC block is configured to, in response to receiving a pair of data symbols, execute a butterfly algorithm, generating a corresponding pair of intermediate results by calculating complex products and sums using the received pair of data symbols and twiddle factors. The MAC processor machine includes a memory configured to store the N received data symbols, the twiddle factors, and the intermediate results of the butterfly algorithm. The MAC processor machine includes a configurable instruction set digital signal processor core configured to: select and read at least one pair of the N received data symbols from a location in the memory; input each of the selected pair of the N received data symbols to the MAC blocks; write, to the location, the intermediate results the MAC blocks generated using the selected at least one pair of the N received data symbols; and output N binary symbols. Each binary symbol output from the MAC processor machine corresponds to an order of the output and corresponds to a bit-reversal of a corresponding input from the selected pair of received data symbols.
A FFT CRISP machine for performing FFT and FIR filter processes is provided. The FFT CRISP machine includes an input interface configured to receive a number N of data symbols, the number N of data symbols being a power of 2. The FFT CRISP machine includes a number of multiply-accumulate (MAC) blocks. Each MAC block is configured to, in response to receiving a pair of data symbols, execute a butterfly algorithm, generating a corresponding pair of intermediate results by calculating complex products and sums using the received pair of data symbols and twiddle factors. The FFT CRISP machine further includes a memory configured to store the N received data symbols, the twiddle factors, and the intermediate results of the butterfly algorithm. The FFT CRISP machine includes a configurable instruction set digital signal processor core configured to execute the FFT process by: selecting and reading at least one pair of the N received data symbols from a location in the memory; inputting each of the selected pair of the N received data symbols to the MAC blocks; writing, to the location, the intermediate results the MAC blocks generated using the selected at least one pair of the N received data symbols; and outputting N binary symbols as a FFT of the received N data symbols, each binary symbol corresponding to an order of the output and corresponding to a bit-reversal of a corresponding input from the selected pair of received data symbols.
A method of computing a Fast Fourier Transform (FFT) of data symbols inputted to a FFT context-based reconfigurable instruction set processor (CRISP) machine is provided. The method includes receiving a number N of the data symbols into an input interface of the FFT CRISP machine, the number N being a power of 2. The method includes in response to receiving a pair of data symbols by a number of multiply-accumulate (MAC) blocks, executing a butterfly algorithm, generating a corresponding pair of intermediate results by calculating complex products and sums using the received pair of data symbols and twiddle factors. The method also includes storing the N received data symbols, the twiddle factors, and the intermediate results of the butterfly algorithm in a memory. The method includes selecting and reading, by a configurable instruction set digital signal processor core, at least one pair of the N received data symbols to read from a location in the memory. Also, the method includes inputting each of the selected pair of the N received data symbols to the MAC blocks. The method includes writing, to the location, the intermediate results the MAC blocks generated using the selected at least one pair of the N received data symbols. The method includes outputting N binary symbols, each binary symbol corresponding to an order of the output and corresponding to a bit-reversal of a corresponding input from the selected pair of received data symbols.
Before undertaking the DETAILED DESCRIPTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or,” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like; and the term “controller” means any device, system or part thereof that controls at least one operation, such a device may be implemented in hardware, firmware or software, or some combination of at least two of the same. It should be noted that the functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. Definitions for certain words and phrases are provided throughout this patent document, those of ordinary skill in the art should understand that in many, if not most instances, such definitions apply to prior, as well as future uses of such defined words and phrases.
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts:
The wireless network 100 includes base station (BS) 101, base station (BS) 102, base station (BS) 103, and other similar base stations (not shown). Base station 101 is in communication with base station 102 and base station 103. Base station 101 is also in communication with Internet 130 or a similar IP-based network (not shown).
Base station 102 provides wireless broadband access (via base station 101) to Internet 130 to a first plurality of mobile stations within coverage area 120 of base station 102. The first plurality of mobile stations includes mobile station 111, which can be located in a small business (SB), mobile station 112, which can be located in an enterprise (E), mobile station 113, which can be located in a WiFi hotspot (HS), mobile station 114, which can be located in a first residence (R), mobile station 115, which can be located in a second residence (R), and mobile station 116, which can be a mobile device (M), such as a cell phone, a wireless laptop, a wireless PDA, or the like.
Base station 103 provides wireless broadband access (via base station 101) to Internet 130 to a second plurality of mobile stations within coverage area 125 of base station 103. The second plurality of mobile stations includes mobile station 115 and mobile station 116. In an exemplary embodiment, base stations 101-103 communicate with each other and with mobile stations 111-116 using orthogonal frequency division multiplexing (OFDM) or orthogonal frequency division multiple access (OFDMA) techniques.
Base station 101 can be in communication with either a greater number or a lesser number of base stations. Furthermore, while only six mobile stations are depicted in
Mobile stations 111-116 access voice, data, video, video conferencing, and/or other broadband services via Internet 130. In an exemplary embodiment, one or more of mobile stations 111-116 is associated with an access point (AP) of a WiFi WLAN. Mobile station 116 can be any of a number of mobile devices, including a wireless-enabled laptop computer, personal data assistant, notebook, handheld device, or other wireless-enabled device. Mobile stations 114 and 115 can be, for example, a wireless-enabled personal computer (PC), a laptop computer, a gateway, or another device.
The transmit path in BS 102 includes channel coding and modulation block 205, serial-to-parallel (S-to-P) block 210, Size N Inverse Fast Fourier Transform (IFFT) block 215, parallel-to-serial (P-to-S) block 220, add cyclic prefix block 225, and up-converter (UC) 230. The receive path in MS 116 comprises down-converter (DC) 255, remove cyclic prefix block 260, serial-to-parallel (S-to-P) block 265, Size N Fast Fourier Transform (FFT) block 270, parallel-to-serial (P-to-S) block 275, and channel decoding and demodulation block 280.
At least some of the components in
In BS 102, channel coding and modulation block 205 receives a set of information bits, applies LDPC coding and modulates (e.g., QPSK, QAM) the input bits to produce a sequence of frequency-domain modulation symbols. Serial-to-parallel block 210 converts (i.e., de-multiplexes) the serial modulated symbols to parallel data to produce N parallel symbol streams where N is the IFFT/FFT size used in BS 102 and MS 116. Size N IFFT block 215 then performs an IFFT operation on the N parallel symbol streams to produce time-domain output signals. Parallel-to-serial block 220 converts (i.e., multiplexes) the parallel time-domain output symbols from Size N IFFT block 215 to produce a serial time-domain signal. Add cyclic prefix block 225 then inserts a cyclic prefix to the time-domain signal. Finally, up-converter 230 modulates (i.e., up-converts) the output of add cyclic prefix block 225 to RF frequency for transmission via a wireless channel. The signal can also be filtered at baseband before conversion to RF frequency.
The transmitted RF signal arrives at MS 116 after passing through the wireless channel and reverse operations to those at BS 102 are performed. Down-converter 255 down-converts the received signal to baseband frequency and remove cyclic prefix block 260 removes the cyclic prefix to produce the serial time-domain baseband signal. Serial-to-parallel block 265 converts the time-domain baseband signal to parallel time domain signals. Size N FFT block 270 then performs an FFT algorithm to produce N parallel frequency-domain signals. Parallel-to-serial block 275 converts the parallel frequency-domain signals to a sequence of modulated data symbols. Channel decoding and demodulation block 280 demodulates and then decodes (i.e., performs LDPC decoding) the modulated symbols to recover the original input data stream.
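The core of the transmit and receive paths described above (Size N IFFT, add cyclic prefix, remove cyclic prefix, Size N FFT) can be illustrated with a short round trip. This is a minimal sketch under stated assumptions: an ideal channel, no channel coding, no up-conversion or down-conversion, and a direct DFT used in place of an FFT for brevity. The function names, the 8-symbol block, and the cyclic prefix length of 2 are illustrative choices, not values from this disclosure.

```python
import cmath


def dft(x, inverse=False):
    """Direct (I)DFT of a list of complex values; the inverse is scaled by 1/N."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def ofdm_transmit(symbols, cp_len):
    """IFFT block followed by the add-cyclic-prefix block."""
    time_samples = dft(symbols, inverse=True)
    return time_samples[-cp_len:] + time_samples  # prefix = copy of the tail


def ofdm_receive(samples, cp_len):
    """Remove-cyclic-prefix block followed by the FFT block."""
    return dft(samples[cp_len:])


# Eight QPSK-like frequency-domain symbols (illustrative values only).
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j]
tx = ofdm_transmit(qpsk, cp_len=2)
rx = ofdm_receive(tx, cp_len=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(rx, qpsk)))
```

Over an ideal channel the recovered symbols match the transmitted ones to within floating-point error, which is the property the reverse operations at MS 116 rely on.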
Each of base stations 101-103 implements a transmit path that is analogous to transmitting in the downlink to mobile stations 111-116 and implements a receive path that is analogous to receiving in the uplink from mobile stations 111-116. Similarly, each one of mobile stations 111-116 implements a transmit path corresponding to the architecture for transmitting in the uplink to base stations 101-103, such as for an efficient multiply-accumulate processor for software based radio, and implements a receive path corresponding to the architecture for receiving in the downlink from base stations 101-103, such as for an efficient multiply-accumulate processor for software based radio.
The channel decoding and demodulation block 280 decodes the received data. The channel decoding and demodulation block 280 includes a decoder configured to perform a low density parity check decoding operation. In some embodiments, the channel decoding and demodulation block 280 comprises one or more context-based operation reconfigurable instruction set processors (CRISPs), such as the CRISP processor(s) described in one or more of application Ser. No. 11/123,313, filed May 6, 2005 and entitled “Context-Based Operation Reconfigurable Instruction Set Processor And Method Of Operation”; U.S. Pat. No. 7,769,912, filed Jun. 1, 2005 and entitled “MultiStandard SDR Architecture Using Context-Based Operation Reconfigurable Instruction Set Processors”; U.S. Pat. No. 7,483,933, issued Jan. 27, 2009 and entitled “Correlation Architecture For Use In Software-Defined Radio Systems”; application Ser. No. 11/225,479, filed Sep. 13, 2005 and entitled “Turbo Code Decoder Architecture For Use In Software-Defined Radio Systems”; and application Ser. No. 11/501,577, filed Aug. 9, 2006 and entitled “Multi-Code Correlation Architecture For Use In Software-Defined Radio Systems”, all of which are hereby incorporated by reference into the present application as if fully set forth herein.
The WiXLE BB system is OFDM-based and requires the data symbols (post modulation) to be converted to time-domain by performing Inverse Discrete Fourier Transform (IDFT). The OFDM numerology of the WiXLE system includes $2^{9}$ sub-channels (namely, 512 sub-channels), which means that for every 512 symbols there exist 512 corresponding sub-carriers. The number of data symbols is a power of 2, and accordingly, a 512-point Inverse Fast Fourier Transform (IFFT) algorithm can be applied on the transmitter side instead of the IDFT, and the corresponding 512-point FFT is used on the receiver side instead of the DFT. The reason for using the FFT instead of the DFT in this case is the reduced implementation complexity: the FFT algorithm complexity is $O(N\log N)$, where N is the number of FFT points (that is, 512 points), while the DFT complexity is $O(N^{2})$. However, in the case of WiXLE, where the expected data rate is on the order of tens of gigabits per second (Gbps), which dictates an extremely short OFDM symbol time, the corresponding FFT implementation is extremely power efficient while still providing the highest BER performance. Several parameters impact the FFT power efficiency:
1. Input/Output data bit precision
2. Twiddle factor bit precision
3. Intermediate results bit precision
These parameters are not independent of each other. For example, the intermediate results precision is dependent on the FFT Radix and the input data precision.
The FFT IP 300 block is based on a configurable instruction set digital signal processor core called Context-based Reconfigurable Instruction Set Processor (CRISP™) architecture. A FFT CRISP™ IP block is described in reference to
The FFT CRISP™ block 300 is based on Instruction Set architecture and can be used for any algorithm requiring multiplications such as but not limited to complex finite impulse response (FIR) or infinite impulse response (IIR) filters and FFT. The FFT CRISP™ block 300 includes 16× data registers 305 (D0-D15) that, in the case of FFT Mode, are used to store the input data. For example, the FFT CRISP™ block 300 includes input terminals coupled to a data bus 310. The data bus 310 includes four data buses, each configured to transmit 64 bits of data at the same time.
The FFT CRISP™ block 300 includes sixteen Y stored data registers 315 (SD0-SD15) that, in the case of FFT Mode, store Twiddle Factor data.
The FFT CRISP™ block 300 includes sixteen Multiply-Accumulate (MAC) blocks 320 that are used to multiply and accumulate intermediate results. The FFT CRISP™ 300 can perform sixteen multiplications per cycle. Accordingly, in only two cycles the FFT CRISP™ 300 can perform eight complex multiplications. The MAC block 320 includes processing circuitry, which can be configured to execute any multiply-accumulate algorithm, such as an FFT process or a digital filter. The MAC block 320 includes a 16 input×16 output interface. Each of the sixteen inputs receives 16 bits at a time, and each of the sixteen outputs outputs 16 bits at a time. That is, a 16×16 MAC block 320 can receive or output 256 bits at once. In certain embodiments, the MAC block 320 is a 24×24 MAC or an 18×18 MAC.
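The relationship between sixteen multiplications per cycle and eight complex multiplications in two cycles follows if each complex multiplication is built from four real multiplications. The sketch below assumes that schoolbook decomposition; the function names and constants are illustrative and are not taken from this disclosure.

```python
def complex_multiply_real_macs(a_re, a_im, b_re, b_im):
    """Multiply (a_re + j*a_im) by (b_re + j*b_im) using four real multiplications.

    Each output component is one multiply followed by one multiply-accumulate.
    """
    re = a_re * b_re - a_im * b_im
    im = a_re * b_im + a_im * b_re
    return re, im


REAL_MULTS_PER_CYCLE = 16    # from the text: sixteen multiplications per cycle
REAL_MULTS_PER_COMPLEX = 4   # assumption: schoolbook complex multiplication


def complex_mults_in(cycles: int) -> int:
    """Number of complex multiplications that fit in the given number of cycles."""
    return cycles * REAL_MULTS_PER_CYCLE // REAL_MULTS_PER_COMPLEX


print(complex_multiply_real_macs(1, 2, 3, 4))  # (-5, 10)
print(complex_mults_in(2))                     # 8
```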
An FIR filter is an example of a digital filter that the MAC block 320 can implement. The FIR filter includes data and coefficients and receives a stream of inputs, such as from a shift register. The data can be received as a single bit or multiple bits. Each input is multiplied by a corresponding coefficient. The output of the FIR filter includes a cumulative sum of the products of each data input multiplied by its corresponding coefficient. For example, the output y can be represented by a convolutional equation: $y(m)=\sum_{i=0}^{k-1} x_{m-i}\,G_i$, where m is the number of inputs, k is the number of coefficients, and $G_i$ is the coefficient corresponding to the input $x_{m-i}$.
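The FIR sum of products can be written as a plain multiply-accumulate loop. This is a minimal sketch, assuming the convolution index is m − i and that inputs before the start of the stream are zero; `fir_mac` and its argument names are illustrative.

```python
def fir_mac(x, g):
    """FIR filter as repeated multiply-accumulate.

    y[m] = sum over i of x[m - i] * g[i], with x treated as zero before index 0.
    """
    k = len(g)
    y = []
    for m in range(len(x)):
        acc = 0
        for i in range(k):
            if m - i >= 0:
                acc += x[m - i] * g[i]  # one multiply-accumulate step
        y.append(acc)
    return y


print(fir_mac([1, 2, 3, 4], [1, 1]))  # [1, 3, 5, 7]
```

Each inner-loop iteration corresponds to one multiply-accumulate operation of the kind a MAC block performs.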
The FFT CRISP™ block 300 includes a second input terminal 325 coupled to a P_Bus bus, which is a program bus for the instructions, for receiving twiddle factors. For example, the second input terminal 325 can receive 16 bits of data at one time.
The DIF FFT MAC algorithm 400 receives 256 bits of input x[0] through x[15] through an input interface of the MAC block 320, and then outputs 256 bits of outputs X[0] through X[15] through an output interface of the MAC block 320. That is, each input 405 corresponds to an output 410, shown by a horizontal line 415 from the input 405 to the output 410 (for example, from input x[0] 405a to output X[0] 410a). The DIF FFT MAC algorithm 400 includes multiple MAC butterfly algorithms.
One butterfly algorithm includes the two horizontal lines 415a and 415i corresponding to the first input/output combination of x[0] and X[0], and the ninth input/output combination of x[8] 405i and X[1] 410i; and the butterfly algorithm includes two criss-crossed diagonal lines 425a and 425i. From the perspective of the input x[0] 405a, the line 425a slopes rightward, towards the output, and the line 425i slopes leftward, away from the output. Similarly, from the perspective of the input x[8] 405i, the line 425i slopes rightward, towards the output, and the line 425a slopes leftward, away from the output. In the butterfly algorithm, operations flow from input to output, rightwardly. The horizontal line 415a includes a first intersection 420a with a rightward line 425a connected to the horizontal line 415i. Each intersection with a rightward sloping line represents a multiplication operation. Accordingly, at the first intersection 420a, the input data x[0] 405a is multiplied by the input data x[8] 405i. The horizontal line 415i includes a first intersection 420i with a rightward line 425i connected to the horizontal line 415a. Accordingly, at the first intersection 420i, the input data x[8] 405i is multiplied by the input data x[0] 405a. Next, in the butterfly algorithm, the horizontal line 415a includes a second intersection 430a with the line 425i, which slopes leftward with respect to the input x[0]. Each intersection with a leftward sloping line represents an addition operation. Accordingly, at the second intersection 430a, the product of the input x[8] with input x[0] is accumulated with (that is, added to) the input x[0]. More particularly, the second intersection 430a can be represented by the expression (x[0])+(x[8]×x[0]). The second intersection 430i includes a twiddle factor ($W_N^k$), namely $W_{16}^{0}$. A twiddle factor is a coefficient multiplied by the results of the operation performed at an intersection 420, 430. Twiddle factors are defined in Equation 1:
The second intersection 430i can be represented by the expression $W_{16}^{0}\,((x[8])+(x[0]\times x[8]))$, where
The DIF FFT MAC algorithm 400 includes $\log_2 N$ stages and N multiplications in each stage. The DIF FFT MAC algorithm 400 uses half as many MAC blocks per stage as the number of multiplications in each stage (that is, N/2 MAC blocks per stage). The DIF FFT MAC algorithm 400 is based on powers of two. For example, in
Each input/output combination generates an output from the MAC block 320 that is a binary number corresponding to the output and, in reverse order, corresponding to the ordinal of the input. Special attention should be taken to re-order the FFT output offset locations that are in bit-reversed mode.
For example, the input/output combination of x[0] 405a and X[0] 410a generates the output 0000 corresponding to the X[0] output, and by reversing the bits of the output, the result is binary 0000 corresponding to the ordinal x[0] input. As another example, the input/output combination of x[1] and X[8] generates the output 1000 (namely, the number 8 in binary mode) corresponding to the ordinal X[8] output, and by reversing the bits of the output, the result is binary 0001 corresponding to the ordinal x[1] input. As a further example, the input/output combination of x[5] and X[10] generates the output 1010 (namely, the number 10 in binary mode) corresponding to the ordinal X[10] output, and by reversing the bits of the output, the result is binary 0101 corresponding to the ordinal x[5] input. Further examples of this correlation are described in Table 1 below:
The 8-point Radix-2 FFT architecture 500 includes eight inputs (x[0]-x[7]), eight outputs (X[0]-X[7]), and three stages. The 8-point DIF FFT MAC algorithm 400 includes multiple (for example, four) MAC butterfly algorithms.
In the 8-point FFT DIF architecture 500, the butterfly algorithm of the first stage (Stage 0) spans half of the inputs (i.e., four input separation); the butterfly algorithm of the second stage (Stage 1) spans one-fourth of the inputs (i.e., two input separation); and the butterfly algorithm of the third stage (Stage 2) spans one-eighth of the inputs (i.e., consecutive inputs or one input separation).
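The halving of the butterfly span from stage to stage can be seen in a compact radix-2 decimation-in-frequency routine. This is a sketch under stated assumptions: it uses the textbook DIF butterfly (sum on the upper line, twiddled difference on the lower line) and the conventional twiddle factor exp(−j2πk/N), either of which may differ in detail from the MAC formulation and Equation 1 of this disclosure. All function names are illustrative.

```python
import cmath


def twiddle(n, k):
    """Conventional twiddle factor exp(-j*2*pi*k/n) (assumed definition)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def stage_spans(n):
    """Input separation of the butterflies in each stage, first stage first."""
    spans = []
    span = n // 2
    while span >= 1:
        spans.append(span)
        span //= 2
    return spans


def dif_fft_bit_reversed(x):
    """In-place radix-2 DIF FFT; the result is left in bit-reversed order."""
    data = list(x)
    n = len(data)
    for span in stage_spans(n):
        for start in range(0, n, 2 * span):
            for k in range(span):
                a = data[start + k]
                b = data[start + k + span]
                data[start + k] = a + b                                  # upper output
                data[start + k + span] = (a - b) * twiddle(2 * span, k)  # lower output
    return data


print(stage_spans(8))  # [4, 2, 1]
```

For eight points the spans are 4, 2, and 1, matching the separations of Stage 0, Stage 1, and Stage 2 described above.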
One butterfly algorithm includes the two horizontal lines 515a and 515e corresponding to the first input/output combination of x[0] and X[0] and to the fifth input/output combination of x[4] 505e and X[1] 510e, respectively. The butterfly algorithm includes two crisscrossed diagonal lines 525a and 525e that each intersect both horizontal lines 515a and 515e. Each horizontal line includes at least one twiddle factor ($W_N^k$), as defined in Equation 1. Accordingly, at the first intersection 520a where line 525a intersects horizontal line 515a, the input data x[0] 505a is multiplied by the twiddle factor $W_8^0$. The horizontal line 515e includes a first intersection 520e with a rightward line 525e that connects to the horizontal line 515a at the second intersection 530a.
At the first intersection 520e, the input data x[4] 505e is multiplied by the twiddle factor $W_8^0$. The horizontal line 515a includes a second intersection 530a, where an accumulation operation occurs. At the second intersection 530a, the product at intersection 520a (namely, the twiddle factor multiplied by input x[0]) is accumulated with the product at the intersection 520e (namely, the twiddle factor and input x[4]). The accumulated sum at the second intersection 530a is multiplied by the twiddle factor $W_8^0$ at the second intersection 530a. At the second intersection 530e, where line 525a intersects horizontal line 515e, the product at intersection 520e (namely, input data x[4] 505e multiplied by the twiddle factor $W_8^0$) is accumulated with the product at intersection 520a. The accumulated sum at the second intersection 530e is multiplied by the twiddle factor $W_8^0$ at the second intersection 530e.
The intermediate results of Stage 0 are inputs to Stage 1. The intermediate results of Stage 0 include the results at the second intersections 530a and 530e. The results at the second intersection 530a can be expressed as $((x[0]\times W_8^0)+(x[4]\times W_8^0))\times W_8^0$, and the results at the second intersection 530e can be expressed as $((x[4]\times W_8^0)+(x[0]\times W_8^0))\times W_8^0$.
Stage 1 includes four butterfly MAC algorithms. One of the Stage 1 butterfly algorithms includes two horizontal lines 515a and 515c, a diagonal line 545a connecting the two horizontal lines 515a and 515c at a third intersection 540a and fourth intersection 550c, and a diagonal line 545c connecting the two horizontal lines 515a and 515c at a third intersection 540c and fourth intersection 550a.
The single stage butterfly MAC algorithm 600, 601 includes 2 points. That is, the 2 point MAC algorithm 600, 601 includes two horizontal lines 615a and 615b corresponding to the first input 605 and output 610 combination of x[0] and X[0] and to the second input/output combination of x[1] and X[1], respectively. The butterfly MAC algorithm 600 includes two crisscrossed diagonal lines 625a and 625b that each intersect both horizontal lines 615a and 615b.
In the first single stage butterfly MAC algorithm 600, a first multiplication operation generates a first product, wherein the input x[0] is multiplied by the twiddle factor $W_N^i$ at the first intersection on the line 615a with the line 625a. At the same time, a second multiplication operation generates a second product, wherein the input x[1] is multiplied by the twiddle factor $W_N^j$ at the first intersection on line 615b with the line 625b. Next, a first accumulate operation generates a first sum of the second product with the first product, which can be expressed as $X[0]=(x[1]\times W_N^j)+(x[0]\times W_N^i)$. The first accumulate operation occurs at the second intersection on line 615a with the line 625b. At the same time, a second accumulate operation generates a second sum of the first product with the second product. The second accumulate operation occurs at the second intersection on line 615b with the line 625a, where −1 is the multiplier. The output of the second x[1] and X[1] input/output combination can be expressed as $X[1]=(x[0]\times W_N^i)-(x[1]\times W_N^j)$.
In the second single stage butterfly MAC algorithm 601, a first multiplication operation generates a first product, wherein the input x[0] is multiplied by the twiddle factor $-W_N^i$ at the first intersection on the line 615a with the line 625a. At the same time, a second multiplication operation generates a second product, wherein the input x[1] is multiplied by the twiddle factor $-W_N^j$ at the first intersection on line 615b with the line 625b. Next, a first accumulate operation generates a first sum of the second product with the first product. The first accumulate operation occurs at the second intersection on line 615a with the line 625b, where both lines 615a and 625b include a −1 multiplier. The output 610 can be expressed as $X[0]=((x[1]\times(-W_N^j))+(x[0]\times(-W_N^i)))\times(-1)$. At the same time, a second accumulate operation generates a second sum of the first product with the second product. The second accumulate operation occurs at the second intersection on line 615b with the line 625a, where −1 is the multiplier. The output of the second x[1] and X[1] input/output combination can be expressed as $X[1]=((x[0]\times(-W_N^i))\times(-1))+(x[1]\times(-W_N^j))$.
The inputs x[0] and x[1] are the same for both single stage butterfly MAC algorithms 600 and 601. The outputs X[0] and X[1] are equivalent for both single stage butterfly MAC algorithms 600 and 601. The twiddle factors for each of the single stage butterfly MAC algorithms 600 and 601 include opposite signs.
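The equivalence of the two single stage butterflies can be checked numerically. The sketch below assumes the output expressions X[0] = x[1]·Wj + x[0]·Wi and X[1] = x[0]·Wi − x[1]·Wj for algorithm 600, and the sign-flipped twiddle factors with −1 multipliers for algorithm 601; the function names and test values are illustrative.

```python
def butterfly_600(x0, x1, wi, wj):
    """Single stage butterfly with twiddle factors wi and wj."""
    p0 = x0 * wi           # first product
    p1 = x1 * wj           # second product
    return p1 + p0, p0 - p1


def butterfly_601(x0, x1, wi, wj):
    """Same butterfly using negated twiddle factors and -1 multipliers."""
    p0 = x0 * (-wi)        # first product with opposite-sign twiddle
    p1 = x1 * (-wj)        # second product with opposite-sign twiddle
    out0 = (p1 + p0) * -1  # both branches carry a -1 multiplier
    out1 = p0 * -1 + p1    # only the branch from the first product carries -1
    return out0, out1


x0, x1 = 1 + 2j, 3 - 1j
wi, wj = 0.5 - 0.5j, -1j
a = butterfly_600(x0, x1, wi, wj)
b = butterfly_601(x0, x1, wi, wj)
print(all(abs(u - v) < 1e-12 for u, v in zip(a, b)))
```

Both variants produce the same pair of outputs for the same inputs, which is the property stated above.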
Each data value stored in the memory consists of a complex number, including a real part and an imaginary part. That is, each input data 805 x[0]-x[7] includes a real part and an imaginary part. Each twiddle factor stored in memory includes a real part and an imaginary part. A first column block of the memory arrangement 700 includes a 1×N array of input data 805. A second column block of the memory arrangement 700 includes a 1×N array of twiddle factors 810 for the first stage, Stage 0. A third column block of the memory arrangement 700 includes a 1×N array of twiddle factors 815 for the second stage, Stage 1. A fourth column block of the memory arrangement 700 includes a 1×N array of twiddle factors 820 for the third stage, Stage 2. In certain embodiments, the input data x[0] 505a is input to a corresponding block 805a of memory in the memory arrangement 700.
The complex data arrangement 800 includes a first memory block 805a (Memory A), a second memory block 805b (Memory B), a third memory block 805c (Memory C), and a fourth memory block 805d (Memory D). The four memory blocks 805a-d can be accessed in parallel by the FFT CRISP™ core. Each of the memory blocks 805a-805d has a port size that supports read/write of two complex data values (for example, a port size supporting read/write of four real data values). For example, in one instant, the FFT CRISP can read in the input data x[0](0,1) as the real part of x[0] into the 0 position and can read in the imaginary part into the 1 position of the Memory A block 805a. In the same instant, the FFT CRISP can read in the input data x[1] (2,3) as the real part of x[1] into the 2 position and can read in the imaginary part of x[1] into the 3 position of the Memory A block 805a.
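The interleaved layout of one memory block, with real parts at even positions and imaginary parts at odd positions, can be sketched as follows. The helper names are hypothetical, and the sketch models only the position ordering within a single block, not the bit widths or the distribution of data across Memory A through Memory D.

```python
def pack_block(values):
    """Pack complex values as [re0, im0, re1, im1, ...] position by position."""
    words = []
    for v in values:
        words.extend([v.real, v.imag])
    return words


def unpack_block(words):
    """Inverse of pack_block: rebuild complex values from interleaved positions."""
    return [complex(words[i], words[i + 1]) for i in range(0, len(words), 2)]


# Two complex values per port access, i.e. four real values (illustrative data).
block_a = pack_block([1 + 2j, 3 + 4j])
print(block_a)                # [1.0, 2.0, 3.0, 4.0]
print(unpack_block(block_a))  # [(1+2j), (3+4j)]
```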
In the example shown, the FFT CRISP processes four butterfly MAC algorithms per cycle. At a time t0, the FFT CRISP machine begins processing N number of input data points from the input data stored in memory blocks, for example, Memory A0 805a, Memory B0 805b, Memory C0 805c, Memory D0 805d. That is, at time t0, a first cycle begins, in which the FFT CRISP reads four values (a0, b0, c0, d0) from each memory block. For example, the value a0 can include the real part of the input x[0]; the value b0 can include the imaginary part of the input x[0]; the value c0 can include the real part of the input x[1]; the value d0 can include the imaginary part of the input x[1]. Also at time t0, the FFT CRISP machine reads in four values from the Memory block B0 805b for the inputs x[2] and x[3]; reads in four values from the Memory block C0 805c for the inputs
and reads in four values from the Memory block D0 805d for inputs
Also at time t0, to process the first cycle of a butterfly MAC, the FFT CRISP machine reads in twiddle factors from a memory block 905a-b separate from the input data memory blocks 805a-d, including WN0,0 and WN0,1 from Memory block WA 905a, and including WN0,2 and WN0,3 from Memory block WB 905b. The twelve values read in during the first cycle enable the FFT CRISP to perform four butterfly MAC algorithms: a first butterfly of x[0], x[8], and WN0,0; a second butterfly of x[1], x[9], and WN0,1; a third butterfly of x[2], x[10], and WN0,2; and a fourth butterfly of x[3], x[11], and WN0,3.
During the second cycle, the intermediate results of the first four butterflies are written back to the same memory address from which the input values were read in the previous cycle. For example, the intermediate results of the first butterfly (namely, using x[0], x[8], and WN0,0) are stored in Memory A0 805a including bits 0-63. The intermediate results of the second butterfly (namely, using x[1], x[9], and WN0,1) are stored in Memory B0 805b including bits 64-127. The intermediate results of the third butterfly (namely, using x[2], x[10], and WN0,2) are stored in Memory C0 805c including bits 128-191. The intermediate results of the fourth butterfly (namely, using x[3], x[11], and WN0,3) are stored in Memory D0 805d including bits 192-255.
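The read, compute, and write-back pattern for the first stage can be sketched as a simple schedule. This is a minimal sketch, assuming a 16-point transform, four butterflies per cycle, and first-stage pairs of the form (i, i + N/2) as in the butterflies named above; it does not model the memory block or bit-range assignment. The function names and the placeholder butterfly are illustrative.

```python
def stage0_schedule(n, butterflies_per_cycle=4):
    """Group the first-stage index pairs (i, i + n/2) into per-cycle batches."""
    half = n // 2
    pairs = [(i, i + half) for i in range(half)]
    return [pairs[c:c + butterflies_per_cycle]
            for c in range(0, half, butterflies_per_cycle)]


def run_stage0_in_place(data, butterfly):
    """Apply the first stage in place: results overwrite the locations read."""
    for cycle in stage0_schedule(len(data)):
        results = [butterfly(data[i], data[j]) for i, j in cycle]  # read phase
        for (i, j), (ri, rj) in zip(cycle, results):               # write-back phase
            data[i], data[j] = ri, rj
    return data


schedule = stage0_schedule(16)
print(schedule[0])  # [(0, 8), (1, 9), (2, 10), (3, 11)]
print(schedule[1])  # [(4, 12), (5, 13), (6, 14), (7, 15)]

# Placeholder butterfly (sum and difference only), used purely for illustration.
data = run_stage0_in_place(list(range(16)), lambda a, b: (a + b, a - b))
```

Because each batch writes only to the addresses it has just read, no additional buffer is needed for the intermediate results.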
To complete a second set of four butterfly MAC algorithms, at a second time for the second cycle, the FFT CRISP machine reads in input data x[4] and x[5] into the Memory A0 block 805a; reads in input data x[6] and x[7] into the Memory B0 block 805b; reads in input
into Memory C0 block 805c; reads in input
into Memory D0 block 805d. The FFT CRISP machine reads in twiddle factors, including WN0,4 and WN0,5 from Memory block WA, and including WN0,6 and WN0,7 from Memory block WB.
During the second cycle, the FFT CRISP writes values to the memory block a1. For example, the inputs
are written to the memory block a1; the inputs
are written to the memory block a1; the inputs x[2] and
are written to the memory block a1; and inputs
are written to the memory block a1. After the multiplication and accumulation of
In certain embodiments, the Scheduling 900 corresponds to the 16-point DIF Radix-2 FFT algorithm. That is, during the first cycle, x[0] 505a and x[8] are read from the Memory block a0. As a result, data bits cannot be written to a data address that is already in use. These inputs x[0] 505a and x[8] are multiplied by WN0 according to the architecture in
The FFT CRISP™ can also process higher radices (for example, Radix-4).
An advantage of using the last two stages in the pipeline to begin processing a subsequent set of input data (also referred to as the rest of the processing) is that mathematically (as shown in Equations 2 and 3) the last two stages in the DIF FFT (S9, S10 in
As a result, the intermediate results of the ninth stage (also referred to in this example as the antepenultimate stage) S8 are input to the tenth stage 1010j, 1020a where multiplication by
occurs. The intermediate results from the tenth stage (also referred to in this example as the penultimate stage) S9 are then input to the eleventh, last stage S10 1010k, 1020b where multiplication by
occurs.
In certain embodiments, the FFT CRISP includes a bit reverser 1050 configured to receive the output from the last stage 1010k, 1020b (for example, S10) and to reorder the bits of that output. In operation, in response to receiving the output X[1]=1000 as shown in the second row of Table 1, the bit reverser 1050 performs bit reversal, outputting 0001.
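The bit reversal performed on the last-stage output can be illustrated with a short sketch. The function names bit_reverse and reorder_bit_reversed are hypothetical, and the sketch assumes a fixed index width of log2(N) bits; it models only the reordering, not the bit reverser 1050 hardware.

```python
def bit_reverse(index, width):
    """Return `index` with its lowest `width` bits in reversed order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def reorder_bit_reversed(values):
    """Return values reordered so that position i holds values[bit_reverse(i)].

    Assumes len(values) is a power of 2.
    """
    n = len(values)
    width = n.bit_length() - 1
    return [values[bit_reverse(i, width)] for i in range(n)]
```

With a 4-bit index, bit_reverse(0b1000, 4) gives 0b0001, matching the 1000 to 0001 example above.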
The program set is based on a fully flexible VLIW microcode instruction set.
1. The Program Register (Pr_data) is 512 bits long and routes data between the memory and the appropriate X/Y register. The Pr_data also routes the Input/Output data to or from the accumulators. Table 2 includes a legend for reading the FFT CRISP Programming Model 1100.
In the FFT Programming Example, the programming code for the FFT Flag 1305 represents parallelism enabled when the bit is a 1 (as shown) and represents parallelism disabled when the bit is a 0. When parallelism is enabled, the last two stages of the FFT CRISP pipeline use the hardware accelerator 1020 and the pipeline schedule 1001, but when parallelism is disabled, the last two stages of the FFT CRISP pipeline use the pipeline schedule 1000.
In the FFT Programming Example, the programming code for the GP LOOP0 Init 1310 indicates whether to loop the corresponding portion of the code again.
In the FFT Programming Example, the programming code for the Scale 1315 indicates how much to scale the intermediate results of a processing stage before truncating them. Truncating prevents 32-bit saturation of a 16×16 MAC block. Scaling prevents important data from being lost in the truncation that follows. For example, a code of 0 indicates not to scale; a code of 1 indicates to divide by (also referred to as scale by) 2; a code of 2 (as shown) indicates to divide by 4; and a code of 3 indicates to divide by 8. For example, during FFT processing, the input x[0] is multiplied by the input x[8] in a butterfly MAC algorithm. The product of the inputs x[0] and x[8] is multiplied by a twiddle factor W_16^0. The product of the twiddle factor W_16^0 and the two inputs x[0] and x[8] is input to the second accumulator (adder), and then the accumulation result is scaled by the specified scale factor. The scaled result is a 32-bit value that is then truncated by 16 bits, resulting in a 16-bit scaled-truncated result that is input to the next stage MAC.
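The scale-then-truncate step can be sketched as follows. This is a minimal sketch under stated assumptions: the scale code is applied as an arithmetic right shift (codes 0 to 3 dividing by 1, 2, 4, and 8), and the truncation drops the 16 low-order bits. The disclosure does not state which 16 bits are kept or how rounding is handled, so both choices, along with the name scale_and_truncate, are assumptions made for illustration.

```python
# Assumed mapping from scale code to divisor, following the description above.
SCALE_DIVISORS = {0: 1, 1: 2, 2: 4, 3: 8}

def scale_and_truncate(accumulator, scale_code):
    """Scale an accumulation result, then truncate it by 16 bits.

    Assumption: scaling is an arithmetic right shift by `scale_code` bits,
    and truncation discards the 16 low-order bits of the scaled value.
    """
    if scale_code not in SCALE_DIVISORS:
        raise ValueError("scale code must be 0, 1, 2, or 3")
    scaled = accumulator >> scale_code   # divide by 2**scale_code
    return scaled >> 16                  # drop the 16 low-order bits
```

For instance, with a scale code of 2 an accumulator value of 2^20 is first divided by 4 and then reduced to 4 after the 16-bit truncation.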
In the Program Loop Continuation, the first four lines 1605 represent processing of four stages (for example, S0-S3); the second four lines 1610 represent processing of four stages (for example, S4-S7); the third four lines 1615 represent processing of four stages (for example, S8-S11); and the fourth four lines 1620 represent processing of four stages (for example, S12-S15). That is, each set of four lines 1605, 1610, 1615, 1620 can include a program fixed instruction, similar to the Program Fixed Instruction 1510. In hex mode, the codes 1625 and 1630 each represent a loop indicator.
In the table, the Taps 1720 column includes the number of taps, also referred to as the number of points N. The Performance 1730 column includes the number of cycles to complete the process described in the Application 1710 column. In the Application of a Real FIR filter, the number of taps is N=16. The MAC processor machine (which can include the Virtex-5 FPGA) includes one MAC block for each tap. In the FIR filter process, each MAC block includes one tap, which receives real numbers as inputs. That is, the tap of each MAC receives a real-valued input data value and a real-valued coefficient. The MAC processor machine executes an FIR filter process by multiplying each input data value by its corresponding coefficient within the MAC block that received them, and then accumulating the N products in an adder. The MAC processor machine outputs the results from the adder after
cycles (in the example shown,
In the example shown, a 16-tap Real FIR Filter MAC processor machine can perform 16 multiplications per cycle.
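The per-tap multiply followed by accumulation in an adder can be sketched for one output sample as follows. This is an illustrative software model only; the function name real_fir_output is hypothetical, and the sketch does not model the cycle timing of the MAC processor machine.

```python
def real_fir_output(samples, coefficients):
    """Compute one real FIR output: one multiply per tap, then one accumulation.

    `samples` and `coefficients` must have the same length N (one entry per tap).
    """
    if len(samples) != len(coefficients):
        raise ValueError("need exactly one coefficient per tap")
    # Each product models the work of one MAC block (one tap).
    products = [s * c for s, c in zip(samples, coefficients)]
    # The adder accumulates the N products into a single output.
    return sum(products)
```

With N=16 taps, the list of products has 16 entries, corresponding to the 16 multiplications described above.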
In the Application of a Complex FIR filter, the number of complex taps is N=4. The MAC processor machine (which can include the Virtex-5 FPGA) includes one complex MAC block for each complex tap. In the FIR filter process, each MAC block includes one tap, which receives complex numbers as inputs. That is, the tap of each complex MAC receives one complex input data and a complex coefficient; the complex input data includes a real number portion and imaginary number portion. The corresponding complex coefficient includes a real number portion and imaginary number portion. The MAC processor machine executes an FIR filter process by multiplying each input data by a coefficient corresponding to that input data within the complex MAC block that received the input data and corresponding coefficient, and by next accumulating the N products in an adder. The complex MAC block includes four butterfly algorithm MAC blocks. A first of the four butterfly algorithm MAC blocks multiplies the real number portion of the input data by the real number portion of the coefficient. A second of the four butterfly algorithm MAC blocks multiplies the real number portion of the input data by the imaginary number portion of the coefficient. A third of the four butterfly algorithm MAC blocks multiplies the imaginary number portion of the input data by the real number portion of the coefficient. A fourth of the four butterfly algorithm MAC blocks multiplies the imaginary number portion of the input data by the imaginary number portion of the coefficient.
The MAC processor machine outputs the results from the adder after
cycles (in the example shown,
In the example shown, a 4-tap Complex FIR Filter MAC processor machine can perform 4 complex multiplications per cycle. Accordingly, the 4-tap Complex FIR Filter MAC processor machine can perform 16 complex multiplications in 4 cycles.
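The decomposition of one complex multiplication into four real multiplications can be sketched as follows. The four products correspond to the four MAC blocks described above; how they are combined (real part = rr - ii, imaginary part = ri + ir) is the standard complex-product identity. The function names complex_multiply_4 and complex_fir_output are hypothetical.

```python
def complex_multiply_4(xr, xi, cr, ci):
    """Multiply (xr + j*xi) by (cr + j*ci) using four real multiplications."""
    rr = xr * cr   # real part of input      x real part of coefficient
    ri = xr * ci   # real part of input      x imaginary part of coefficient
    ir = xi * cr   # imaginary part of input x real part of coefficient
    ii = xi * ci   # imaginary part of input x imaginary part of coefficient
    return (rr - ii, ri + ir)

def complex_fir_output(samples, coefficients):
    """Accumulate the complex products of (real, imag) sample/coefficient pairs."""
    acc_r, acc_i = 0, 0
    for (xr, xi), (cr, ci) in zip(samples, coefficients):
        pr, pi = complex_multiply_4(xr, xi, cr, ci)
        acc_r += pr
        acc_i += pi
    return (acc_r, acc_i)
```

A 4-tap complex filter in this model therefore uses 16 real multiplications per output, which matches the count of a 16-tap real filter.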
In the Complex FFT Applications, the performance 1730 is related to the number of taps by the expression (N/8)×(Log2(N)−2). Accordingly, the 512-Point Complex FFT application has N=512 complex taps and performs complex multiplications in (N/8)×(Log2(N)−2)=64×(9−2)=448 cycles.
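The cycle-count expression can be checked with a short calculation. The function name fft_cycles is hypothetical. One reading of the expression, consistent with the schedule of four butterflies per cycle described earlier, is that each stage has N/2 butterflies and therefore takes N/8 cycles, and that the subtraction of 2 reflects the last two stages being handled outside the per-stage count; that interpretation is an assumption, not a statement from the table.

```python
def fft_cycles(n):
    """Evaluate (N/8) x (log2(N) - 2) for a power-of-2 FFT size N >= 8."""
    if n < 8 or n & (n - 1) != 0:
        raise ValueError("n must be a power of 2 and at least 8")
    stages = n.bit_length() - 1      # log2(n) for a power of 2
    return (n // 8) * (stages - 2)
```

For N=512 this evaluates to 64 × 7 = 448, in agreement with the figure quoted above.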
Although the present disclosure has been described with examples, various changes and modifications may be suggested to one skilled in the art. It is intended that the present disclosure encompass such changes and modifications as fall within the scope of the appended claims.
None of the description in the present application should be read as implying that any particular element, step, or function is an essential element which must be included in the claim scope: the scope of patented subject matter is defined only by the allowed claims. Moreover, none of these claims are intended to invoke paragraph six of 35 USC §112 unless the exact words “means for” are followed by a participle.
The present application claims priority to U.S. Provisional Patent Application Ser. No. 61/759,891, filed Feb. 1, 2013, entitled “EFFICIENT MULTIPLY-ACCUMULATE PROCESSOR FOR SOFTWARE DEFINED RADIO” and U.S. Provisional Patent Application Ser. No. 61/847,326, filed Jul. 17, 2013, entitled “EFFICIENT MULTIPLY-ACCUMULATE PROCESSOR FOR SOFTWARE DEFINED RADIO.” The contents of the above-identified patent documents are incorporated herein by reference.
Number | Date | Country
---|---|---
61759891 | Feb 2013 | US
61847326 | Jul 2013 | US