Programmable logic devices, or PLDs, are general-purpose circuits that can be programmed by an end user to perform one or more selected functions. Complex PLDs typically include a number of programmable logic elements and some programmable routing resources. Programmable logic elements have many forms and many names, such as CLBs, logic blocks, logic array blocks, logic cell arrays, macrocells, logic cells, and functional blocks. Programmable routing resources also have many forms and many names.
FPGA resources can be programmed to implement many digital signal-processing (DSP) functions, from simple multipliers to complex microprocessors. For example, U.S. Pat. No. 5,754,459, issued May 19, 1998, to Telikepalli, and incorporated by reference herein, teaches implementing a multiplier using general-purpose FPGA resources (e.g., CLBs and programmable interconnect). Unfortunately, DSP circuits may not make efficient use of FPGA resources, and may consequently consume more power and FPGA real estate than is desirable. For example, in the Virtex family of FPGAs available from Xilinx, Inc., implementing a 16×16 multiplier requires at least 60 CLBs and a good deal of valuable interconnect resources.
In place of RAM blocks 102 of
FPGA 150 does an excellent job of supporting DSP functionality. Complex functions must make use of general-purpose routing and logic, however, and these resources are not optimized for signal processing. Complex DSP functions may therefore be slower and more area intensive than is desirable. There is therefore a need for DSP circuitry that addresses consumer demand for ever faster speed performance without sacrificing the flexibility afforded by programmable logic.
The present invention is directed to systems and methods that address the need for fast, flexible, low-power DSP circuitry. The following discussion is divided into five sections, each detailing specific methods and systems for providing improved DSP performance.
Embodiments of the present invention include the combination of modular DSP circuitry to perform one or more mathematical functions. A plurality of substantially identical DSP sub-modules are substantially directly connected together to form a DSP module, where each sub-module has dedicated circuitry with at least a switch, for example, a multiplexer, connected to an adder. The DSP module may be further expanded by substantially directly connecting additional DSP sub-modules. Thus a larger or smaller DSP module may be constructed by adding or removing DSP sub-modules. The DSP sub-modules have substantially dedicated communication lines interconnecting the DSP sub-modules.
In an exemplary embodiment of the present invention, an integrated circuit (IC) includes a plurality of substantially directly connected or cascaded modules. One embodiment provides that the control input to the switch connected to an adder in the DSP sub-module may be modified at the operating speed of other circuitry in the IC, hence changing the inputs to the adder over time. In another embodiment a multiplier output and a data input bypassing the multiplier are connected to the switch, thus the function performed by the DSP sub-module may change over time.
A programmable logic device (PLD) in accordance with an embodiment includes DSP slices, where “slices” are logically similar circuits that can be cascaded as desired to create DSP circuits of varying size and complexity. Each DSP slice includes a plurality of operand input ports and a slice output port, all of which are programmably connected to general routing and logic resources. The operand ports receive operands for processing, and a slice output port conveys processed results. Each slice may additionally include a feedback port connected to the respective slice output port, to support accumulate functions in this embodiment, and a cascade input port connected to the output port of an upstream slice to facilitate cascading.
One type of cascade-connected DSP slice includes an arithmetic circuit having a product generator feeding an adder. The product generator has a multiplier port connected to a first of the operand input ports, a multiplicand port connected to a second of the operand input ports, and a pair of partial-product ports. The adder has first and second addend ports connected to respective ones of the partial-product ports, a third addend port connected to the cascade input port, and a sum port. The adder can therefore add the partial products, to complete a multiply, or add the partial products to the output from an upstream slice. The cascade and accumulate connections are substantially direct (i.e., they do not traverse the general purpose interconnect) to maximize speed performance, reduce demand on the general purpose interconnect, and reduce power.
One embodiment of the present invention includes an integrated circuit including: a plurality of digital signal processing (DSP) elements, including a first DSP element and a second DSP element, where each DSP element has substantially identical structure and each DSP element has a switch connected to a hardwired adder; and a dedicated signal line connecting the first DSP element to the second DSP element. Additionally, the switch includes a multiplexer that selects the inputs into the hardwired adder.
Another embodiment of the present invention includes an integrated circuit including: a plurality of configurable function blocks; programmable interconnect resources connecting some of the plurality of configurable function blocks; a plurality of digital signal processing (DSP) elements, including a first DSP element and a second DSP element, where each DSP element has substantially identical structure and includes a switch connected to a hardwired adder; and a dedicated signal line connecting the first DSP element to the second DSP element, where the dedicated signal line does not include any of the programmable interconnect resources.
Yet another embodiment of the present invention includes an integrated circuit having: a plurality of digital signal processing (DSP) elements, including a first DSP element and a second DSP element, each DSP element having substantially identical structure and each DSP element including a hardwired multiplier; and a dedicated signal line connecting the first DSP element to the second DSP element.
A further embodiment of the present invention includes a DSP element in an integrated circuit having: a first switch; a multiplier circuit connected to the first switch; a second switch, the second switch connected to the multiplier circuit; and an adder circuit connected to the second switch.
In one embodiment of the present invention the contents of the one or more mode registers can be altered during device operation to change DSP functionality. The mode registers connect to the general interconnect, i.e., the programmable routing resources in a PLD, and hence can receive control signals that alter the contents of the mode registers, and therefore the DSP functionality, without needing to change the contents of the configuration memory of the device. In one embodiment, the mode registers may be connected to a control circuit in the programmable logic, and change may take on the order of nanoseconds or less, while reloading of the configuration memory may take on the order of microseconds or even milliseconds depending upon the number of bits being changed. In another embodiment the one or more mode registers are connected to one or more embedded processors such as in the Virtex II Pro from Xilinx Inc. of San Jose, Calif., and hence, the contents of the mode registers can be changed at substantially the clock speed of the embedded processor(s).
Changing DSP resources to perform different DSP algorithms without writing to configuration memory is referred to herein as “dynamic” control to distinguish programmable logic that can be reconfigured to perform different DSP functionality by altering the contents of the configuration memory. Dynamic control is preferred, in many cases, because altering the contents of the configuration memory can be unduly time consuming. Some DSP applications do not require dynamic control, in which case DSP functionality can be defined during loading (or reloading) of the configuration memory.
In other embodiments the FPGA configuration memory can be reconfigured in conjunction with dynamic control to change the DSP functionality. In one embodiment, the difference between changing DSP functionality by dynamic control of the mode register and changing it by reloading the FPGA configuration memory is the speed of the change: reloading the configuration memory takes more time than dynamic control. In an alternative embodiment, with the conventional configuration memory cell replaced with a separately addressable read/write memory cell, there may be little difference, and dynamic control, reconfiguration, or both may be done at substantially the same speed.
An embodiment of the present invention includes an integrated circuit having a DSP circuit. The DSP circuit includes: an input data port for receiving data at an input data rate; a multiplier coupled to the input port; an adder coupled to the multiplier by first programmable routing logic; and a register coupled to the first programmable routing logic, where the register is capable of configuring different routes in the first programmable routing logic on at least a same order of magnitude as the input data rate.
Another embodiment of the present invention includes a method for configuring a DSP logic circuit on an integrated circuit where the DSP logic circuit has a multiplier connected to a switch and an adder connected to the switch. The method includes the steps of: a) receiving input data at an input data rate by the multiplier; b) routing the output result from the multiplier to the switch; c) the switch selecting an adder input from a set of adder inputs, where the set of adder inputs includes the output result, where the selecting is responsive to contents of a control register, and where the control register has a clock rate that is a function of the input data rate; and d) receiving the adder input by the adder.
A programmable logic device in accordance with one embodiment includes a number of conventional PLD components, including a plurality of configurable logic blocks and some configurable interconnect resources, and some dynamic DSP resources. The dynamic DSP resources are, in one embodiment, a plurality of DSP slices, including at least a DSP slice and at least one upstream DSP slice or at least one downstream DSP slice. A configuration memory stores configuration data defining a circuit configuration of the logic blocks, interconnect resources, and DSP slices.
In one embodiment, each DSP slice includes a product generator followed by an adder. In support of dynamic functionality, each DSP slice additionally includes multiplexing circuitry that controls the inputs to the adder based upon the contents of a mode register. Depending upon the contents of the mode register, and consequent connectivity of the multiplexing circuitry, the adder can add various combinations of addends. The selected addends in a given slice can then be altered dynamically by issuing different sets of mode control signals to the respective mode register.
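The mode-register behavior described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the class name `DspSlice`, the `Addend` encodings, and the one-call-per-clock interface are hypothetical and do not reflect the actual mode-register bit layout or pipeline depth.

```python
from enum import Enum


class Addend(Enum):
    """Hypothetical mode-register encodings selecting the adder's second addend."""
    ZERO = 0      # add a constant zero (plain multiply)
    CASCADE = 1   # add the output of the upstream slice
    FEEDBACK = 2  # add this slice's own output (accumulate)


class DspSlice:
    """Behavioral model: product generator -> multiplexer -> adder -> output register."""

    def __init__(self):
        self.mode = Addend.ZERO  # stands in for the mode register contents
        self.out = 0             # stands in for the output register

    def clock(self, a, b, cascade_in=0):
        # The multiplexer picks the second addend based on the mode register.
        if self.mode is Addend.ZERO:
            second = 0
        elif self.mode is Addend.CASCADE:
            second = cascade_in
        else:
            second = self.out
        self.out = a * b + second
        return self.out


# Same slice, three functions, selected only by rewriting the mode register.
s = DspSlice()
plain = s.clock(3, 4)               # multiply
s.mode = Addend.FEEDBACK
accumulated = s.clock(2, 5)         # multiply-accumulate onto the previous output
s.mode = Addend.CASCADE
cascaded = s.clock(1, 1, cascade_in=100)  # multiply-add with an upstream result
```

In this sketch, changing `mode` between clock calls plays the role of issuing a new set of mode control signals; no other state of the model is rebuilt.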
The ability to alter DSP functionality dynamically supports complex, sequential DSP functionality in which two or more portions of a DSP algorithm are executed at different times by the same DSP resources. In some embodiments, a state machine instantiated in programmable logic issues the mode control signals that control the dynamic functionality of the DSP resources. Some PLDs include embedded microprocessors or microcontrollers and emulated microprocessors (such as MicroBlaze™ from Xilinx Inc. of San Jose, Calif.), and these too can issue mode control signals in place of or in addition to the state machine.
DSP slices in accordance with some embodiments include programmable operand input registers that can be configured to introduce different amounts of delay, from zero to two clock cycles, for example. In one such embodiment, each DSP slice includes a product generator having a multiplier port, a multiplicand port, and one or more product ports. The multiplier and multiplicand ports connect to the operand input ports via respective first and second operand input registers, each of which is capable of introducing from zero to two clock cycles of delay. In one embodiment, the output of at least one operand input register connects to the input of an operand input register of a downstream DSP slice so that operands can be cascaded among a number of slices.
Many DSP circuits and configurations multiply numbers with many digits or bits to create products with significantly more digits or bits. Manipulating large, unnecessarily precise products is cumbersome and resource intensive, so such products are often rounded to some desired number of bits. Some embodiments employ a fast, flexible rounding scheme that requires few additional resources and that can be adjusted dynamically to change the number of bits involved in the rounding.
DSP slices adapted to provide dynamic rounding in accordance with one embodiment include an additional operand input port receiving a rounding constant and a correction circuit that develops a correction factor based upon the sign of the number to be rounded. An adder then adds the number to be rounded to the correction factor and the rounding constant to produce the rounded result. In one embodiment, the correction circuit calculates the correction factor from the signs of a multiplier and a multiplicand so the correction factor is ready in advance of the product of the multiplier and multiplicand.
In a method of rounding to the nearest integer, carried out by a DSP slice adapted in accordance with one embodiment, the DSP slice stores a rounding constant selected from the group of binary numbers 2^(N−1) and 2^(N−1)−1, calculates a correction factor from a multiplier sign bit and a multiplicand sign bit, and sums the rounding constant, the correction factor, and the product to obtain the rounded product (where N is a positive number). The N least significant bits of the rounded product are then dropped.
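The rounding method can be illustrated with a short sketch. The summary does not spell out how the correction factor is derived from the sign bits, so the rule used here is an assumption: the rounding constant is 2^(N−1)−1 and the correction factor is 1 when the two sign bits agree (non-negative product) and 0 otherwise. Under that assumption the result is rounding to nearest with ties away from zero. The function name `round_product` is hypothetical.

```python
def round_product(multiplier, multiplicand, n):
    """Round multiplier*multiplicand to drop its n least significant bits.

    Assumed scheme (not confirmed by the text): constant = 2**(n-1) - 1,
    correction = 1 if the operand sign bits match, else 0.
    """
    if n < 1:
        raise ValueError("n must be a positive number of bits")
    rounding_constant = (1 << (n - 1)) - 1
    # The correction factor depends only on the operand signs, so it can be
    # computed before the product itself is available.
    sign_a = multiplier < 0
    sign_b = multiplicand < 0
    correction = 0 if (sign_a != sign_b) else 1
    total = multiplier * multiplicand + rounding_constant + correction
    # Dropping the n least significant bits is an arithmetic right shift.
    return total >> n


# 5 * 2 = 10, which is 2.5 when the two dropped bits are read as a fraction.
positive_tie = round_product(5, 2, 2)
negative_tie = round_product(-5, 2, 2)
```

Because the adder sums three terms in one pass, the rounding in this sketch costs no extra arithmetic stage beyond the addition that completes the multiply.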
DSP slices described herein conventionally include a product generator, which produces a pair of partial products, followed by an adder that sums the partial products. In accordance with one embodiment, the flexibility of the DSP slices is improved by providing multiplexer circuitry between the product generator and the adder. The multiplexer circuitry can provide the partial products to the adder, as is conventional, and can select from a number of additional addend inputs. The additional addends include inputs and outputs cascaded from upstream slices and the output of the corresponding DSP slice. In some embodiments, a mode register controls the multiplexing circuitry, allowing the selected addends to be switched dynamically.
This summary does not limit the invention, which is instead defined by the claims.
The
The following discussion is divided into five sections, each detailing methods and systems for providing improved DSP performance and lower power dissipation. These embodiments are described in connection with a field-programmable gate array (FPGA) architecture, but the methods and circuits described herein are not limited to FPGAs; in general, any integrated circuit (IC) including an application specific integrated circuit (ASIC) and/or an IC which includes a plurality of programmable function elements and/or a plurality of programmable routing resources and/or an IC having a microprocessor or microcontroller, is also within the scope of the present invention. Examples of programmable function elements are CLBs, logic blocks, logic array blocks, macrocells, logic cells, logic cell arrays, multi-gigabit transceivers (MGTs), application specific circuits, and functional blocks. Examples of programmable routing resources include programmable interconnection points. Furthermore, embodiments of the invention may be incorporated into integrated circuits not typically referred to as programmable logic, such as integrated circuits dedicated for use in signal processing, so-called “systems-on-a-chip,” etc.
For illustration purposes, specific bus sizes are given (for example, 18-bit input buses and 48-bit output buses), as are example register sizes (such as 7 bits for the Opmode register). However, it should be clear to one of ordinary skill in the art that many other bus and register sizes may be used and still be within the scope of the present invention.
DSP Architecture with Cascading DSP Slices
In some FPGAs, each programmable tile includes programmable interconnect elements, i.e., switch (SW) 120 having standardized connections to and from a corresponding switch in each adjacent tile. Therefore, the switches 120 taken together implement the programmable interconnect structure for the illustrated FPGA. As shown by the example of a LB tile 182 at the top of
A BRAM 182 can include a BRAM logic element (BRL 194) in addition to one or more switches. Typically, the number of switches 120 included in a tile depends on the height of the tile. In the pictured embodiment, a BRAM tile has the same height as four CLBs, but other numbers (e.g., five) can also be used. A DSP tile 205 can include, for example, two DSP slices (DSPS 212) in addition to an appropriate number of switches (in this example, four switches 120). An IOB 184 can include, for example, two instances of an input/output logic element (IOL 195) in addition to one instance of the switch 120. As will be clear to those of skill in the art, the actual I/O pads connected, for example, to the I/O logic element 184 are manufactured using metal layered above the various illustrated logic blocks, and typically are not confined to the area of the input/output logic element 184.
In the pictured embodiment, a columnar area near the center of the die (shown shaded in
Some FPGAs utilizing the architecture illustrated in
Note that
For tile 205-1 incoming signals arrive at slices 212-1 and 212-2 on input bus 222. Outgoing signals from OUT_1 and OUT_2 ports are connected to the general interconnect resources via output bus 224.
Respective input and output buses 222 and 224 and the related general interconnect may be too slow, area intensive, or power hungry for some applications. Each DSP slice 212, e.g., 212-1, 212-2, 212-3, and 212-4 (collectively referred to as DSP slice 212), therefore includes two high-speed DSP-slice output ports, an input-downstream cascade (IDC) port and an OUT port, connected to an input-upstream cascade (IUC) port and an upstream-output-cascade (UOC) port, respectively, of an adjacent DSP slice. (As with other designations herein, IDC, accumulate feedback (ACC), IUC, and UOC refer both to signals and their corresponding physical nodes, ports, lines, or terminals; whether a given designation refers to a signal or a physical structure will be clear from the context.)
In the example of
On the input side, DSP logic 307 includes three operand input ports A, B, and C, each of which programmably connects to the general interconnect via a dedicated operand bus. Operand input ports C for both slices 212, e.g., slices 212-1 and 212-2, of a given DSP tile 205, e.g., tile 205-1, share an operand bus and an associated operand register 300, e.g., register 300-1 (i.e., the C register). On the output side, DSP logic 307, e.g., 307-1, and 307-2, has an output port OUT, e.g., OUT1 and OUT2, programmably connected to the general interconnect via bus 175.
Each DSP slice 212 includes the following direct connections that facilitate high-speed DSP operations:
Using
DSP slice 330 receives data from an upstream DSP tile via the IUC and UOC input ports. The IDC and OUT output ports of DSP slice 330 are connected to the IUC and UOC input ports, respectively, of DSP slice 332. DSP slice 332 sends data to a downstream DSP tile via the IDC and OUT output ports.
In
Routing logic 395 receives inputs from optional register 394, UOC (this is connected to output-downstream cascade (ODC) port of optional pipeline register and routing logic 398 from slice 391), from optional pipeline register and routing logic 392 and feedback from optional pipeline register and routing logic 397. Two outputs from routing logic 395 are input into adder 396 for addition or subtraction. In another embodiment adder 396 may be replaced by an arithmetic logic unit (ALU) to perform logic and/or arithmetic operations. The output of adder 396 is sent to an optional pipeline register and routing logic 397. The output of optional pipeline register and routing logic 397 is OUT which goes to other circuitry on the IC, to routing logic 395 and to ODC which is connected to a downstream slice (not shown).
In an alternative embodiment the OUT of slice 390 can be directly connected to the C input (or A or B input) of an adjacent horizontal slice (not shown). Both slices have substantially the same structure. Hence in various embodiments of the present invention slices may be cascaded vertically or horizontally or both.
The first switch 630 and the second switch 634 in one embodiment include multiplexers having select lines connected to one or more registers. The registers' contents may be changed, if needed, on the order of magnitude of the input data rate (or output data rate). In another embodiment, the first switch 630 has one or more multiplexers whose select lines are connected to configuration memory cells and may only be changed by changing the contents of the configuration memory. A further explanation of reconfiguration is disclosed in U.S. patent application Ser. No. 10/377,857, entitled “Reconfiguration of a Programmable Logic Device Using Internal Control” by Brandon J. Blodget, et al., and filed Feb. 28, 2003, which is herein incorporated by reference. In this embodiment, as in the previous one, the second switch 634 has its select lines connected to a register (e.g., one or more flip-flops). In yet another embodiment, the select lines of both the first switch 630 and the second switch 634 are connected to configuration memory cells. And in still another embodiment, the first switch 630 select lines are connected to a register and the second switch 634 select lines are connected to configuration memory cells.
The switches 630 and 634 may include input and/or output queues such as FIFOs (first-in-first-out queues), pipeline registers, and/or buffers. The multiplier circuit 632 and adder circuit 636 may include one or more output registers or pipeline registers or queues. In one embodiment the first switch 630 and multiplier circuit 632 are absent and the DSP element 660-1 has second switch 634 which receives input line 640 and is connected to adder circuit 636. In yet another embodiment multiplier circuit 632 and/or adder circuit 636 are replaced by arithmetic circuits, that may perform one or more mathematical functions.
As stated earlier embodiments of the present invention are not limited to PLDs or FPGAs, but also include ASICs. In one embodiment, the slice design such as those shown in
Tiles DSPT0 and DSPT1 are identical, each including a pair of identical DSP slices DSPS0 and DSPS1. Each DSP slice in turn includes:
Mode registers 310 connect to the select terminals of multiplexers 420 and 424 and to a control input of adder 426. FPGA 400 can be initially configured so that slices 212 define a desired DSP configuration; and control signals are loaded into mode registers 310 initially and at any further time during device operation via general interconnect 405.
In slice DSPS0 of tile DSPT0, mode register 310 contains mode control signals that operate on multiplexers 420 and 424 and adder 426 to cause the slice to add the product stored in pipeline register 418 to the logic-zero voltage level 422 (i.e., to add zero to the contents of register 418). The mode registers 310 of each of the three downstream slices include a different set of mode control signals that cause each downstream slice to add the product in the respective pipeline register 418 to the output of the upstream slice.
Y3(N−3)=X(N)H0+X(N−1)H1+X(N−2)H2+X(N−3)H3 (1)
Table 550 provides the output signals OUT0, OUT1, OUT2, and OUT3 of corresponding DSP slices of
Beginning at clock cycle zero, the first input X(0) is latched into each register 414 in the four slices and the four filter coefficients H0-H3 are each latched into one of registers 412 in a respective slice. Each data/coefficient pair is thus made available to a respective product generator 416. Next, at clock cycle one, the products from product generators 416 are latched into respective registers 418. Thus, for example, register 418 within the left-most DSP slice stores product X(0)H3. Up to this point, as shown in Table 550, no data has yet reached product registers 430, so outputs OUT0-OUT3 provide zeroes from each respective slice.
Adders 426 in each slice add the product in the respective register 418 to a second selected addend. In the left-most slice, the selected addend is a hard-wired number zero, so output register 430 captures the contents of register 418, or X(0)H3, in clock cycle two and presents this product as output OUT0. In the remaining three slices, the selected addend is the output of an upstream slice. The upstream slices all output zero prior to receipt of clock cycle two, so the right-most three slices latch the contents of their respective registers 418 into their respective output registers 430.
The cascade interconnections between slices begin to take effect upon receipt of clock cycle 3. Each downstream slice sums the output from the upstream slice with the product stored in the respective register 418. The products from upstream slices are thus cascaded and summed until the right-most DSP slice provides the filtered output Y3(N−3) on a like-named output port. For ease of illustration, FIR filter 500 is limited to two tiles DSPT0 and DSPT1 instantiating a four-tap filter. DSP circuits in accordance with other embodiments include a great many more DSP tiles, and thus support filter configurations having far more taps. Assuming additional tiles, FIR filter 500 of
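The cycle-by-cycle behavior walked through above can be sketched as a small register-level simulation. This is a simplified illustration, not a model of the actual circuit: it assumes the coefficients are held constant rather than latched through registers 412, treats every register as clocking on the same edge, and uses the hypothetical function name `transposed_fir`.

```python
def transposed_fir(samples, coeffs):
    """Simulate a cascade of slices computing y(n) = sum_k x(n-k) * H_k.

    coeffs is [H0, H1, ...]; the left-most slice holds the last coefficient,
    matching the walk-through in which the left-most slice forms X(0)H3.
    """
    taps = len(coeffs)
    slice_coeff = coeffs[::-1]  # left-most slice first
    operand = [0] * taps        # operand registers (414 in the text)
    product = [0] * taps        # product registers (418 in the text)
    out = [0] * taps            # output registers (430 in the text)
    results = []
    # Two extra zero samples flush the two pipeline stages ahead of the adder.
    for x in list(samples) + [0, 0]:
        # Next state is computed from the current state: all registers clock together.
        new_out = [product[k] + (out[k - 1] if k else 0) for k in range(taps)]
        new_product = [operand[k] * slice_coeff[k] for k in range(taps)]
        new_operand = [x] * taps  # each sample is broadcast to every slice
        out, product, operand = new_out, new_product, new_operand
        results.append(out[-1])   # right-most slice provides the filter output
    return results[2:]            # discard the two cycles of pipeline latency


filtered = transposed_fir([1, 2, 3, 4, 5], [1, 10, 100, 1000])
```

With these decimal coefficients each output digit shows which sample reached which tap, so the fourth output, 1234, corresponds to X(3)H0+X(2)H1+X(1)H2+X(0)H3.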
Dynamic Processing
In the example of
The time it takes to set or update a set of bits in the configuration memory depends upon both the configuration clock speed and the number of bits to be set or updated. For example, updated bits belong to one or more frames, and these updated frames are then sent in byte-serial format to the configuration memory. As an example, assume a 50 MHz configuration clock and 16-bit words, giving a configuration rate of 16×50 million, or 800 million, bits per second. Assume there are 10,000 bits in one frame. It then takes 10,000/800,000,000 seconds, or about 12.5 microseconds, to update one frame (or any portion thereof) in the configuration memory. Even if the OpMode register were to use the same 50 MHz configuration clock, the OpMode register would be reprogrammed in one clock cycle, or 20 nanoseconds. Thus there is a significant time difference between setting or updating the configuration memory and changing the OpMode register.
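The arithmetic in this example can be checked with a few lines of code. The figures (50 MHz clock, 16-bit words, 10,000-bit frame) are the ones given above; the function and constant names are hypothetical.

```python
CONFIG_CLOCK_HZ = 50_000_000  # configuration clock from the example
WORD_BITS = 16                # bits transferred per configuration clock
FRAME_BITS = 10_000           # assumed frame size from the example


def frame_update_seconds(frame_bits=FRAME_BITS,
                         clock_hz=CONFIG_CLOCK_HZ,
                         word_bits=WORD_BITS):
    """Time to rewrite one configuration frame at the given transfer rate."""
    return frame_bits / (clock_hz * word_bits)


def mode_register_update_seconds(clock_hz=CONFIG_CLOCK_HZ):
    """Time to rewrite a mode register, assuming it takes one clock cycle."""
    return 1 / clock_hz


frame_time = frame_update_seconds()          # 12.5 microseconds
mode_time = mode_register_update_seconds()   # 20 nanoseconds
speedup = frame_time / mode_time             # about 625x for these figures
```

The ratio depends only on frame size and word width, so it holds for any shared clock frequency.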
Multiplying a first pair of complex numbers a+jb and c+jd provides the following complex product:
R1+jI1=(a+jb)(c+jd)=(ac−bd)+j(bc+ad)=ac−bd+jbc+jad (2)
Similarly, multiplying a second pair of complex numbers e+jf and g+jh provides:
R2+jI2=(e+jf)(g+jh)=(eg−fh)+j(fg+eh)=eg−fh+jfg+jeh (3)
Summing the products of equations (2) and (3) gives:
(R1+jI1)+(R2+jI2)=ac−bd+jbc+jad+eg−fh+jfg+jeh (4)
Rearranging the terms into real/real, imaginary/imaginary, imaginary/real, and real/imaginary product types gives:
(R1+jI1)+(R2+jI2)=(ac+eg)+(−bd−fh)+(jbc+jfg)+(jad+jeh) (5)
or
(R1+jI1)+(R2+jI2)=R[(ac+eg)+(−bd−fh)]+I[(bc+fg)+(ad+eh)] (6)
The foregoing illustrates that the sum of a series of complex products can be obtained by accumulating each of the four product types and then summing the resulting pair of real numbers and the resulting pair of imaginary numbers. These operations can be extended to any number of pairs, but are limited here to two pairs of complex numbers for ease of illustration.
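The four-accumulator decomposition of equations (5) and (6) can be sketched as follows. Each accumulator stands in for one DSP slice; the function name `complex_sum_of_products` and the tuple layout of the inputs are illustrative choices, not taken from the text.

```python
def complex_sum_of_products(pairs):
    """Sum (a + jb)(c + jd) over all pairs using four real accumulators.

    pairs is an iterable of ((a, b), (c, d)) tuples of real numbers.
    Returns (real_part, imaginary_part).
    """
    real_real = 0   # accumulates  a*c  (e.g., ac + eg)
    imag_imag = 0   # accumulates -b*d  (e.g., -bd - fh)
    imag_real = 0   # accumulates  b*c  (e.g., bc + fg)
    real_imag = 0   # accumulates  a*d  (e.g., ad + eh)
    for (a, b), (c, d) in pairs:
        real_real += a * c
        imag_imag -= b * d
        imag_real += b * c
        real_imag += a * d
    # Final step: sum the two real accumulators and the two imaginary ones.
    return real_real + imag_imag, imag_real + real_imag


# (1 + 2j)(3 + 4j) + (5 + 6j)(7 + 8j)
result = complex_sum_of_products([((1, 2), (3, 4)), ((5, 6), (7, 8))])
```

The accumulation loop and the final pairwise sums are separate phases, which is why the same four multiply-accumulate resources can be reused for the last step once the accumulation is complete.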
In
DSP slice DSPS0 of tile DSPT0 receives the series of real/real pairs AR(N) and BR(N). Product generator 416 multiplies each pair, and adder 426 adds the resulting product to the contents of output register 430. Output register 430 is preset to zero, and so contains the sum of N real/real products after N+2 clock cycles. The two additional clock cycles are required to move the data through registers 412, 414, and 418. The resulting sum of products is analogous to the first real sum ac+eg of equation 6 above. In another embodiment, output registers 430 need not be preset to zero: state machine 610 can configure multiplexer 424 to inject zero into adder 426 at the time the first product is received. That is, the first data point of each new vector operation is not added to the current contents of output register 430; the Opmode is set to standard flow-through mode without the ACC feedback.
DSP slice DSPS1 of tile DSPT0 receives the series of imaginary/imaginary pairs AI(N) and BI(N). Product generator 416 multiplies each pair, and adder 426 subtracts the resulting product from the contents of output register 430. Output register 430 thus contains the negative sum of N imaginary/imaginary products after N+2 clock cycles. The resulting sum of products is analogous to the second real sum −bd−fh of equation 6 above.
DSP slice DSPS0 of tile DSPT1 receives the series of real/imaginary pairs AR(N) and BI(N). Product generator 416 multiplies each pair, and adder 426 adds the resulting product to the contents of output register 430. Output register 430 thus contains the sum of N real/imaginary products after N+2 clock cycles. The resulting sum of products is analogous to the first imaginary sum bc+fg of equation 6 above.
Finally, DSP slice DSPS1 of tile DSPT1 receives the series of imaginary/real pairs AI(N) and BR(N). Product generator 416 multiplies each pair, and adder 426 adds the resulting product to the contents of output register 430. Output register 430 thus contains the sum of N imaginary/real products after N+2 clock cycles. The resulting sum of products is analogous to the second imaginary sum ad+eh of equation 6 above.
Once all the product pairs are accumulated in registers 430, state machine 605 alters the contents of mode registers 310 to reconfigure the four DSP slices to add the two cumulative real sums (e.g., ac+eg and −bd−fh) and the two cumulative imaginary sums (e.g., bc+fg and ad+eh). The resulting configuration 655 is illustrated in
In configuration 655, DSP slice DSPS1 of tile DSPT0 adds the output OUT0 of DSP slice DSPS0, available on upstream output cascade port UOC, to its own output OUT1. As discussed above in connection with
DSP Slices with Pipelining Resources
The contents of register 310 in DSP slice DSPS1 of tile DSPT0 configure that slice to subtract the real product of the imaginary components AI and BI of complex numbers AR+jAI and BR+jBI from the contents of register 430 of upstream slice DSPS0. Slice DSPS1 then stores the resulting real product PR in register 430 within DSPS1 of tile DSPT0. The input register 705 of slice DSPS1 is configured to impose a two-cycle delay so that the output of the upstream slice DSPS0 is available to add to register 418 of slice DSPS1 at the appropriate clock cycle.
DSP tile DSPT1 works in a similar manner to DSP tile DSPT0 to calculate the imaginary product PI of the same two complex numbers. The contents of register 310 in DSP slice DSPS0 of tile DSPT1 configure that slice to add zero to the imaginary product of the real component AR and imaginary component BI of complex numbers AR+jAI and BR+jBI and store the result in the corresponding register 430. The associated input register 705 is configured to impose one clock cycle of delay. The contents of register 310 in DSP slice DSPS1 of tile DSPT1 configure that slice to add the imaginary product of the imaginary component AI and real component BR to the contents of register 430 of the upstream slice DSPS0. Slice DSPS1 of tile DSPT1 then stores the resulting imaginary product PI in register 430 within DSPS1 of tile DSPT1. The input register 705 of DSP slice DSPS1 is configured to impose two clock cycles of delay so that the output of upstream slice DSPS0 is available to add to register 418 of slice DSPS1.
The configuration of
Each DSP slice of FPGA 900 includes a multiplexer 905 that facilitates pipelining of operands. Multiplexer 424 in each slice includes an additional input port connected to the output of the upstream slice via a shifter 910. Shifter 910 reduces the amount of resources required to instantiate some DSP circuits. The generic example of
Let B=011 and A=00110. The MSB zeroes indicate that A and B are both positive numbers. The product P of A and B is therefore 00010010. Stated mathematically,
P=A×B=00110×011=00010010 (7)
A is broken into two signed numbers A0 and A1, in which case a zero is placed in front of the two least-significant bits to create a positive signed number A0. (This zero stuffing of the LSBs is used for both positive and negative values of A). Thus, A1=001 and A0=010.
DSP slices DSPS0 and DSPS1, as configured in
Input register 705 of slice DSPS0 is configured to introduce just one clock cycle of delay using a single register 710 and a single register 715. After three clock cycles, register 430 contains the product of A0 and B, or 010×011=000110. The two low-order bits of register 430 are provided to a register 434 in the general interconnect 405 as the two low-order product bits P(1:0). In this example, the two low-order bits are “10” (i.e., the logic level on line P(0) is representative of a logic zero, and the logic level on line P(1) is representative of a logic one).
Multiplexer 905 of slice DSPS1 is configured to select input-upstream cascade port IUC, which is connected to the corresponding input-downstream-cascade port IDC of upstream slice DSPS0. Operand B is therefore provided to slice DSPS1 after the one clock cycle of delay imposed by register 705 of slice DSPS0.
Input register 705 of slice DSPS1 is configured to introduce one additional clock cycle of delay on operand B from slice DSPS0 and two cycles of delay on operand A1. The extra clock cycle of delay, as compared with the single clock cycle imposed on operand A0, means that after three clock cycles, register 418 of slice DSPS1 contains the product of A1 and B (001×011=000011) when register 430 of slice DSPS0 contains the product of A0 and B (000110).
Shifter 910 of slice DSPS1 shifts the contents of the corresponding register 430 (000110) two bits to the right while extending the sign bit to fill the resulting new high-order bits, giving 000001. Then, during the fourth clock cycle, slice DSPS1 adds the contents of the associated register 418 to the right-shifted value from slice DSPS0 (000001+000011) and stores the result (000100) in register 430 of slice DSPS1 as the six most significant product bits P(7:2). Combining the low- and high-order product bits P(7:2)=000100 and P(1:0)=10 gives P=00010010. This result is in agreement with the product given in equation 7 above.
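The foregoing example can be checked with a short behavioral sketch. It models only the arithmetic (splitting A, forming two products, shifting, and adding) and not the register delays. The function name split_multiply and the parameter low_bits are hypothetical, and Python integers stand in for the signed buses.

```python
def split_multiply(a, b, low_bits=2):
    """Multiply signed a by signed b by splitting a into high and low parts."""
    mask = (1 << low_bits) - 1
    a0 = a & mask                  # low-order bits, zero-stuffed so A0 is non-negative
    a1 = a >> low_bits             # high-order bits keep the sign of A
    p_low = a0 * b                 # upstream slice: A0 x B
    p_high = a1 * b                # downstream slice: A1 x B
    low_product_bits = p_low & mask           # low-order product bits, e.g. P(1:0)
    shifted = p_low >> low_bits               # sign-extending right shift
    high_product_bits = p_high + shifted      # high-order product bits, e.g. P(7:2)
    return (high_product_bits << low_bits) | low_product_bits
```

With A=00110 (6) and B=011 (3), the sketch forms A0×B=6 and A1×B=3, shifts 6 right by two bits to obtain 1, and returns 18, i.e., 00010010.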
In
Y3(N−4)=X(N−4)H0+X(N−5)H1+X(N−6)H2+X(N−7)H3 (8)
Table 1250 illustrates the operation of FIR filter 1200 by presenting the outputs of registers 710, 715, 418, and 1205 for each DSP slice of
Y3(N−6)=X(N−6)H0+X(N−7)H1+X(N−8)H2+X(N−9)H3 (9)
Table 1350 illustrates the operation of FIR filter 1300 by presenting the outputs of registers 710, 715, 418, and 1205 for each DSP slice of
Mode registers 310 store mode control signals that configure FPGA 1400 to operate as a cascaded, integrator-comb, decimation filter that operates on input data X(N), wherein N is, e.g., four. Slices DSPS0 and DSPS1 of tile DSPT0 form a two-stage integrator. Slice DSPS0 accumulates the input data X(N) from register 300 in output register 1205 to produce output data Y0(N)[47:0], which is conveyed to multiplexer 424 of the downstream slice DSPS1. The downstream slice accumulates the accumulated results from upstream slice DSPS0 in corresponding output register 1205 to produce output data Y1(N)[47:0]. Data Y1(N)[35:0] is conveyed to the A and B inputs of slice DSPS0 of tile DSPT1 via the general interconnect.
Slices DSPS0 and DSPS1 of tile DSPT1 form a two-stage comb filter. Slice DSPS0 of tile DSPT1 subtracts Y1(N−2) from Y1(N) to produce output Y2(N). Slice DSPS1 of tile DSPT1 repeats the same operation on Y2(N) to produce filtered output Y3(N)[35:0].
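The integrator and comb stages described above can be sketched behaviorally as follows. This is a simplified model under stated assumptions: it produces one output per input sample, so the rate change of the decimation filter, the pipeline latency, and the 48-bit word length are all omitted. The function name cic_filter and the parameter comb_delay are hypothetical.

```python
def cic_filter(samples, comb_delay=2):
    """Two integrators followed by two combs, one output per input sample."""
    y0 = y1 = 0                      # integrator accumulators (output registers)
    y1_hist = [0] * comb_delay       # Y1(N-1), Y1(N-2), most recent first
    y2_hist = [0] * comb_delay       # Y2(N-1), Y2(N-2), most recent first
    out = []
    for x in samples:
        y0 += x                      # first integrator: Y0(N)
        y1 += y0                     # second integrator: Y1(N)
        y2 = y1 - y1_hist[-1]        # first comb: Y1(N) - Y1(N-2)
        y3 = y2 - y2_hist[-1]        # second comb: Y2(N) - Y2(N-2)
        y1_hist = [y1] + y1_hist[:-1]
        y2_hist = [y2] + y2_hist[:-1]
        out.append(y3)
    return out
```

In this simplified model a unit impulse [1, 0, 0, 0, 0] yields [1, 2, 1, 0, 0], which is the response expected of two cascaded two-sample moving sums.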
Dynamic and Configurable Rounding
Many of the DSP circuits and configurations described herein multiply large numbers to create still larger products. Processing of large, unnecessarily precise products is cumbersome and resource intensive, and so such products are often rounded to some desired number of bits. Some embodiments employ a fast, flexible rounding scheme that requires few additional resources and that can be adjusted dynamically to change the number of bits involved in the rounding.
Slice 1500 is similar to the preceding DSP slices, like-identified elements being the same or similar. Slice 1500 additionally includes a correction circuit 1510 having first and second input terminals connected to the respective sign bits of the first and second operand input ports A and B. Correction circuit 1510 additionally includes an output terminal connected to an input of adder 426. Correction circuit 1510 generates a one-bit correction factor CF based on the multiplier sign bit and the multiplicand sign bit. Adder 426 then adds the product from product generator 416 with an X-bit rounding constant in operand register 300 and correction factor CF to perform the round. The length X of the rounding constant in register 300 determines the rounding point, so the rounding point is easily altered dynamically.
Conventionally, symmetric rounding rounds numbers to the nearest integer (e.g., 2.5 rounds to 3, −2.5 rounds to −3, 1.5<=x<2.5 rounds to 2, and −1.5>=x>−2.5 rounds to −2). To accomplish this in binary arithmetic, one can add a correction factor of 0.1000 for positive numbers or 0.0111 for negative numbers and then truncate the resulting fraction. Changing the number of trailing zeroes in the correction factor for positive numbers or the number of trailing ones in the correction factor for negative numbers changes the rounding point. Slice 1500 is modified to automatically round a user-specified number of bits from both positive and negative numbers.
Next, in step 1610, slice 1500 determines the sign of the number to be rounded. If the number is a product of a multiplier in operand register 715 and a multiplicand in operand register 710 (or vice versa), correction circuit 1510 XNORs the sign bits of the multiplier and multiplicand (e.g. the MSBs of operands A and B) to obtain a logic zero if the signs differ or a logic one if the signs are alike. Determining the inverse of the sign expedites the rounding process, though this advanced signal calculation is unnecessary if the rounding is to be based upon the sign of an already computed value.
If the result is positive (decision 1615), correction circuit 1510 sets correction factor CF to one (step 1620); otherwise, correction circuit 1510 sets correction factor CF to zero (step 1625). Adder 426 then sums rounding constant K, correction factor CF, and the result (e.g., from product generator 416) to obtain the rounded result (step 1630). Finally, the rounded result is truncated to the rounding point N, where N−1 is the number of low-order ones in the rounding constant (step 1635). The rounded result can then be truncated by, for example, conveying only the desired bits to the general interconnect.
Table 1 illustrates rounding off the four least-significant binary bits (i.e., N=4) in accordance with one embodiment. The rounding constant in register 300 is set to include N−1 low-order ones, or 0111. In the first row of Table 1, the decimal value and its binary equivalent BV are positive, so correction factor CF, the XNOR of the signs of the multiplier and multiplicand, is one. Adding binary value BV, rounding constant K, and correction factor CF provides an intermediate rounded value. Truncating the intermediate rounded value to eliminate the N lowest order bits gives the rounded result.
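The rounding steps described above can be sketched as follows. The sketch assumes that the operands are ordinary signed integers whose sign bits are modeled by a comparison with zero, and that Python's floor-based right shift stands in for truncation of a two's complement value. The function name round_product is hypothetical.

```python
def round_product(a, b, n):
    """Round a*b by dropping its n low-order bits, ties away from zero."""
    k = (1 << (n - 1)) - 1                 # rounding constant K: N-1 low-order ones
    sign_a = 1 if a < 0 else 0             # models the MSB of operand A
    sign_b = 1 if b < 0 else 0             # models the MSB of operand B
    cf = 1 - (sign_a ^ sign_b)             # correction factor: XNOR of the sign bits
    intermediate = a * b + k + cf          # adder sums the product, K, and CF
    return intermediate >> n               # truncate the N low-order bits
```

With N=4, a product of 40 (2.5 when the four dropped bits are read as a fraction) gives (40+7+1)>>4 = 3, and a product of −40 gives (−40+7+0)>>4 = −3, matching the symmetric rounding described above.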
Predetermining the sign of the product expedites the rounding process. The above-described examples employ an XNOR of the sign values of a multiplier and multiplicand to predetermine the sign of the resulting product. Other embodiments predetermine sign values for mathematical calculations in addition to multiplication, such as concatenation for numbers formed by concatenating two operands, in which case there is only one sign bit to consider. In such embodiments, mode register 310 instructs correction circuit 1510 to develop an appropriate correction factor CF for a given operation. An embodiment of correction circuit 1510 capable of generating various forms of correction factor in response to mode control signals from mode register 310 is detailed below in connection with
Complex DSP Slice
DSP slice 1700 communicates with other DSP slices and with other resources on an FPGA via the following input and output signals on respective lines or ports:
Slice 1700 includes a B-operand multiplexer 1705 that selects either the B operand of slice 1700 or receives on the IUC port the B operand of the upstream slice. Multiplexer 1705 is controlled by configuration memory cells (not shown) in this embodiment, but might also be controlled dynamically. The purpose of multiplexer 1705 is detailed above in connection with
A pair of two-deep input registers 1710 and 1715 are configurable to introduce zero, one, or two clock cycles of delay on operands A and B, respectively. Embodiments of registers 1710 and 1715 are detailed below in connection with respective
Slice 1700 carries out multiply and add operations using a product generator 1727 and adder 1719, respectively, of an arithmetic circuit 1717. Multiplexing circuitry 1721 between product generator 1727 and adder 1719 allows slice 1700 to inject numerous addends into adder 1719 at the direction of a mode register 1723. These optional addends include operand C, the concatenation A:B of operands A and B, shifted and unshifted versions of the slice output OUT, shifted and unshifted versions of the upstream output cascade UOC, and the contents of a number of memory-cell arrays 1725. Some of the input buses to multiplexing circuitry 1721 carry fewer than 48 bits. These input buses are sign extended or zero filled as appropriate to 48 bits.
A pair of shifters 1726 shift their respective input signals seventeen bits to the right, i.e., towards the LSB, by presenting the input signals on bus lines representative of lower-order bits with sign extension to fill the vacated higher order bits. The purpose of shifters 1726 is discussed above in connection with
Product generator 1727 is conventional (e.g. an AND array followed by array reduction circuitry), and produces two 36-bit partial products PP1 and PP2 from an 18-bit multiplier and an 18-bit multiplicand (where one is a signed partial product and the other is an unsigned partial product). Each partial product is optionally stored for one clock cycle in a configurable pipeline register 1730, which includes a pair of 36-bit registers 1735 and respective programmable bypass multiplexers 1740. Multiplexers 1740 are controlled by configuration memory cells, but might also be dynamic.
Adder 1719 has five input ports: three 48-bit addend ports from multiplexers X, Y, and Z in multiplexer circuitry 1721, a one-bit add/subtract line from a register 1741 connected to subtract port SUB, and a one-bit carry-in port CIN from carry-in logic 1750. Adder 1719 additionally includes a 48-bit sum port connected to output port OUT via a configurable output register 1755, including a 48-bit register 1760 and a configurable bypass multiplexer 1765.
Carry-in logic 1750 develops a carry-in signal CIN to adder 1719, and is controlled by the contents of a carry-in select register 1770, which is programmably connected to carry-in select port CIS. In one mode, carry-in logic 1750 merely conveys carry-in signal CI from the general interconnect to the carry-in terminal CIN of adder 1719. In each of a number of other modes, carry-in logic provides a correction factor CF on carry-in terminal CIN. An embodiment of carry-in logic 1750 is detailed below in connection with
Slice 1700 supports many DSP operations, including all those discussed above in connection with previous figures. The operation of slice 1700 is defined by memory cells (not shown) that control a number of configurable elements, including the depth of registers 1710 and 1715, the selected input port of multiplexer 1705, the states of bypass multiplexers 1740 and 1765, and the contents of registers 1725. Other elements of slice 1700 are controlled by the contents of registers that can be written to without reconfiguring the FPGA or other device of which slice 1700 is a part. Such dynamically controlled elements include multiplexing circuitry 1721, controlled by mode register 1723, and carry-in logic 1750, jointly controlled by mode register 1723 and carry-in-select register 1770. More or fewer components of slice 1700 can be made to be dynamically controlled in other embodiments. Registers storing dynamic control bits are collectively referred to as an OpMode register.
The following Table 2A lists various operational modes, or “op-modes,” supported by the embodiment of slice 1700 depicted in
Hold OUT
Feedback Add
OUT Cascade Feedback Add
OUT Cascade Feedback Add Add
Hold OUT
Double Feedback Add
Feedback Add
Multiply-Accumulate
Feedback Add
Double Feedback Add
Feedback Add Add
Feedback Add
Double Add Feedback Add
17-Bit Shift OUT Cascade Feedback Add
17-Bit Shift OUT Cascade Feedback Add Add
17-Bit Shift Feedback
17-Bit Shift Feedback Feedback Add
17-Bit Shift Feedback Add
17-Bit Shift Feedback Multiply Add
17-Bit Shift Feedback Add
17-Bit Shift Feedback Feedback Add Add
17-Bit Shift Feedback Add Add
Table 2B with reference to
Different slices configured using the foregoing operational modes can be combined to perform many complex, “composite” operations. Table 3, below, lists a few composite modes that combine differently configured slices to perform complex DSP operations. The columns of Table 3 are as follows: “composite mode” describes the function performed; “slice” numbers identify ones of a number of adjacent slices employed in the respective composite mode, lower numbers corresponding to upstream slices; “OpMode” describes the operational mode of each designated slice; input “A” is the A operand for a given OpMode; input “B” is the B operand for a given OpMode; and input “C” is the C operand for a given OpMode (“X” indicates the absence of a C operand, and RND identifies a rounding constant of the type described above in connection with
The following Table 4 correlates the composite modes of Table 3 with appropriate operational-mode signals, or “OpMode” signals, and register settings, where:
The columns of Table 5 are as follows: “sequential mode” describes the function performed; “slice” numbers identify one or more slices employed in the respective sequential mode, lower numbers corresponding to upstream slices; “Cycle #” identifies the sequence order of number of operational modes used in a given sequential mode; “OpMode” describes the operational modes for each cycle #; and “OpMode<6:0>” define the 7-bit mode-control signals to the Z, Y, and X multiplexers (see
Table 6, below, correlates the dynamic operational modes of Table 5 with the appropriate inputs, where input “A” is the A operand for a given Cycle #; input “B” is the B operand for a given Cycle #; input “C” is the C operand for a given Cycle # (“X” indicates the absence of a C operand); and “Output” is the output, identified by slice, for a given Cycle #.
Carry-in logic 1750 conventionally delivers carry-in signal CI to adder 1719 (
CINSEL=00: Multiplexer 1915 provides carry-in input CI to adder 1719 via carry-in line CIN.
CINSEL=01: Multiplexer 1915 provides the output of multiplexer 1920 to adder 1719. If slice 1700 is configured to round a product from product generator 1727, OpMode bit OM[1] will be a logic zero. In that case, multiplexer 1920 provides an XNOR of the sign bits of operands A and B to register 1935 and multiplexer 1915. The carry-in signal on line CIN will therefore be the correction factor CF discussed above in connection with
CINSEL=10: This functionality is the same as when CINSEL=01, except that the output of multiplexer 1920 is taken from register 1935. Signal CINSEL is set to 10 when registers 1735 (
CINSEL=11: Multiplexer 1925 decodes OpMode bits OM[6,5,4,1,0] to determine whether slice 1700 is rounding its own output OUT, as for an accumulate operation, or the output of an upstream slice, as for a cascade operation. Accumulate operations select the sign bit OUT[47] of the output of slice 1700, whereas cascade operations select the sign bit UOC[47] of upstream-output-cascade bus UOC. The select terminals of multiplexer 1925 decode the OpMode bits as follows: SELP47=(OM[1]&˜OM[0]) ∥OM[5]∥˜OM[6]∥OM[4], where “&” denotes the AND function, “∥” the OR function, and “˜” the NOT function.
Register 1710, the “A” register, includes two 18-bit collections of cascaded storage elements 2000 and 2005 and a bypass multiplexer 2010. Multiplexer 2010 can be configured to delay A operands by zero, one, or two clock cycles by selecting the appropriate input port. Multiplexer 2010 is controlled by configuration memory cells (not shown) in this embodiment, but might also be controlled dynamically, as by an OpMode register. In the foregoing examples, such as in
It is sometimes desirable to alter operands without interrupting signal processing. It may be beneficial, for example, to change the filter coefficients of a signal-processing configuration without having to halt processing. Storage elements 2000 and 2005 are therefore equipped, in some embodiments, with separate, dynamic enable inputs. One storage element, e.g., 2005, can therefore provide filter coefficients, via multiplexer 2010, while the other storage element, e.g., 2000, is updated with new coefficients. Multiplexer 2010 can then be switched between cycles to output the new coefficients. In an alternative embodiment, register 2000 is enabled to transfer data to adjacent register 2005. In other embodiments, the Q outputs of registers 2000 can be cascaded to the D inputs of registers 2000 in adjacent slices so that new filter coefficients can be shifted into registers 2000 while registers 2005 hold previous filter coefficients. The newly updated coefficients can then be applied by enabling registers 2005 to capture the new coefficients from corresponding registers 2000 on the next clock edge.
Arithmetic Circuit with Multiplexed Addend Input Terminals
The multiplexing circuitry of arithmetic circuit 2600 includes an X multiplexer 2605 dynamically controlled by two low-order OpMode bits OM[1:0], a Y multiplexer 2610 dynamically controlled by two mid-level OpMode bits OM[3:2], and a Z multiplexer 2615 dynamically controlled by the three high-order OpMode bits OM[6:4]. OpMode bits OM[6:0] thus determine which of the various input ports present data to adder 1719. Multiplexers 2605, 2610, and 2615 each include input ports that receive addends from sources other than product generator 1727, which are referred to collectively as “PG bypass ports.” In this example, the PG bypass ports are connected to the OUT port, i.e., OUT[0:47]; the concatenation of operands A and B, A:B[0:35]; the C operand; upstream-output-cascade bus UOC; and various collections of terminals held at voltage levels representative of logic zero. Other embodiments may use more or fewer PG bypass ports that provide the same or different functionality as the ports of
If the sum of the outputs of X multiplexer 2605, Y multiplexer 2610, and the carry-in signal CIN is to be subtracted from the Z input from multiplexer 2615, then subtract signal SUB is asserted. The result is:
Result=[Z−(X+Y+Cin)] (8)
The full adders in adder 1719, as will be further described in relation to
Equation 9 shows that subtraction can be done by inverting Z (one's complement) and adding it to the sum of (X+Y+Cin) and then inverting (one's complement) the result.
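The identity described above, in which Z is inverted, added to (X+Y+Cin), and the sum inverted, can be checked with the following sketch. It assumes 48-bit unsigned words with wraparound; the names WIDTH, MASK, invert, and subtract_by_inversion are hypothetical.

```python
WIDTH = 48
MASK = (1 << WIDTH) - 1

def invert(value):
    """One's complement of a WIDTH-bit value."""
    return ~value & MASK

def subtract_by_inversion(z, x, y, cin):
    """Compute Z - (X + Y + Cin) modulo 2**WIDTH using inversion and addition only."""
    total = (invert(z) + x + y + cin) & MASK   # add inverted Z to (X + Y + Cin)
    return invert(total)                        # invert the sum to obtain the result
```

For example, subtract_by_inversion(100, 30, 20, 1) returns 49, the same value as 100−(30+20+1).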
There are two types of counters, i.e., a (11,4) counter and a (7,3) counter. The counters count the number of ones in the input bits. Hence a (11,4) counter has 11 1-bit inputs that contain up to 11 logic ones and the number of ones is indicated by a 4-bit output (0000 to 1011). Similarly a (7,3) counter has 7 1-bit inputs that can have up to 7 ones and the number of ones is indicated by a 3-bit output (000 to 111).
There are two types of compressors, i.e., a (4,2) compressor and a (3,2) compressor, where each compressor has one or more adders. The (4,2) compressor has five inputs, i.e., four external inputs and a carry bit input (Cin), and three outputs, i.e., a sum bit (S) and two carry bits (C and Cout). The output bits, S, C, and Cout, represent the sum of the 5 input bits, i.e., the four external bits plus Cin. The (3,2) compressor has four inputs, i.e., three external inputs and a carry bit input (Cin), and three outputs, i.e., a sum bit (S) and two carry bits (C and Cout). The output bits, S, C, and Cout, represent the sum of the 4 input bits, i.e., the three external bits plus Cin.
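One conventional way to build a (4,2) compressor is from two full adders, as in the following sketch. This construction is an assumption made for illustration and is not necessarily the circuit of the described embodiment; the function names are hypothetical. Both carry outputs have twice the weight of the sum bit.

```python
from itertools import product

def full_adder(a, b, c):
    """Return (sum, carry) for three one-bit inputs."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def compressor_4_2(x1, x2, x3, x4, cin):
    """Return (S, C, Cout) such that x1+x2+x3+x4+cin == S + 2*(C + Cout)."""
    intermediate, cout = full_adder(x1, x2, x3)   # Cout does not depend on cin
    s, c = full_adder(intermediate, x4, cin)
    return s, c, cout

def check_all_inputs():
    """Exhaustively confirm the weighting of the three output bits."""
    for bits in product((0, 1), repeat=5):
        s, c, cout = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (c + cout):
            return False
    return True
```

Because Cout is independent of Cin in this arrangement, a row of such compressors can pass Cout to the Cin of the neighboring bit position without a rippling carry chain.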
The partial products PP2 and PP1 are transferred via 36-bit buses 2642 and 2644 from compressors 2640 to register bank 1730. With reference to
In an exemplary embodiment the Modified Booth Encoder/Mux 2520 of
The Booth encoder converts the multiplier from a base 2 form to a base 4 form. This reduces the number of partial products by a factor of 2, e.g., in our example from 18 to 9 partial products. For illustration purposes, let X=xm−1, xm−2, . . . , x0, be a binary m-bit number, where m is a positive even number. Then the m-bit multiplier may be written in two's-complement form as:
where xi=0,1
An equivalent representation of X in base four is given by:
where x−1=0 and di may have a value from the set of {−2,−1,0,1,2}.
If the multiplicand has n bits then the XY product is given by:
Pi represents the value X shifted and/or negated according to the value of di. There are m/2 partial products Pi where each partial product has at least n bits. In the case of
For the purposes of illustration let the multiplier be X, where X=QA[0:17] and let Y be the multiplicand, where Y=QB[0:17]. A property of the modified Booth algorithm is that only three bits are needed to determine di. The 18 bits of X are given by x2i+1, x2i, and x2i−1, where i=0, 1, . . . 8. We define x−1=0. For each i, three bits x2i+1, x2i, and x2i−1 are used to determine di by using table 7 below:
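As a hedged illustration, the following sketch uses the standard modified Booth (radix-4) recoding, di = x2i−1 + x2i − 2·x2i+1, which is assumed here to correspond to the table referred to above. The function names booth_digits and booth_multiply are hypothetical, and Python integers stand in for the 18-bit two's complement operands.

```python
def booth_digits(x, m=18):
    """Recode an m-bit two's-complement multiplier into m/2 radix-4 digits."""
    bits = [(x >> i) & 1 for i in range(m)]       # x_0 .. x_(m-1)
    digits = []
    for i in range(m // 2):
        x_lo = bits[2 * i - 1] if i > 0 else 0    # x_(2i-1), with x_(-1) = 0
        x_mid = bits[2 * i]
        x_hi = bits[2 * i + 1]
        digits.append(x_lo + x_mid - 2 * x_hi)    # d_i in {-2, -1, 0, 1, 2}
    return digits

def booth_multiply(x, y, m=18):
    """Sum the m/2 partial products d_i * y * 4**i."""
    return sum(d * y * 4 ** i for i, d in enumerate(booth_digits(x, m)))
```

An 18-bit multiplier yields nine digits, and therefore nine partial products, each of which is zero, plus or minus the multiplicand, or plus or minus twice the multiplicand.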
Because the partial products are in two's complement form, to obtain the correct value for the sum of the partial products, each partial product would require sign extension. However, the sign extension increases the circuitry needed to multiply two numbers. A modification to each partial product by inverting the most significant bit, e.g., p0 at bit 18 becomes p0_b, and adding a constant 10101010 . . . 101011 starting at the 18th bit, i.e., adding 1 to bit 18 and adding 1 to the right of each partial product, reduces the circuitry needed (more explanation is given in the published paper “Algorithms for Power Consumption Reduction and Speed Enhancement in High-Performance Parallel Multipliers”, by Rafael Fried, presented at the PATMOS'97 Seventh International Workshop Program in Belgium on Sep. 8-10, 1997, which is herein incorporated by reference).
With reference to
Symmetric functions are based on combinations of n variables taken k at a time. For example, for three letters in CAT (n=3), there are three two-letter groups (k=2): CA, CT, and AT. Note order does not matter. Two types of symmetric functions are defined: the XOR-symmetric function {n,k} and OR-symmetric function [n,k]. Given n Boolean variables: X1,X2, . . . , Xn, the XOR-symmetric function {n,k}, is a XORing of products where each product consists of k of the n variables ANDed together and the products include all distinct ways of choosing k variables from n. The OR-symmetric function [n,k], is an ORing of products where each product consists of k of the n variables ANDed together and the products include all distinct ways of choosing k variables from n.
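The two definitions above translate directly into code. The following sketch enumerates every way of choosing k of the n variables and is intended only to illustrate the definitions, not an efficient circuit; the function names are hypothetical.

```python
from itertools import combinations
from functools import reduce

def and_all(bits):
    """AND together a group of one-bit values."""
    return reduce(lambda a, b: a & b, bits, 1)

def or_symmetric(bits, k):
    """[n,k]: OR of every k-variable AND term; 1 when at least k inputs are 1."""
    return int(any(and_all(group) for group in combinations(bits, k)))

def xor_symmetric(bits, k):
    """{n,k}: XOR of every k-variable AND term."""
    return reduce(lambda a, b: a ^ b,
                  (and_all(group) for group in combinations(bits, k)), 0)
```

For three inputs, {3,1} is the parity of the inputs and [3,2] is one whenever at least two inputs are one, so together they give the two-bit count of ones.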
Examples of XOR-symmetric and OR-symmetric functions for the counter result bits, i.e., S1 and S2, of the (3,2) counter are:
The symmetric functions for the (7,3) counter are (where the superscript c means the ones complement, i.e., the bits are inverted):
The symmetric functions for the (15,4) counter are:
A divide and conquer methodology is used to implement the (7, 3) and (15,4) symmetric functions. The methodology is based on Chu's identity for elementary symmetric functions:
Chu's identity allows large combinatorial functions to be broken down into a sum of products of smaller ones. As an example, consider the four Boolean variables: X1, X2, X3, and X4. To compute [4,2], two groups of variables, e.g., group 0=(X1, X2) and group 1=(X3, X4), are taken one at a time and these two groups of variables are then taken two at a time:
Hence with r=s=2 and n=2 and using Chu's identity above:
[4,2]=[2,1]0[2,1]1+[2,2]0+[2,2]1
The eight inputs into the (7,3) counter are first grouped into four groups of two elements each, i.e., (X1,X2), (X3,X4), (X5,X6), (X7,X8), where X8=0. For the first group of (X1,X2), denoted by the subscript 0 in
[2,1]0=X1+X2
[2,2]0=X1X2
For the second group of (X3,X4), denoted by the subscript 1 in
[2,1]1=X3+X4
[2,2]1=X3X4
There are similar equations for (X5,X6) and (X7,X8). Next the first two groups of the four groups of two are input into a first group of four (subscript 0). The second two groups of the four groups of two are input into a second group of four (subscript 1). As computation of the second group of four is similar to the first group of four, only the first group of four is given:
[4,1]0=[2,1]0+[2,1]1
[4,2]0=[2,1]0[2,1]1+[2,2]0+[2,2]1
[4,3]0=[2,1]0[2,2]1+[2,1]1[2,2]0
[4,4]0=[2,2]0[2,2]1
Next the two groups of four are combined to give the final count:
[8,4]=[4,1]0[4,3]1+[4,2]0[4,2]1+[4,3]0[4,1]1+[4,4]0+[4,4]1
[8,2]=[4,1]0[4,1]1+[4,2]0+[4,2]1
[8,6]=[4,2]0[4,4]1+[4,3]0[4,3]1+[4,4]0[4,2]1
Since X8=0 and [4,4]1=0,
[7,4]=[4,1]0[4,3]1+[4,2]0[4,2]1+[4,3]0[4,1]1+[4,4]0
[7,2]=[4,1]0[4,1]1+[4,2]0+[4,2]1
[7,6]=[4,3]0[4,3]1+[4,4]0[4,2]1
Hence,
S3=[7,4]
S2=[7,2][7,4]c+[7,6]
S1={7,1}
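The divide and conquer construction of the (7,3) counter can be exercised with the following sketch, which follows the group-of-two, group-of-four, and final equations given above, using "|" for OR and "&" for AND. The function names are hypothetical, and the complement [7,4]c is modeled as 1 − [7,4].

```python
from itertools import product

def pair(a, b):
    """[2,1] and [2,2] for one group of two inputs."""
    return {1: a | b, 2: a & b}

def merge_pairs(g0, g1):
    """[4,1] through [4,4] from two groups of two."""
    return {
        1: g0[1] | g1[1],
        2: (g0[1] & g1[1]) | g0[2] | g1[2],
        3: (g0[1] & g1[2]) | (g1[1] & g0[2]),
        4: g0[2] & g1[2],
    }

def counter_7_3(x):
    """Return (S3, S2, S1) for seven one-bit inputs, with X8 tied to zero."""
    x = list(x) + [0]
    q0 = merge_pairs(pair(x[0], x[1]), pair(x[2], x[3]))
    q1 = merge_pairs(pair(x[4], x[5]), pair(x[6], x[7]))
    s7_4 = (q0[1] & q1[3]) | (q0[2] & q1[2]) | (q0[3] & q1[1]) | q0[4]
    s7_2 = (q0[1] & q1[1]) | q0[2] | q1[2]
    s7_6 = (q0[3] & q1[3]) | (q0[4] & q1[2])
    s3 = s7_4
    s2 = (s7_2 & (1 - s7_4)) | s7_6
    s1 = 0
    for bit in x:
        s1 ^= bit          # {7,1} is the parity of the inputs
    return s3, s2, s1
```

Reading (S3, S2, S1) as a binary number with S3 as the most significant bit gives the number of ones among the seven inputs.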
The symmetric functions for the (15,4) counter are divided into two parts. The two most significant bits (MSBs), e.g., S3 and S4 are computed using an OR symmetric function (AND-OR and NAND-NAND logic) and the two least significant bits (LSBs), e.g., S1 and S2, are computed using an XOR symmetric function.
The
For the MSBs the groups of two and four are constructed similarly to the (7,3) counter and the description is not repeated. The group of 8 is:
[8,1]=[4,1]0+[4,1]1
[8,2]=[4,1]0[4,1]1+[4,2]0+[4,2]1
[8,3]=[4,3]0+[4,3]1+[4,2]0[4,1]1+[4,2]1[4,1]0
[8,4]=[4,4]0+[4,4]1+[4,3]0[4,1]1+[4,1]0[4,3]1+[4,2]0[4,2]1
[8,5]=[4,4]0[4,1]1+[4,1]0[4,4]1+[4,2]0[4,3]1+[4,3]0[4,2]1
[8,6]=[4,2]0[4,4]1+[4,4]0[4,2]1+[4,3]0[4,3]1
[8,7]=[4,3]0[4,4]1+[4,4]0[4,3]1
[8,8]=[4,4]0[4,4]1
The final sums S3 and S4 for the MSBs are:
S4=[15,8]
S3=(([15,8]+[15,4]c)[15,12]c)c=[15,4][15,8]c+[15,12]
A more detailed description of the compressor block 2640 of
Referring back to
With reference to
When subtracting, the 1-bit full adder 3610 implements the equation Zc+(X+Y) which produces S and C for subtraction by inverting Z, i.e., Zc. To produce the subtraction result the output of the CLA 3620 is inverted in XOR gate 3622 prior to being stored in register bank 1755.
The carry-lookahead adder (CLA) 3620 in one embodiment receives the sum bits S[0:47] and Carry bits C[0:47] from the full adders 3610 in
The carry-lookahead adder is a form of carry-propagate adder that pre-computes the carry before the addition. Consider a CLA having inputs, e.g., a(n) and b(n), then the CLA uses a generate (G) signal and a propagate (P) signal to determine whether a carry-out will be generated. When G is high then the carry in for the next bit is high. When G is low then the carry in for the next bit depends in part on if P is high. The foregoing relationships can be easily seen by looking at the equations for a 1-bit carry lookahead adder:
G(n)=a(n) AND b(n)
P(n)=a(n) XOR b(n)
Carry(n+1)=G(n) OR (P(n) AND Carry(n))
Sum(n)=P(n) XOR Carry(n)
where n is the nth bit.
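The four one-bit equations above can be exercised with the following sketch, which evaluates them one bit position at a time. A hardware carry-lookahead adder flattens the carry recurrence so that the carries are available in parallel; the loop here is only a functional model. The function name and its parameters are hypothetical.

```python
def add_with_generate_propagate(a, b, carry_in=0, width=48):
    """Add two width-bit values using per-bit generate and propagate signals."""
    carry = carry_in
    total = 0
    for n in range(width):
        a_n = (a >> n) & 1
        b_n = (b >> n) & 1
        g = a_n & b_n                    # G(n) = a(n) AND b(n)
        p = a_n ^ b_n                    # P(n) = a(n) XOR b(n)
        total |= (p ^ carry) << n        # Sum(n) = P(n) XOR Carry(n)
        carry = g | (p & carry)          # Carry(n+1) = G(n) OR (P(n) AND Carry(n))
    return total, carry
```

The function returns the width-bit sum together with the carry out of the most significant bit.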
In general, for a conventional fast carry look ahead adder the generate function is given by:
Gn−1:0=Gn−1:m+Pn−1:mGm−1:0
where Pn−1:m=pn−1pn−2 . . . pm
where pi=ai⊕bi
In order to improve the efficiency of a conventional CLA, the generate function is decomposed as follows:
Gn−1:0=Dn−1:m[Bn−1:m+Gm−1:0]
where Dn-1:m=Gn-1:m+1+pn−1pn−2 . . . pm
where Bn-1:m=gn−1+gn−2+ . . . +gm
where gi=aibi and pi=ai⊕bi
Other decompositions for G are:
Gn−1:0=Gn−1:m+Pn−1:mGm−1:0
Gn−1:0=Dn−1:mKn−1:0
Gn−1:0=Dn−1:m[Bn−1:i+Gi−1:k+Bk−1:m+Gm−1:0]
Gn−1:0=Dn−1:m[Bn−1:m+Gm−1:k′+Pm−1:iDi−1:jPj−1:k′Gk′−1:0]
An example of the new generate function G4:0 for n=4 and m=2 is:
G4:0=g4+p4g3+p4p3g2+p4p3p2g1+p4p3p2p1g0
a.=p4[g4+g3+p3g2+p3p2g1+p3p2p1g0] (since gipi=gi)
b.=[g4+p4p3] [g4+g3+g2+p2g1+p2p1g0]
c.=[g4+p4g3+p4p3p2] ([g4+g3+g2]+[g1+p1g0])
d.=[D4:2] ([B4:2]+[G1:0])
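The factored form of G4:0 can be checked exhaustively with the following sketch. It assumes that the propagate term is the inclusive OR of the operand bits, pi=ai+bi, because the step relying on gipi=gi holds only for that form of propagate. The function names are hypothetical.

```python
from itertools import product

def group_generate(g, p, hi, lo):
    """G(hi:lo): carry out of bits lo..hi with no carry in."""
    carry = 0
    for i in range(lo, hi + 1):
        carry = g[i] | (p[i] & carry)
    return carry

def decomposed_generate(g, p):
    """G(4:0) written as D(4:2) AND (B(4:2) OR G(1:0))."""
    d_4_2 = group_generate(g, p, 4, 3) | (p[4] & p[3] & p[2])
    b_4_2 = g[4] | g[3] | g[2]
    return d_4_2 & (b_4_2 | group_generate(g, p, 1, 0))

def decomposition_holds():
    """Compare both forms for every pair of 5-bit operands."""
    for a_bits in product((0, 1), repeat=5):
        for b_bits in product((0, 1), repeat=5):
            g = [x & y for x, y in zip(a_bits, b_bits)]
            p = [x | y for x, y in zip(a_bits, b_bits)]   # inclusive-OR propagate (assumed)
            if group_generate(g, p, 4, 0) != decomposed_generate(g, p):
                return False
    return True
```

Under that assumption, decomposition_holds() compares the conventional and factored expressions for all 1,024 operand pairs.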
Using the new decomposition of G, we next define a K signal analogous to the G signal and a Q signal analogous to the P signal. The correspondence between the G and P functions and the K and Q functions is given in tables 8 and 9 below:
The K signal is related to the G signal by the following equation:
Kn−1:0=Bn−1:m+Gm−1:0
Assuming n−1>i>k>m>k′>m′>0, where n, i, k, m, k′, m′ are positive numbers, then:
K2=Bn−1:i+Gi−1:k
K1=Bk−1:m+Gm−1:k′
K0=Bk′−1:m′+Gm′−1:0
The Q signal is related to the P signal by the following equation:
Qn−1:0=Pn−1:m·Dm−1:0
where D can be expressed as:
Dn−1:0=Gn−1:m+Pn−1:mDm−1:0
Dn−1:0=Dn−1:m[Bn−1:m+Dm−1:0]
Hence, for example:
Q2=Pn−1:iDi−1:k
Q1=Pk−1:mDm−1:k′
Q0=Pk′−1:m′Dm′−1:0
The final sum for the 48-bit CLA 3620 is given by:
Sn=an⊕bn⊕Gn−1:0, where n=4, 8, 12 . . . or 44
and where Gn−1:0=Dn−1:mKn−1:0
Sn+d+1=an+d+1⊕bn+d+1⊕Gn+d:0, where d=0, 1 or 2
Adder designs, including the CLA and the full adders shown in
While the present invention has been described in connection with specific embodiments, variations of these embodiments will be obvious to those of ordinary skill in the art. Therefore, the spirit and scope of the appended claims should not be limited to the foregoing description.
This patent application claims priority to and incorporates by reference the U.S. provisional application Ser. No. 60/533,280, entitled “Programmable Logic Device with Cascading DSP Slices”, by James M. Simkins, et al., filed Dec. 29, 2003.
Number | Date | Country |
---|---|---|
20050144212 A1 | Jun 2005 | US |
Number | Date | Country |
---|---|---|
60533280 | Dec 2003 | US |