The present invention relates to the field of reconfigurable logic devices. More specifically, the present invention relates to the use of Arithmetic Logic Units (ALUs) and Look-up-Tables (LUTs) in reconfigurable logic devices.
A reconfigurable logic device typically comprises an array consisting of multiple instances of a basic processing element (often referred to as a “CLB” (for Configurable Logic Block), or a “tile”), together with a routing network connecting the tiles together (disclosed in, for example, U.S. Pat. No. 6,353,841 and US2002/0157066). Other functional blocks may also be included in the device, which functional blocks may be used to perform dedicated functions.
Two classes of reconfigurable logic devices are LUT-based Field Programmable Gate Arrays (FPGAs) and ALU arrays.
LUT-based FPGAs use Look-up-Tables (LUTs), each of which is a small memory used to store the truth table of a Boolean function. LUTs typically have a small number of single-bit inputs (usually between 3 and 6), and produce a single-bit output.
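By way of illustration only, the behaviour of such a LUT can be sketched in software. The following Python model is a minimal sketch and not part of the disclosed circuit; the names `make_lut` and `majority`, and the convention that input 0 is the least significant index bit, are assumptions made for the example.

```python
def make_lut(truth_table_bits):
    """Return a function modelling a LUT whose memory holds the given bits.

    truth_table_bits[i] is the output for the input combination whose
    binary encoding is i (input 0 is taken as the least significant bit).
    """
    size = len(truth_table_bits)
    num_inputs = size.bit_length() - 1
    if size < 2 or size != 1 << num_inputs:
        raise ValueError("truth table length must be a power of two")

    def lut(*inputs):
        if len(inputs) != num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for position, bit in enumerate(inputs):
            index |= (bit & 1) << position   # build the memory address
        return truth_table_bits[index]       # read the stored output bit

    return lut


# Hypothetical example: a 3-input majority function stored in 8 memory bits.
majority = make_lut([0, 0, 0, 1, 0, 1, 1, 1])
```

Changing only the stored bits changes the function implemented, which is why a LUT can realise any Boolean function of its inputs.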
In ALU arrays, the basic processing element is a circuit (ALU) capable of implementing arithmetic functions (normally Add and Subtract, as well as occasionally Multiplication), comparison functions (Equals, NotEquals) and logic functions (such as bitwise AND, OR, XOR and NOT). ALUs typically have 2 word-wide inputs, and a single-bit carry output. Word lengths vary, with the smallest common value being 4 bits. Other common values are 8 or 32 bits.
Each of the above reconfigurable processing devices has its own advantages. For example, LUT-based devices tend to be more flexible, as they can implement any Boolean function of their input, whilst ALU-based devices are generally faster when implementing typical operations of word-wide data.
Thus, it would be advantageous to have a system which provides both ALU and LUT functionality. The disadvantage of such a system however is that it requires a large amount of routing resources in order to have the LUTs and ALUs work together. Moreover, adding these independent ALUs and LUTs results in an array which has an area that comprises the sum of the areas of these separate components.
Accordingly, an object of the present invention is to combine ALU and LUT functionality in a reconfigurable logic device such that the resulting circuit does not unduly burden the logic device's routing network. Another object of the present invention is to share components between ALUs and LUTs in order to reduce total area.
In order to solve the problems associated with the prior art, the present invention provides a combinatorial processing element used in a reconfigurable logic device having a plurality of processing elements interconnected by way of a routing network, the combinatorial processing element including:
an arithmetic logic unit, having at least one input;
a multiplexer tree, having a data input; and
a memory device,
wherein the processing element is arranged such that the memory can be connected to the data input of the multiplexer tree and/or the at least one input of the arithmetic logic unit.
Preferably, the combinatorial processing element further comprises:
an input arranged to be connected to the routing network of the reconfigurable device.
Preferably, the at least one input of the arithmetic logic unit is an N-bit input;
the multiplexer tree further comprises M select inputs and 2^M data inputs, the multiplexer tree being arranged to select any of the 2^M data inputs; and
the memory device is an N-bit memory device arranged to be connected to the N-bit input of the ALU and/or to N of the 2^M data inputs of the multiplexer tree.
Preferably, N is smaller than or equal to one half of 2^M and the combinatorial processing element further comprises:
a plurality of memory devices, wherein each of the plurality of memory devices is arranged to be connected to a separate input of the arithmetic logic unit and/or separate data inputs of the multiplexer tree.
Preferably, the at least one input of the arithmetic logic unit is an N-bit input;
the multiplexer tree comprises M select inputs and an N-bit data input, the multiplexer tree being arranged to select one bit of the N-bit data input; and
the memory device is an N-bit memory device arranged to be connected to the N-bit input of the ALU and/or to N of the 2^M data inputs of the multiplexer tree.
Preferably, the combinatorial processing element further comprises:
at least one N-bit input connected to the routing network of the reconfigurable logic device.
Preferably, the sum of N-bit inputs of the ALU and N-bit inputs of the multiplexer tree is more than the number of N-bit inputs connected to the routing network of the reconfigurable logic device.
Preferably, the memory devices are registers which are connected to the routing network of the reconfigurable logic device.
The present invention further provides a reconfigurable logic device which comprises:
a combinatorial processing element in accordance with any one of the preceding claims.
Preferably, at least one combinatorial processing element is arranged to provide a gateway between a single-bit routing network and a multi-bit routing network in the reconfigurable logic device.
As will be appreciated, the present invention provides several advantages over the prior art. For example, because a single local memory is used for both the LUT and the ALU, it is possible to combine the functionality of these devices without using up valuable routing resources. Moreover, and as a consequence of having the LUT and ALU use the same local memory resource, the combined operation of the LUT and ALU can be executed at much higher speeds than those exhibited by a circuit configured to combine a LUT and an ALU across the routing network of a reconfigurable logic device. Also, the sharing of constants between LUTs and ALUs avoids the need for separate storage for LUT constants and ALU input constants, or for extra registers elsewhere in the array to optionally store constants. Furthermore, the ability to use the multiplexer tree as either LUT or bit extraction circuit reduces the number of dedicated bit extraction circuits needed.
Specific embodiments of the present invention will now be described with reference to the accompanying drawings, in which:
Because the LUT 10 stores a truth table directly, it can implement any Boolean function of its inputs. This makes LUT-based architectures particularly advantageous when implementing applications that can be decomposed into a number of complex functions of a small number of inputs. A small state machine with a complex set of transitions between the states is an example of such an application.
LUT-based architectures are however not particularly efficient at implementing functions with considerably more inputs than a basic LUT provides. For example, the output of the most-significant bit of a 32-bit adder depends on all bits of both 32-bit inputs (64 bits in total). LUT-based architectures therefore often contain extra logic to try to improve carry propagation for arithmetic functions.
By contrast, ALUs are circuits specifically designed for processing word-based data. A typical ALU has two word-wide inputs, and one word-wide output. It may also have a small number of single-bit inputs, and a similar number of single-bit outputs. These single-bit inputs and outputs are used to pass control signals between ALUs. For example, one ALU may perform a comparison function, and the result is used to control another ALU that is acting as a multiplexer. The functions that an ALU can perform are described in terms of the way that they transform the input words, rather than their effect on the individual bits. For example, the function of an ALU can be described as “add”, “subtract” or “test for equality”.
An ALU may however only provide a small number of functions, such as those listed in the table of
What the applicant has realised is that when comparing ALUs and LUTs in greater detail, it is possible to find certain complementary properties. For example, LUTs efficiently implement arbitrary functions of a small number of unstructured input bits, but are significantly less efficient when dealing with functions with a large number of inputs. Conversely, ALUs efficiently implement a small number of functions of word-wide data. In essence, they exploit knowledge of the structure of the input data (i.e. its organisation as words) to provide a compact implementation of an important subset of the complete list of possible functions. ALUs are less efficient when the data lacks this kind of structure, or uses functions outside the chosen subset.
One further difference between LUTs and ALUs relates to the way that they use constants in a circuit design. In a LUT-based architecture, constants can always be optimised away. For instance a comparison to a constant:
A=B[3:0]==4'b1101;
A=(B[3]==1)&(B[2]==1)&(B[1]==0)&(B[0]==1);
A=!(B[3]^1)&!(B[2]^1)&!(B[1]^0)&!(B[0]^1);
A=B[3]&B[2]&!B[1]&B[0];
The result of this is an arbitrary function of a group of input bits, which function is the type which can easily then be mapped into one or more LUTs.
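The equivalence of the first and last expressions above can be checked exhaustively, and the resulting truth table is exactly what would be loaded into a 4-input LUT. The following Python sketch is illustrative only; the function names and the `lut_contents` list are assumptions introduced for the example.

```python
CONSTANT = 0b1101  # the constant from the comparison A = (B[3:0] == 4'b1101)


def equals_constant(b):
    """Word-level form of the test: compare the low four bits to the constant."""
    return int((b & 0xF) == CONSTANT)


def reduced_form(b):
    """Bit-level form after the constant has been optimised away."""
    b3, b2, b1, b0 = (b >> 3) & 1, (b >> 2) & 1, (b >> 1) & 1, b & 1
    return b3 & b2 & (b1 ^ 1) & b0   # B[3] & B[2] & !B[1] & B[0]


# Truth table for a 4-input LUT: entry i is the output when B[3:0] == i.
lut_contents = [equals_constant(b) for b in range(16)]
```

In this sketch the constant no longer appears as a separate stored value; it is absorbed into the sixteen truth-table bits.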
In an ALU-based architecture, the implementation of the above example is different. For an ALU-based circuit, the equality test would be mapped onto an ALU implementing an EQUALS operation and, separately, a constant would be created and stored in a register in the array. The circuit would then compare a word-wide first input of the ALU with the input which is connected to the register. Accordingly, an ALU-based architecture has a greater need for registers to store these constants, than does a LUT-based architecture.
As mentioned above, ALU-based architectures process words rather than individual bits. It is however sometimes necessary to access individual bits within a word. Therefore, an ALU-based architecture needs some way to test and/or set individual bits within a word. This can be done either by extending (i.e. adding additional instructions) the ALU to include such test and set operations, or by including separate logic for such purposes.
In order to create a hybrid architecture of ALUs (for processing word-based data) and LUTs (for processing unstructured data), the prior art teaches towards having a group of ALUs and a separate group of LUTs having control signals passing back and forth between the separate groups. Contrary to this approach, the present invention integrates a LUT and an ALU into a single integrated unit, which does not require external routing in order to operate.
The constant memory is therefore usable as either a constant input to the ALU, or as the Boolean function store for the LUT.
As will be appreciated by the skilled reader, the above described circuit can operate in several different ways. For example, the circuit can operate as an ALU with externally supplied inputs InA and InB, and a LUT with locally stored data. Furthermore, the circuit can operate as an ALU with a constant input and externally supplied input InA, and a multiplexer tree that can select a bit from word-wide input InC. Moreover, the circuit can also operate as an ALU with externally supplied inputs InA and InC, and a multiplexer tree that can select a bit from a word-wide input. There may also be circumstances where the same constant value is needed by both the ALU and the LUT, so that it is possible to combine an ALU (with a constant input) and a LUT together. Providing this flexibility in a local area is a major advantage of the invention.
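The sharing of one local constant between the ALU and the multiplexer tree can be sketched behaviourally as follows. This is a simplified model under stated assumptions, not the disclosed circuit: the 4-bit wordlength, the two ALU operations, the two-bit `select` value and the two boolean configuration flags are all hypothetical choices made for the example.

```python
WORD_BITS = 4                      # assumed ALU wordlength
MASK = (1 << WORD_BITS) - 1


def alu(op, a, b):
    """Return (result word, single-bit output) for a tiny assumed ALU."""
    if op == "add":
        total = (a & MASK) + (b & MASK)
        return total & MASK, total >> WORD_BITS      # sum and carry out
    if op == "equals":
        return 0, int((a & MASK) == (b & MASK))      # flag on the carry output
    raise ValueError(f"unsupported operation: {op}")


def mux_tree(data_word, select):
    """Select one bit of data_word; with constant data this acts as a LUT."""
    return (data_word >> select) & 1


def processing_element(op, in_a, in_b, in_c, select, constant,
                       alu_uses_constant, mux_uses_constant):
    """Route the shared constant to the ALU, the multiplexer tree, or both."""
    alu_second_input = constant if alu_uses_constant else in_b
    mux_data = constant if mux_uses_constant else in_c
    word, flag = alu(op, in_a, alu_second_input)
    return word, flag, mux_tree(mux_data, select)
```

With `mux_uses_constant` set, the multiplexer tree behaves as a LUT over the select inputs; with it cleared, the same tree extracts a bit from the external word. Setting both flags models the case where the ALU and the LUT need the same constant value.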
As will now be described, the present invention can take one of three basic forms, depending on the relative widths of the LUT constant store, and the ALU wordlength.
The first basic form is where the ALU wordlength is less than the LUT memory size. This situation is shown in
In a situation where more than one wordlength-sized group is present, it is possible to add optional constants to more than one ALU input in the manner shown in
The second option sees the addition of constants to inputs of more than one ALU, as shown in
As will also be appreciated by the skilled reader, it is possible to combine these options, and have multiple constants connecting to each of multiple ALUs. It is also possible for a single constant to connect to multiple ALUs.
The second basic form of the present invention is where the ALU wordlength is equal to the LUT memory size. This situation is shown in
Finally, the third basic form of the present invention is when the ALU wordlength is greater than the LUT memory size. This situation is shown in
As will be appreciated from the above description, the most flexible structure is the first, where the LUT memory size is greater than the ALU wordlength, and the wordlength is a factor of the memory size. The applicant has realised that the preferred size of LUT is one with between 3 and 6 inputs, i.e. needing between 8 and 64 memory bits. In turn, this implies that the invention is best used with ALUs with sizes that are smaller than this.
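The relationship between LUT inputs, memory bits and wordlength-sized groups can be illustrated with a short sketch. The function `split_lut_memory` is a hypothetical helper, and the convention that the first group occupies the least significant bits is an assumption made for the example.

```python
def lut_memory_bits(num_lut_inputs):
    """A LUT with k inputs needs 2**k memory bits."""
    return 1 << num_lut_inputs


def split_lut_memory(memory, num_lut_inputs, wordlength):
    """Split the LUT memory into wordlength-sized constants for ALU inputs."""
    memory_bits = lut_memory_bits(num_lut_inputs)
    if wordlength > memory_bits or memory_bits % wordlength != 0:
        raise ValueError("wordlength must be a factor of the LUT memory size")
    mask = (1 << wordlength) - 1
    return [(memory >> shift) & mask
            for shift in range(0, memory_bits, wordlength)]
```

For instance, a 3-input LUT has 8 memory bits, which in this sketch yields two 4-bit groups, each usable as an optional constant for a 4-bit ALU input.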
The present invention can be used advantageously in a great many situations, one of which is shown in
Arrays with separate word-wide and single-bit routing networks are known from the prior art. In such an array, the circuit of the present invention is sufficient to provide gateways from single-bit to multi-bit routing, and from multi-bit to single-bit routing. As can be seen from
Alternative embodiments of the present invention will now be described with reference to
The functional advantage of this embodiment is the increased design flexibility it provides. The disadvantage however is that the register cell is larger than a constant cell. Therefore, this extension is typically only advantageously used in designs that require large numbers of registers, for instance for a high-speed design that requires a large number of registers to “pipeline” it. As will be appreciated by the skilled reader, “pipelining” is a method used to increase the operating frequency of an application by inserting added registers into the application in such a way that the length (delay) of the longest combinatorial path is reduced. Although the resulting circuit has a higher operating frequency, it also has a longer delay (in terms of clock cycles) and requires the use of extra registers.
Another alternate embodiment of the present invention is shown in
Whilst this embodiment constrains the use of the ALU and multiplexer tree, since they cannot use independent external inputs, it also reduces the size of the routing network since it no longer needs to support independent connections to both ALU and multiplexer tree. This modification results in an area saving for designs that use a large number of constants, either for the ALUs, or because they contain many LUTs.
The present invention can be used in a wide variety of circuits. For example,
A particular advantage of this circuit is that it can be used to implement functions that combine the operation of ALU and LUT, as described in the following examples.
The first example is where the LUT output connects to the ALU Cin, and the ALU implements a multiplexer function. With InA/B connected to the ALU, and the constants connected to the LUT, the ALU-based multiplexer can be controlled by an arbitrary function of the LUT inputs In0, In1, In2, i.e.:
OutA=F(In0,In1,In2)?InA:InB.
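The first example can be modelled behaviourally as follows. This is an illustrative sketch only; the packing of the LUT contents into an integer (bit i holding the output for input index i, with In0 as the least significant index bit) and the chosen function F are assumptions made for the example.

```python
def lut3(contents, in0, in1, in2):
    """3-input LUT: bit i of contents is the output for input index i."""
    index = (in0 & 1) | ((in1 & 1) << 1) | ((in2 & 1) << 2)
    return (contents >> index) & 1


def lut_controlled_mux(contents, in0, in1, in2, in_a, in_b):
    """OutA = F(In0, In1, In2) ? InA : InB, with F held in the LUT contents."""
    select = lut3(contents, in0, in1, in2)   # LUT output drives the ALU Cin
    return in_a if select else in_b          # ALU acting as a multiplexer


# Hypothetical F = In0 AND (In1 OR In2): true for input indices 3, 5 and 7.
F_CONTENTS = 0b10101000
```

Here `lut_controlled_mux(F_CONTENTS, 1, 1, 0, 9, 4)` selects InA, while the same call with In1 cleared selects InB.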
The above-described first example can be advantageously used in a circuit arranged to perform saturated arithmetic, as will now be described with reference to
In the case of the addition of two signed numbers, there are two possible overflow conditions. The first overflow condition is when two positive n-bit numbers add to give a result that is larger than the most positive number that can be represented in n bits. In this case, the calculated result is replaced with the most-positive n-bit signed integer—a leading 0 followed by (n−1) 1s.
The second overflow condition is when two negative n-bit numbers add to give a result that is smaller (more negative) than the most negative number that can be represented in n bits. In this case, the calculated result is replaced with the most-negative n-bit signed integer—a leading 1 followed by (n−1) 0s.
If a positive and a negative number are summed, the result cannot overflow—it must lie in the legal range.
Instance1 of the circuit uses the ALU in order to compute the sum of A and B:
Z[n−1:0]=A[n−1:0]+B[n−1:0]
Instance2 of the circuit uses the ALU and the input constants to generate the possible saturation value. Here, the ALU is used as a multiplexer to choose between the two possible constant values, and is controlled by the sign bit (the most significant bit) of A.
Overflow_val[n−1:0]=A[n−1]?1000 . . . : 0111 . . . ;
Instance 3 of the circuit uses the LUT to determine whether an overflow has occurred, and then uses the ALU as a multiplexer to choose between the result of the initial addition and the saturation value:
Overflow=(A[n−1]==B[n−1])&(A[n−1]!=Z[n−1]);
i.e. the inputs have same sign but the output does not have the same sign.
Result=Overflow?Overflow_val:Z;
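The three instances described above can be sketched as a behavioural model. This is illustrative only: the 8-bit wordlength and the helper names are assumptions, and each stage is modelled as a plain expression rather than as a configured ALU or LUT.

```python
N = 8                                  # assumed wordlength
MASK = (1 << N) - 1
MOST_POSITIVE = (1 << (N - 1)) - 1     # leading 0 followed by (n-1) 1s
MOST_NEGATIVE = 1 << (N - 1)           # leading 1 followed by (n-1) 0s


def sign(x):
    """Most significant bit of an n-bit two's complement word."""
    return (x >> (N - 1)) & 1


def saturating_add(a, b):
    # Instance 1: ALU computes the wrapped n-bit sum.
    z = (a + b) & MASK
    # Instance 2: ALU as a multiplexer picks the candidate saturation value.
    overflow_val = MOST_NEGATIVE if sign(a) else MOST_POSITIVE
    # Instance 3: LUT detects overflow (inputs share a sign, output differs),
    # then the ALU as a multiplexer picks the final result.
    overflow = int(sign(a) == sign(b) and sign(a) != sign(z))
    return overflow_val if overflow else z


def to_signed(x):
    """Interpret an n-bit word as a signed integer, for readability only."""
    return x - (1 << N) if sign(x) else x
```

In this model, adding 100 and 100 saturates to 127, adding -128 and -1 saturates to -128, and adding 100 and -50 returns 50 unchanged.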
A second example of an advantageous circuit implemented using the present invention is where the ALU Cout connects to LUT In0, and the ALU implements an EQUALS function. With InA/B connected to the ALU, and the constants connected to the LUT, the LUT can generate an arbitrary function of the ALU Cout, and the LUT inputs In1, In2, i.e.:
Out0=F(InA==InB,In1,In2);
This type of function is a useful building block when constructing state machines, where the next state may depend on both the current state, and the values of one or more inputs. For instance, the ALU may test the inputs, while In1 and In2 are derived from the current state of the state machine.
Also, this type of connection can be used to combine multiple tests into a single result. For example, if In1 is connected (via LutIn1) to the carry output of another ALU elsewhere in the array, it becomes possible to construct more complex tests, such as:
Out0=F(InA==InB, InC<InD,In2);
where InC and InD are the inputs to the second ALU. For instance, F may be an OR of its various inputs, which allows for the construction of more complex state machines, with more complex transition conditions.
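A behavioural sketch of this combined test, with F chosen to be an OR, is given below. The integer packing of the LUT contents and the function names are assumptions made for the example; the two comparisons stand in for the carry outputs of the two ALUs.

```python
def lut3(contents, in0, in1, in2):
    """3-input LUT: bit i of contents is the output for input index i."""
    index = (in0 & 1) | ((in1 & 1) << 1) | ((in2 & 1) << 2)
    return (contents >> index) & 1


OR3_CONTENTS = 0b11111110   # F = In0 | In1 | In2: only index 0 gives 0


def combined_test(in_a, in_b, in_c, in_d, in2, contents=OR3_CONTENTS):
    """Out0 = F(InA == InB, InC < InD, In2)."""
    cout_equal = int(in_a == in_b)   # carry output of the local ALU
    cout_less = int(in_c < in_d)     # carry output of the second ALU
    return lut3(contents, cout_equal, cout_less, in2)
```

Loading different contents into the LUT changes how the individual tests are combined, without changing either ALU.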
A third example is where a combination of multiple comparisons occurs when performing an equality test function for words that are wider than the native wordlength of the ALU. Ordinarily, this would use multiple ALUs in series, linked together by connecting the Cout of one ALU to the Cin of another. However, such a comparison will fail if the partial match in any individual ALU fails. Using the connection from Cin to the LUT In1 input increases the speed of this kind of function. If Cin indicates a failure of the comparison in an earlier part of the word, this can propagate directly to the LUT output, rather than going via the ALU Cin-to-Cout circuit.
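The chained comparison can be modelled functionally as follows. This sketch captures only the logical result, not the timing benefit of the LUT bypass; the 4-bit wordlength and the convention that a carry of 1 means "all earlier slices matched" are assumptions made for the example.

```python
WORD_BITS = 4                      # assumed native ALU wordlength
SLICE_MASK = (1 << WORD_BITS) - 1


def alu_equals(a_slice, b_slice, cin):
    """One ALU stage: Cout is 1 only if Cin is 1 and the slices match."""
    return cin & int(a_slice == b_slice)


def wide_equals(a, b, total_bits):
    """Compare two wide words by chaining Cout to Cin across ALU stages."""
    carry = 1                      # nothing has failed before the first stage
    for shift in range(0, total_bits, WORD_BITS):
        carry = alu_equals((a >> shift) & SLICE_MASK,
                           (b >> shift) & SLICE_MASK,
                           carry)
    return carry
```

Once the carry becomes 0 it stays 0 through every later stage, which is the property that allows a failure to be forwarded directly to the output.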
The preceding examples connect the constants to the LUT. However, it is also possible to connect one of the stored constants to the ALU. For example, by connecting the constant store B to the ALU. Then the ALU can compare to a constant:
Cout=InA==ConstB
The LUT can then be connected to InB and constant store A. If the LUT inputs In0 and In1 are both set to constant 0, and In2 is connected to ALU Cout, then:
Out0=In2?ConstantA[0]:InB[0],
and in the case where ConstantA[0] is 1, this becomes:
Out0=In2?1:InB[0],
which is equivalent to an OR of the result of the comparison, and an external input bit. Changing the values of the constants on In0 and In1 will change the bit of InB that is used in this function.
Similarly, connecting the constant store A to the ALU, and constant store B to the LUT results in a function of the form:
Out0=In2?InA[i]:ConstB[i],
with ConstB equal to 0, it can be seen that:
Out0=In2?InA[i]:0,
which is equivalent to an AND of the result of the comparison, and an external input bit. As will be appreciated, all of the above circuits can be implemented using the basic circuit of the present invention.
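The OR and AND equivalences stated above can be confirmed exhaustively for single bits. The following sketch is illustrative only; the helper names are assumptions, and `in2` stands for the comparison result arriving on the LUT input In2.

```python
def mux_bit(select, when_one, when_zero):
    """Single-bit 2:1 multiplexer, as formed by the multiplexer tree."""
    return when_one if select else when_zero


def or_form(in2, in_b_bit):
    """Constant 1 on the selected side: In2 ? 1 : InB[0]."""
    return mux_bit(in2, 1, in_b_bit)


def and_form(in2, in_a_bit):
    """Constant 0 on the unselected side: In2 ? InA[i] : 0."""
    return mux_bit(in2, in_a_bit, 0)


# All four combinations of the comparison result and the external input bit.
cases = [(s, x) for s in (0, 1) for x in (0, 1)]
```

Evaluating both forms over `cases` reproduces the truth tables of OR and AND respectively.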
 | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP2010/055485 | Apr 2010 | US |
Child | 13552915 | | US |