This invention relates to decoders. Specifically, it relates to mapping units and configurable decoders based upon mapping units, where each device outputs more bits than are input to it.
Over time, processor speeds have increased faster than the rate at which information can enter and exit a chip. In many cases, it was found that increasing processor speed while ignoring the effects of input/output (I/O) produced little improvement—essentially, if information cannot get into or out of the chip at a fast enough rate, then increasing CPU speed diminishes in importance.
Data transfer to and from a chip can be improved by increasing the bit rate and/or the number of I/O pins. Since pins cannot be miniaturized to the same extent as transistors (pins must be physically strong enough to withstand contact), the rate at which the number of transistors on a chip has increased far outpaces the rate at which the number of pins on a chip has increased. For example, in Intel microprocessors, the number of transistors has increased by a factor of 20,000 in the last 30 years, whereas the number of pins in these chips increased merely by a factor of 30. Hence, the rate at which a chip can generate and process information is much larger than the available conduit to convey this information. The restriction imposed by the unavailability of a sufficient number of pins in a chip is called “pin limitation.”
An example of the magnitude of the problem is presented by reconfigurable architectures, in particular, integrated circuit chips such as Field Programmable Gate Arrays (FPGAs). An FPGA is an array of programmable logic elements, all of which must be configured to suit the application at hand. A typical FPGA structure consists of a two-dimensional array of configurable logic elements connected by a configurable interconnection network, such as shown in
The FPGA's interconnection network is typically a two-dimensional mesh of configurable switches. As in a CLB, each switch S represents a large bank of configurable elements. The state of all switches and elements within all CLBs is referred to as a “configuration” of the FPGA. Because there is a large number of configurable elements in an FPGA (LUTs, flip-flops, switches, etc.), a single configuration requires a large amount of information. For example, the Xilinx Virtex-5 FPGA with a 240×108 array of CLBs requires in the order of 79 million bits for a single full configuration. The FPGA's CLBs are fine-grained functional elements that are incapable of executing instructions or generating configuration bits internally. Thus, configuration information must come from outside the chip. A limited amount of configuration information can be stored in the chip as “contexts;” however, given the limited amount of memory available on an FPGA for such a purpose, an application may require more contexts than can be stored on the FPGA. Hence, in most cases, configuration information must still come from outside the chip, and the pin limited input can have severe consequences for the time needed for reconfiguration.
A number of applications benefit from a technique called dynamic reconfiguration, in which elements of the FPGA chip are reconfigured to alter their interconnections and functionality while the application is executing on the FPGA. Dynamic reconfiguration has two main benefits. First, a dynamically reconfigurable architecture can reconfigure between various stages of an application to use its resources efficiently at each stage. That is, it reuses hardware resources more efficiently across different parts of an algorithm. For example, an algorithm using two multipliers in Stage 1 and eight adders in Stage 2 can run on dynamically reconfigurable hardware that configures as two multipliers for Stage 1 and as eight adders for Stage 2. Consequently, this algorithm will run on hardware that has two multipliers or eight adders, as opposed to a non-configurable architecture that would need two multipliers and eight adders.
The second benefit of dynamic reconfiguration is a fine tuning of the architecture to exploit characteristics of a given instance of the problem. For example, in matching a sequence to a given pattern, the internal “comparator” structure can be fine-tuned to the pattern. Further, this tuning to a problem instance can also produce faster solutions.
Dynamic reconfiguration requires a fast reconfiguration scheme. Because of this, partial reconfiguration is normally performed where only a portion of the FPGA is reconfigured. Partial reconfiguration involves selecting the portion of the FPGA requiring reconfiguration (the addresses) and inputting the necessary configuration bits. Due to pin limitation, only a very coarse selection of addresses is available in a given time increment, resulting in a still substantially large number of FPGA elements being selected for reconfiguration. This implies that elements that do not need to be reconfigured must be “configured” anyway along with those that actually require reconfiguration.
In partial reconfiguration, the information entering the chip can be classified into two categories: (a) selection and (b) configuration. The selection information contains the addresses of the elements that require reconfiguration, while the configuration information contains the necessary bits to set the state of the targeted elements.
In order to facilitate partial reconfiguration, FPGAs are typically divided into sets of frames, where a frame is the smallest addressable unit for reconfiguration. In current FPGAs, a frame is typically one or more columns of CLBs. Currently, partial reconfiguration can only address and configure a single frame at a time, as a 1-hot decoder is usually employed. If we assume that each CLB receives the same number of configuration bits, say α, and the number of CLBs in each frame is the same, say C, then the number of configuration bits needed for each frame is Cα. If the number of bits needed for selecting a single frame is b, then the total number of bits B needed to reconfigure a frame is:
B=b+Cα
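As a purely illustrative calculation (the figures are hypothetical and not tied to any particular device), if a frame contains C=20 CLBs, each CLB requires α=400 configuration bits, and b=12 bits select the frame, then B=12+20×400=8012 bits must enter the chip to reconfigure that one frame.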
Since the granularity of reconfiguration is at the frame level, every CLB in a frame would be reconfigured, regardless of whether or not the application required them to be reconfigured. This can result in a “poorly-focused” selection of elements for reconfiguration, as more elements than necessary are reconfigured in each iteration. This implies that a large number of bits and a large time overhead are spent on the reconfiguration of each individual frame. If the granularity of selection is made finer, i.e., if fewer CLBs are in each frame, then the number of selection bits needed to address the frames increases by a small amount while the number of configuration bits for each frame decreases. However, since a 1-hot decoder can select only one frame per iteration, this also increases (on average) the total number of iterations necessary to reconfigure the same amount of area in the FPGA. Pin limitation thus creates a severe restriction on the extent to which an FPGA can be dynamically reconfigured.
Before we proceed further, we introduce some notation.
In general, we use the term “word” to mean a set of bits. Different words may have different numbers of bits. We also use the terms “string” and “signal” synonymously with “word.”
The O(·) notation indicates an upper bound on the “order of” and is used to describe how the size of the input data affects resources (time, cost, etc.) in an algorithm or hardware. Specifically, for two functions ƒ(n) and g(n) of a variable n, we say that ƒ(n)=O(g(n)) if and only if there is a positive constant c>0 and an integer constant n0, such that for all n≧n0, we have ƒ(n)≦cg(n). The relationship ƒ(n)=O(g(n)) signifies that the “order of” (or asymptotic complexity of) ƒ(n) is at most that of g(n), or that ƒ(n) increases at most as fast as g(n). Whereas O(·) denotes an upper bound on the complexity, Ω(·) and θ(·) denote a lower bound on, and the exact, complexity, respectively. Specifically, ƒ(n)=Ω(g(n)) if and only if g(n)=O(ƒ(n)). We say ƒ(n)=θ(g(n)) if and only if ƒ(n)=O(g(n)) and ƒ(n)=Ω(g(n)).
Parts of the invention will be described in terms of “ordered partitions.” A partition of a set A is a division of the elements of the set into disjoint non-empty subsets (or blocks). A partition π with k blocks is called a k-partition. For example, a 3-partition of the set {7,6,5,4,3,2,1,0} is {{7,6,5,4},{3,2},{1,0}}. Partitions have no imposed order. An ordered k-partition is a k-partition {S0, S1, . . . , Sk−1} with an order (from 0 to k−1) imposed on the blocks. An ordered partition will be denoted by an ordered list of blocks. For instance, a 2-partition {S0,S1} may be ordered as S0,S1 or as S1,S0, and the ordering S0,S1 is distinct from the ordering S1,S0.
A useful operation on partitions is the product of two partitions. Let π1 and π2 be two (unordered) partitions (not necessarily of the same size). Let π1={S0,S1, . . . , Sk} and π2={P0,P1, . . . , Pl}; then their product π1π2 is a partition {Q0,Q1, . . . , Qm} such that for any block Qhεπ1π2, elements a, bεQh if and only if there are blocks Siεπ1 and Pjεπ2, such that a,bεSi∩Pj. That is, two elements are in the same block of π1π2 if and only if they are in one block of π1 and in one block of π2. For instance, consider the partitions π1={{7,6,5,4},{3,2},{1,0}} and π2={{7,6},{5,4,3,2},{1,0}}. Then π1π2={{7,6},{5,4},{3,2},{1,0}}=π2π1.
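As an informal illustration (not part of the invention itself), the partition product can be computed by pairwise intersection of blocks; the following Python sketch, with names of our own choosing, reproduces the example above.

def partition_product(p1, p2):
    # p1 and p2 are lists of blocks; each block is a set of elements.
    # Two elements share a block of the product exactly when they share
    # a block of p1 and a block of p2.
    product = []
    for S in p1:
        for P in p2:
            common = S & P          # intersection of one block from each partition
            if common:
                product.append(common)
    return product

p1 = [{7, 6, 5, 4}, {3, 2}, {1, 0}]
p2 = [{7, 6}, {5, 4, 3, 2}, {1, 0}]
print(partition_product(p1, p2))    # blocks {7,6}, {5,4}, {3,2}, {1,0}, as above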
For any digital circuit, including those considered in this invention, an n-bit output can be viewed as a subset of an n-element set. Let Zn={0,1, . . . , n−1}. Consider an n-bit signal A=A(n−1)A(n−2) . . . A(0) (where A(i) is the ith bit of A; in general, we will consider bit 0 to be the least significant bit or the lsb). If A is an n-bit output signal (or word) of a digital circuit, then it can be viewed as the subset {iεZn:A(i)=1} of Zn. The n-bit string A is called the characteristic string of the above subset. The set {iεZn:A(i)=1} is said to be characterized by A and is sometimes referred to as the characteristic set. For example, if n=8, then output A=00001101 corresponds to the subset {0,2,3}. Outputs 00000000 and 11111111 correspond to the empty set, Ø, and Zn, respectively. (It should be noted that the convention could be changed to exchange the meanings of 0's and 1's. That is, a 0 (resp., 1) in the characteristic string represents the inclusion (resp., exclusion) of an element of Zn in the set. All ideas presented in this document apply also to this “active-low” convention.) Throughout this document, we assume (unless mentioned otherwise) that the base of all logarithms is 2. Consequently, we will write log n to indicate log2 n. We will also use the notation loga n to denote (log n)a.
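The correspondence between an n-bit word and its characteristic set can be made concrete with a short Python sketch (ours; the function names are illustrative only):

def word_to_subset(word):
    # word is a string of '0'/'1' characters with bit 0 (the lsb) at the right end
    n = len(word)
    return {i for i in range(n) if word[n - 1 - i] == '1'}

def subset_to_word(subset, n):
    return ''.join('1' if i in subset else '0' for i in range(n - 1, -1, -1))

print(word_to_subset('00001101'))    # {0, 2, 3}, as in the example above
print(subset_to_word({0, 2, 3}, 8))  # '00001101'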
Prior art methods to address the pin limitation problem include: (1) multiplexing, (2) storing information within the design, and (3) decoding. Multiplexing refers to combining a large number of channels into a single channel. This can be accomplished in a variety of ways depending on the technology. Each method assumes the availability of a very high speed, high bandwidth channel on which the multiplexing is performed. For example, in the optical domain, wavelength division multiplexing allows multiple signals of different wavelengths to travel simultaneously in a single waveguide. Time division multiplexing requires the multiplexed signal to be much faster than the signals multiplexed. Used blindly, this is largely useless in the FPGA setting, as it amounts to setting an unreasonably high clocking rate for parts of the FPGA.
Storing information within the design attempts to alleviate the pin limitation problem by generating most information needed for execution of an application inside the chip itself (as opposed to importing it from outside the chip). This requires a more “intelligent” chip. In an FPGA setting it boils down to an array of coarse-grained processing elements rather than simple functional blocks (CLBs). One example is the use of virtual wires in which each physical wire corresponding to an I/O pin is multiplexed among multiple logical wires. The logical wires are then pipelined at the maximum clocking frequency of the FPGA, in order to utilize the I/O pin as often as possible. Another example of such a solution is the Self-Reconfigurable Gate Array. This latter approach is a significant departure from current FPGA architectures. Yet another approach is to compress the configuration information, thereby reducing the number of bits sent into the chip.
Decoders are the third means used to address the pin limitation problem. A decoder is typically a combinational circuit that takes in as input a relatively small number of bits, say x bits, and outputs a larger number of bits, say n bits, according to some mapping; such a decoder is called an “x-to-n decoder.” If the x inputs are pins to the chip and the n outputs are expanded within the chip, a decoder provides the means to deliver a large number of bits to the interior of the chip. An x-to-n decoder (that has x input bits) can clearly produce no more than 2x output sequences, and some prior knowledge must be incorporated in the decoder to produce a useful expansion to n output bits. Decoders have also been used before with FPGAs in the context of configuration compression, where dictionary based or statistical schemes are employed to compress the stream of configuration bits. Our invention, when used in the context of FPGAs, has more application in selecting parts of the chip in a more focused way than conventional decoders do. However, in a broader context, the method we propose is a general decoder for any scheme employing fixed-size code words that decode into (larger) fixed-size target words.
As we noted earlier, for any digital circuit, including a decoder, an n-bit output can be viewed as a subset of the n-element set Zn={0,1, . . . , n−1}. Thus, the set of outputs produced by an x-to-n decoder can be represented as a set of (at most 2x) subsets of Zn.
An illustration of 3-to-8 decoders (with 3 input bits and 8 output bits) is shown in Table 1.
Sets S0, S1, S2 and S3 represent different decoders, each producing subsets of Zn. For instance, S0 corresponds to the set of subsets {{0},{1},{2}, . . . ,{7}}. This represents the 3-to-8 one-hot decoder.
Current decoders in FPGAs are fixed decoders, producing a fixed set of subsets (output bit combinations) over all possible inputs. The fixed decoder that is normally employed in most applications is the one-hot decoder that accepts a
(log2 n)-bit input and generates a 1-element subset of Zn (see set S0 in Table 1). (In subsequent discussion all logarithms will be assumed to be to base 2, that is, log n=log2 n). In fact, the term “decoder” is usually taken to mean the one-hot decoder.
A one-hot decoder causes severe problems if, in an array of n elements, some arbitrary pattern of those elements is needed for reconfiguration. Here, selecting an appropriate subset can take up to θ(n) iterations. Notwithstanding this inflexibility, one-hot decoders are simple combinational circuits with a low O(n log n) gate cost (typically given as the number of gates) and a low O(log n) propagation delay. The one-hot decoder will usually take multiple cycles or iterations to set all desired elements to the desired configuration. Thus, reconfiguration is a time consuming task in current FPGAs and consequently, they fail to fully exploit the power of dynamic reconfiguration demonstrated on theoretical models.
Look-up tables (LUTs) can function as a “configurable decoder.” A 2x×n LUT is simply a (2x)-entry table, where each entry has n bits. It can produce 2x independently chosen n-bit patterns that can be selected by an x-bit address. LUTs are highly flexible as the n-bit patterns chosen for the LUT need no relationship to each other. Unfortunately, this “LUT decoder” is also costly; the gate cost of such a LUT is O(n2x). For a gate cost of O(n log n), a LUT decoder can only produce O(log n) subsets or mappings. To produce the same number of subsets as a one-hot decoder, the LUT decoder has O(n2) gate cost. Clearly, this does not scale well.
What is needed is a configurable decoder that is an intermediary to the high flexibility, high cost LUT decoder and the low flexibility, low cost fixed decoder.
It is an object of the invention to allow the multicasting of x bits into n bits through hardwired circuitry where the hardwired route is selected by an input selection word.
It is an object of the invention to provide a device that can be incorporated into an FPGA device (or any other chip operating in a pin-limited environment) to allow for the expansion of x bits input, to the FPGA device over x pins, to be expanded into n bits (where n>x) internally in the FPGA to allow for an increase in the selection reconfiguration information to reconfigure the FPGA device.
It is an object of the invention to allow the multicasting of x bits into n bits, α bits at a time, through hardwired circuitry, where the hardwired route is selected by an input selection word from a reconfigurable memory device.
It is an object of the invention to provide a reconfigurable mapping unit in conjunction with a second reconfigurable memory unit, where the second memory unit allows for selection of the z bits to be input into the reconfigurable mapping unit from x bits, where z>x.
Accordingly, the invention includes a reconfigurable mapping unit that is a circuit, possibly in combination with a reconfigurable memory device. The circuit has as input an x-bit word having a value at each bit position, and a selector bit word, input to the circuit. The circuit outputs an n-bit word, where n>x, where the value of each bit position of the n-bit output word is based upon the value of a pre-selected hardwired one of the bit positions in the x-bit word, where said hardwired pre-selected bit positions are selected by the value of the selector bit word. The invention may include a second reconfigurable memory device that outputs the z-bit word, based upon an input x-bit word to the second memory device, where x<z. The invention may produce the n-bit output word α bits at a time.
The invention may also include a shift register that can be used as a serial-to-parallel and parallel-to-serial converter.
The invention includes a mapping unit, and a configurable decoder that incorporates a mapping unit. The mapping unit may be an integral or bit-slice mapping unit. The invention includes configurable decoder variants and methods to construct the partitions required to configure a mapping unit. We will compare the invention to existing circuits, where the comparison is in terms of performance parameters (such as circuit delays and circuit costs, as measured by the number of overall gates in a design). All parameters are expressed in terms of their asymptotic complexity to avoid minor variations due to technology and other implementation-specific details.
We assume that each instance of a gate has constant fan-in, constant fan-out, unit cost and unit delay; the fan-in and fan-out are each assumed to be at least 2 and here constant means independent of problem size. While the cost and delay of some logic gates (such as XOR) are certainly larger than those of simpler logic gates (such as NAND in some technologies), the overall number of gates in the circuit and the depth of the circuit provide a better measure of the circuit's costs and delays, rather than factors arising from choices specific to a technology and implementation. We divide the performance parameters into two categories: independent parameters and problem dependent parameters. Independent parameters are applicable to all circuits, while problem dependent parameters are specific to decoders. The calculated performance parameters are delay and gate cost. The delay or time cost of a combinational circuit is the length of the longest path from any input of the circuit to any output. The gate cost (or simply cost) of a circuit is the number of gates (AND, OR, NOT) in it. Clearly, the use of other gates such as NAND, XOR, etc. will not alter the gate cost expressed in asymptotic notation.
Here a decoder is a combinational circuit (with the exception of the bit-slice units later described), that, in order to achieve a greater degree of flexibility, can be combined with lookup tables (LUTs), to create a configurable mapping unit or a configurable decoder. While LUTs could be implemented using sequential elements, for this work, LUTs are functionally equivalent to combinational memory such as ROMs. Any type of memory could be used for a LUT.
Recall that any x-to-n decoder (including the mapping unit) takes x bits as input and outputs n bits; and the set of subsets generated by the configurable mapping unit decoder are those tailored in part for the application at hand. Different applications require different sets of subsets of Zn, and do so with different constraints on speed and cost. The reconfigurable mapping unit and configurable decoder have a portion of the hardware that can be configured (off-line) to modify the output bit pattern. This allows one to freely select a portion of the subsets produced by the mapping unit or reconfigurable decoder. Hence, given an understanding of the problem to be addressed, the mapping unit and/or configurable decoder may be configured to address the specific problem.
Recall that an x-to-n decoder produces a set S of subsets of Zn. We denote the number of elements in S by Λ; this is the total number of subsets produced by the decoder. Clearly, Λ≦2x. The decoder allows some of the Λ subsets to be chosen arbitrarily (the independent subsets) while other subsets are set by prior choices (the dependent subsets). Let S′⊆S denote the portion of subsets that can be produced independently by the decoder. For instance, in a LUT decoder, all entries are independent, while in a fixed decoder (non-configurable) there are no independent subsets. We define the following two parameters that are specific to decoders. Number of independent subsets=λ=number of elements in S′
Total number of subsets=Λ; clearly λ≦Λ≦2x.
Basic circuit hardware is used as building blocks, in particular fan-in and fan-out circuits, one-hot decoders, multiplexers, look-up tables (LUTs), shift registers, and modulo-α counters. A brief explanation of each follows:
Fan-in and Fan-Out:
A fan-in operation combines ƒ signals into a single output, while a fan-out takes a single input signal and generates ƒ output signals. The fan-in and fan-out operations are as follows:
For integers ƒ, z>1, let U0,U1, . . . , Uƒ−1 be ƒ signals, each z bits wide. A fan-in operation of degree ƒ and width z produces a z-bit output W whose ith bit W(i)=U0(i)∘U1(i)∘ . . . ∘Uƒ−1(i). The operator ∘ is an associative Boolean operation, such as AND, OR, NOR, etc. Diagrammatically,
For integers ƒ, z>1, let U be a z-bit wide signal. A fan-out circuit of degree ƒ and width z produces ƒ outputs W0,W1, . . . , Wƒ−1, each z bits wide, where Wj(i)=U(i). Diagrammatically,
Fan-in and fan-out circuits of degree ƒ and width z can be constructed with a gate cost of O(ƒz) and a delay of O(log ƒ).
As we noted earlier, all gates are assumed to have a constant fan-in and fan-out of at least 2; that is, the maximum number of inputs to a gate and the maximum number of other gates driven by the output of a given gate are independent of the problem size. When the fan-out of a signal in a circuit exceeds the driving capacity of a gate, buffers are inserted into the design. These additional buffers increase the cost and delay of the circuit. Gates typically have a fixed number of inputs. Realizing gates with additional inputs boils down to constructing a tree of gates. Assuming a non-constant fan-in and fan-out ignores the additional gate cost and delay imposed by these elements; assuming some constant fan-in and fan-out (rather than a particular technology-dependent constant) will not change the asymptotic costs and delays.
Fixed Decoders—One-Hot Decoders:
An x-to-n decoder is a (usually combinational) circuit that takes x bits of input and produces n bits of output, where x<n. Usually x<<n, and a decoder is used to expand an input from a small (2x)-element domain to an output from a large (2n)-element set.
Decoders can be divided into two broad classifications: (a) fixed decoders, which are inflexible, and (b) configurable decoders, where the set of subsets produced can be changed (or reconfigured) in some manner (typically off-line). One typical fixed decoder is the one-hot decoder.
In a one-hot decoder that operates on an input bit pattern of log n bits and produces an output bit pattern of n bits, each of the n-bit output patterns has only one active bit (usually with a value of ‘1’), all other bits being inactive (usually ‘0’). Such a decoder is exemplified by set S0 in Table 1. This decoder, in effect, selects one element at a time. Usually, a one-hot decoder also has a select input that allows the output set to be null. The one-hot decoder is used so often that the term “decoder” in the context of combinational circuits is usually taken to mean a one-hot decoder. A typical implementation of a one-hot decoder is shown for a 4-to-16, one-hot decoder in
In general, an x-to-2x one-hot decoder has a delay of O(x) and a gate cost of O(x2x).
Multiplexers:
A multiplexer or MUX is a combinational circuit that selects data from one of many inputs and directs it to a single output line. In general, a 2x-to-1 multiplexer (or a (2x)-input multiplexer) takes 2x data inputs and using x control bits, selects one of the 2x inputs as the output.
An example of a typical implementation of a multiplexer with four inputs is shown in
A 2x-to-1 multiplexer can be implemented as a circuit with a gate cost of O(x2x) and a delay of O(x).
Look-Up Table:
A 2x×m LUT is a storage device with m2x storage cells organized as 2x words, each m bits long; see
While LUTs can be implemented in a variety of ways, all LUTs require the same two components: a memory array and a method of addressing a word in the memory array. One possible method of addressing the LUT is to use an x-to-2x one-hot decoder. The output of the one-hot decoder activates a wordline and enables the outputs of the memory storage cells. Each of the memory storage cell outputs are then fanned-in to form an m-bit output word. See
A 2x×m LUT can be implemented as a circuit with a gate cost of O(2x(x+m)) and a delay of O(x+log m).
Shift Register (Parallel to Serial Converter):
Define an α-position shift register of width z/α as follows. It accepts as input a z-bit signal and, every clock cycle, outputs a (z/α)-bit slice of that signal. Used in the reverse (serial-to-parallel) direction, such a shift register can accept n/α bits during each cycle and output an n-bit word every α cycles.
An α-position shift register of width z/α can be realized as a circuit with a gate cost of O(z) and a constant delay between clock cycles.
Modulo-α Counter:
For any α>1, a modulo-α (or mod-α) counter increments its output by ‘1’ every clock cycle, returning to ‘0’ after a count of α−1. Modulo-α counters are well known in the art.
A modulo-α counter can be realized as a circuit with gate cost O(log2α) and a delay of O(log log α).
The base unit of the invention is the mapping unit, and its features are diagrammed in
The mapping unit accomplishes the expansion of the z-bit source word to the n-bit output word by “multicasting” the z bits to n places. A multicast of z bits to n bits (or z places to n places) is a one-to-many mapping from the z source bits to the n output bits, such that each output bit is mapped from exactly one source bit, but each source bit may map to 0, 1 or more output bits. The multicast operation typically transfers the value of a source bit to the output bit it is mapped to. Here we will use it in a more general sense in that the output bit derives its value from the source bit it is mapped from, for example by complementation. Unless we note otherwise, a multicast transfers the value of each source bit to its corresponding output bits. (The inclusion of parameters y and α in the mapping unit MU(z,y,n,α) will be described later.)
As an example, a fixed mapping of 4 to 8 bits can be represented as a 4 to 8 multicast, and is diagrammed in
As an illustration, consider a multicast of four bits a(3),a(2),a(1),a(0) to 8 bits b(7),b(6),b(5),b(4),b(3),b(2),b(1),b(0), such that b(0)=a(0), b(1)=b(3)=b(5)=b(7)=a(3), b(2)=b(6)=a(2) and b(4)=a(1). If a=0111, then b=01010101.
Another characterization of a multicast is in terms of an ordered partition. Consider a multicast of bits a(z−1),a(z−2), . . . , a(1),a(0) to bits b(n−1),b(n−2), . . . , b(1),b(0). An ordered z-partition S0,S1, . . . , Sz−1 of Zn={0,1, . . . , n−1} represents this multicast if and only if, for all bit positions j of a particular block Si, b(j) gets its value from a(i).
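As an informal sketch (ours, using the convention just stated in which block Si takes its value from source bit a(i)), the multicast represented by an ordered partition can be applied as follows in Python; the earlier 4-to-8 example is used as a check.

def multicast(source_bits, ordered_partition, n):
    # source_bits[i] is bit a(i); ordered_partition[i] is the block Si of output
    # positions that receive the value of a(i).
    out = [0] * n
    for i, block in enumerate(ordered_partition):
        for j in block:
            out[j] = source_bits[i]
    # return the output word with bit n-1 written first (leftmost)
    return ''.join(str(out[j]) for j in range(n - 1, -1, -1))

# The earlier example: a(0)->{0}, a(1)->{4}, a(2)->{2,6}, a(3)->{1,3,5,7}.
a = [1, 1, 1, 0]                      # a(0), a(1), a(2), a(3) for a = 0111
partition = [{0}, {4}, {2, 6}, {1, 3, 5, 7}]
print(multicast(a, partition, 8))     # '01010101', matching b above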
For example, the multicasts of
μ:Z2
In summary, MU(z,y,n,α) accepts as input a z-bit source word, U, and an ordered partition {right arrow over (π)} (one among 2y) as selected by the y-bit selector address, B, of
As described, a mapping unit is a decoder that accepts as input a z-bit source word u and an ordered z-partition {right arrow over (π)} of an n-element set (specified in terms of a y-bit selector address). It produces an n-bit output word. Mapping units can be classified as integral or bit-slice. An integral mapping unit generates all n output bits simultaneously and (for reasons explained below) has the parameter α set to 1. A bit-slice mapping unit, on the other hand, generates the n output bits in α rounds; i.e., n/α bits at a time. One could view the integral mapping unit as a bit-slice mapping unit with α=1. Another way to categorize mapping units (both integral and bit-slice) is in terms of whether they are fixed or configurable (that is, based on whether they can be configured off-line to alter their behavior). Configurable mapping units can be general or universal. In informal terms, a universal mapping unit can produce any subset. Fixed mapping units cannot be universal (unless n is very small or a very high cost can be accepted).
A General Model of a Mapping Unit:
A general structure of a mapping unit, MU(z, y, n, α), is as shown in
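A behavioral sketch of this general structure (ours; the hardwiring and control values below are arbitrary assumptions chosen only for illustration) treats each of the n output positions as a multiplexer with 2y hardwired taps into the source word, with the selector module supplying each multiplexer a y-bit control value:

def mapping_unit(source_bits, hardwired, controls):
    # source_bits[i] is source bit a(i).
    # hardwired[j] lists the 2**y source-bit indices wired to the inputs of MUX j.
    # controls[j] is the control value (0 .. 2**y - 1) applied to MUX j.
    return [source_bits[hardwired[j][controls[j]]] for j in range(len(hardwired))]

# A toy MU(z=4, y=1, n=8): each MUX chooses between two hardwired source bits.
hardwired = [[0, 1], [1, 2], [2, 3], [3, 0], [0, 2], [1, 3], [2, 0], [3, 1]]
source = [1, 0, 1, 0]                 # a(0) .. a(3)
controls = [0] * 8                    # a fixed unit would broadcast one value to all MUXs
print(mapping_unit(source, hardwired, controls))   # output bits b(0) .. b(7)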
We now describe two main types of mapping units, fixed and configurable. Other types and variants are described later.
Fixed Mapping Unit:
In the fixed mapping unit (FMU), the y-bit selector address is broadcast as the control signal to each MUX. That is, the selector module constructs the selector word by concatenating n copies of the selector address. Therefore, yi=y. As an example, let z=4, y=1, and n=8. Then there are 2y=2 ordered partitions mapping the 4 source word bits to the 8 output word bits. Let the mappings be as shown in
The resulting FMU is shown in
Configurable Mapping Unit:
When the ordered partitions of a mapping unit are fixed (as in an FMU), it can be shown that certain subsets cannot be produced. Here, we seek to provide a means to change the ordered partitions off-line in a configurable mapping unit (or CMU).
In a CMU, the selector module is a 2y×ny LUT (called the configuration LUT). Each ny-bit LUT-line is a selector word containing n MUX control signals, each y bits long. That is, yi=y. However, unlike the FMU, the values stored in the LUT are completely unconstrained.
It can be shown that a configurable mapping unit, MU (z,y,n,α), can be realized as a circuit having a gate cost of O(ny2y) and a delay of O(y+log n).
As an example of the functionality of a configurable mapping unit, consider the fixed mapping unit with z=2y of
There are two important properties of the configurable mapping unit. The first is that from a perspective outside of the mapping unit, nothing changes between a fixed mapping unit and a configurable mapping unit; that is, to produce a desired subset Sji the same values are needed for signals U and B in a
configurable mapping unit as they are in a fixed mapping unit. The second is that each “grouping” of the y control bits (each corresponding to a particular MUX) in the ny-bit selector words has the same value in an FMU; if this value is ν, then each of the n output bits is derived from the ordered partition {right arrow over (πν)}. However, this does not have to be the case in a CMU. For example, a word in the LUT illustrated in Table 3 could have the value 00 01 10 11 00 01 10 11; this is a combination of values of different ordered partitions for different MUXs. For example, bits 7, 6 and 5 of the 8-bit output word would be derived from {right arrow over (π0)},{right arrow over (π1)}, and {right arrow over (π2)}, respectively, as 00, 01 and 10 are the binary representations of 0, 1 and 2, respectively. This would result in a multicast with the ordered partition {7,6,3,1},{4,2},{0},{5}.
Not all sets of subsets can be generated by the CMU, however, as fixing the multicasts of the bits of the source word to the MUXs may preclude certain subset considerations.
The main function of the mapping unit is to convert a set of source words into a set of output words that correspond to a given set S of subsets of Zn. In order to achieve this we consider two scenarios.
Constructing Partitions Given a Set of Subsets:
As we described earlier, an ordered partition is an abstract representation of a multicast from the source word to the output word. It is possible for different source words to use the same ordered partition to generate different output words (or subsets). Ideally, the 2z source words and 2y (ordered partition) selector words should produce 2z+y distinct output words, each of which must be one of interest to us. This requires a careful selection of ordered partitions and source words.
Here we describe a procedure (called Procedure Part_Gen) that creates partitions (multi-casts) for a mapping unit MU(z,y,n,α). As a vehicle for explanation, we will also impose an (arbitrary) order on the partitions we generate. Later we will present a method to order the partition systematically. Procedure Part_Gen generates one of many possible sets of partitions. Subsequently, another procedure will outline how one could use Procedure Part_Gen to find a suitable set of partitions.
Let S be a set of subsets of Zn that we wish the mapping unit to generate. A given subset S of Zn (i.e., a particular n-bit output word having bit positions indexed 0 to n−1) induces a 1- or 2-partition πS, where πS is the 1-partition {Zn} if S is empty or S=Zn; otherwise, πS is the 2-partition {S,Zn−S}. The induced partition is not unique for a given S as πS=πZn−S.
Let S={S0,S1, . . . , Sk−1} be a set of subsets of Zn, and let each subset Si induce the partition πSi. Two elements of Zn are in the same block of the product πS0πS1 . . . πSk−1 if and only if they are in the same block of every πSi. Consequently, each Si is a union of blocks of this product, so if the product has at most z blocks, a single z-partition (a suitably ordered form of the product) can generate every subset in S.
The procedure to create a set of z-partitions that generate a given set S of subsets of Zn is as follows. It assumes that the subsets of S are ordered in some manner. We indicate this by the symbol {right arrow over (S)}. At this stage it is not important how the subsets are ordered. We will assume that the indices of the elements of {right arrow over (S)} reflect their order. This order determines the order in which the algorithm will consider each subset and does not reflect how the partitions will be ordered.
Procedure Part_Gen({right arrow over (S)},z); generates partitions for {right arrow over (S)}, each with ≦z blocks.
The partitions π0,π1, . . . are the outputs of Procedure Part_Gen. The basic idea of the procedure is to “add” subsets in the prescribed order into the current partition until the partition has too many blocks. Then it starts afresh with the next partition. We will use this notion of “adding a subset” to an existing partition later in this discussion. We illustrate the procedure with the following example.
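Before turning to the example, the idea just described can be rendered as an informal Python sketch (ours; it is one possible reading of the procedure, not a definitive implementation): subsets are taken in the given order, each induced 1- or 2-partition is multiplied into the current partition, and a new partition is started whenever the product would exceed z blocks.

def induced_partition(S, universe):
    # The 1-partition {Zn} if S is empty or equals Zn; otherwise {S, Zn - S}.
    S = set(S)
    rest = universe - S
    return [universe] if not S or not rest else [S, rest]

def product(p1, p2):
    return [b1 & b2 for b1 in p1 for b2 in p2 if b1 & b2]

def part_gen(ordered_subsets, z, n):
    universe = set(range(n))
    partitions, current = [], [universe]
    for S in ordered_subsets:
        candidate = product(current, induced_partition(S, universe))
        if len(candidate) <= z:
            current = candidate         # "add" the subset to the current partition
        else:
            partitions.append(current)  # too many blocks: start afresh
            current = induced_partition(S, universe)
    partitions.append(current)
    return partitions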
Let S=S0∪S1∪S2 using the sets in Table 2, let z=4, then
{right arrow over (S)}={S00,S10,S20,S30,S11,S21,S02,S12,S22,S32}
(Note that S01 and S31 are not included as these are repeated elements). The induced partitions corresponding to each Sji are in Table 4. Then using the Procedure Part_Gen we obtain
π0={{7,5,3,1},{6,2},{4},{0}} (accounting for the subsets of S0),
π1={{7,6,5,4},{3,2},{1,0}} (accounting for the subsets of S1), and
π2={{7,5},{2,0},{6,4,3},{1}} (accounting for the subsets of S2).
As we noted earlier, Procedure Part_Gen does not produce ordered partitions. However, we order them here to illustrate a few points. Order the partitions as
{right arrow over (π)}0={7,5,3,1},{6,2},{4},{0},
{right arrow over (π)}1={7,6,5,4},{3,2},{1,0},
{right arrow over (π)}2={7,5},{2,0},{6,4,3},{1}.
The mapping unit uses these ordered partitions with values of the source words shown in Table 5 to generate each subset of S. Actually, the table illustrates the impact of two different orders on the partitions and is discussed later. For now, it suffices to observe the first set of 4 rows that apply to {right arrow over (π)}0 that includes the subsets of S0.
We now touch upon a few points about the relationship between ordered partitions, the source word and the output word (or subset). A subset can be generated in a variety of ways, as the same z-bit source word applied to different ordered partitions can result in the same value. In addition, two different source words applied to two differently ordered partitions can result in the same value.
A subset not in S can also be produced. For example, using the z-bit source word 1010 with the ordered partition {right arrow over (π)}0 produces the output word 10111010 that corresponds to the subset {7,5,4,3,1} which is not in S.
Subsets and their induced partitions may be repeated. For example, subsets S30 and S31 of the above example are equal; the above procedure ignores repeated subsets and their induced partitions in generating ordered partitions. However, partitions corresponding to classes of algorithms or specific applications may benefit from repeating subsets, that is, to include the repeats.
A partition with fewer than z blocks, such as π1, results in “don't care” values (d) for the bits not corresponding to any block in the partition. Thus, the subset S11 with source word d011 may be produced from the source word 0011 or 1011.
In the procedure, a different sequence of considering the induced partitions can produce a different set or number of ordered partitions. For example, if the induced partitions were considered in reverse order (that is, starting with πS32), then the resulting ordered partitions would be {right arrow over (π0)}={7,5},{2,0},{6,4,3},{1}, {right arrow over (π1)}={7,6,5,4},{3,2},{1},{0}, and {right arrow over (π2)}={7,5,3,1},{6,2},{4,0}.
The conversion of an unordered partition to an ordered partition can be done in as many as z! ways. Some of these may be more advantageous than others. An ordering that results in common source words used to produce the subsets of Si and Sk (corresponding to different ordered partitions) can be useful when the mapping unit is used as part of a larger design. This is because the same z-bit source words can be used to produce both Si and Sk. Table 5 demonstrates two ordered partitions for S0 and S1, resulting in two sets of source words for each set. Note that using ordered partition {7,5,3,1},{6,2},{4},{0} for S0 and {7,6,5,4},{3,2},{1},{0} for S1 results in the same set of 4 source words for both sets of subsets. We describe a similar effect for binary reductions (discussed later).
It can be shown that, if the partitions of the mapping unit MU(z,y,n,α) are not fixed, then the mapping unit can generate a number of independent subsets≧2y [log z], provided 2y log z≦2x. If the partitions are fixed and z+y≦n, then it can be proved that the number of independent subsets is 0.
Table 5 lists, for example, the orderings {7,5,3,1},{6,2},{4},{0} and {4},{6,2},{7,5,3,1},{0} for the partition generating the subsets of S0, and the orderings {7,6,5,4},{3,2},{1},{0} and {3,2},{0},{7,6,5,4},{1} for the partition generating the subsets of S1.
It can be shown that for integers n, z≧2, there exists a mapping unit that uses C values from {0, 1, . . . , 2z−1} as source words, together with a set of Y ordered partitions, to produce CY distinct subsets. That is, it is possible to construct a mapping unit with z+y bits of input that produces 2y(2z−2) distinct outputs (which is not too far from the theoretically maximum possible number of 2y+z=2y2z distinct outputs).
Checking a Partition for Realizability:
Suppose a partition places output word indices i and j in the same block. Suppose the hardwired connections are such that no bit of the source word connects to both MUXs i and j. In this case, we cannot select a source word bit to multicast to output word bits i and j. That is, the given partition cannot be realized on the existing hardwired connections.
Here we present an algorithm that determines whether a given partition can be realized on a given set of hardwired connections, and if so, the algorithm determines a way to order the partition so that it can be realized.
For each output word bit position 0≦j<n, let Gj denote the set of source word bits that have been hardwired to one of the data inputs of the MUX at position j. For example in the mapping unit of
For any block B of a partition, let HB=∩jεBGj.
Call the set HB, the source set of block B (with respect to the given set of hardwired connections).
A partition π is said to be realizable on a set of hardwired connections between the source word and MUX inputs if and only if there exists, for each output position j, an assignment of a source word position ijεGj such that for any two output bit positions 0≦j,j′<n in (not necessarily distinct) blocks B and B′ of π, ij=ij′ if and only if B=B′.
Clearly, a given partition may not be realizable on a set of hardwired connections. Is it possible to check if a given partition π is realizable, and if so, order it accordingly?
Given π, construct a bipartite graph gπ=(Zz∪π,E); that is, the set of nodes includes the bit positions of the source word and the blocks of π. For any iεZz and Bεπ, there is an edge between i and B if and only if iεHB.
For any graph, a matching on the graph is a set of edges of the graph such that no two edges are incident on the same node. A matching is a maximum matching, if no other matching has more edges than it.
Let the given partition π have k blocks. We now show that π is realizable if and only if gπ has a matching with k edges. Suppose gπ has a matching with k edges. Clearly, this matching cannot include an edge that is incident on more than one block. Therefore the matching has exactly one edge per block. Each edge in a matching matches a block B to a unique source word bit position in the source set, HB, of B. This implies that π is realizable and, in fact, the matching gives an order that must be imposed on π. Conversely, if π is realizable, then it must have a unique source word index iBεZz, for each block B, such that iBεHB. Since iBεHB, we have an edge between iB and B in graph gπ. Consequently, the unique correspondence to each block B from a source word position iB constitutes a matching with k edges. Finally, observe that if a k-element matching exists, then it must be a maximum matching as no matching can have more edges than there are blocks in the partition; this is because at most one edge in a matching can be incident on each node representing a block.
A simple method to impose a realizable order (if one exists) on an unordered partition is to find a maximum matching for its graph. If it has k edges (k is the number of blocks in π), then use it as indicated above to order π. If the matching has fewer than k edges, then no k-edge maximum matching exists and π is not realizable on the set of hardwired connections. Standard polynomial-time algorithms exist for maximum matchings on a bipartite graph.
Call the above algorithm for imposing a realizable order (if possible) on a partition as Procedure Realizable.
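An informal Python sketch of this check (ours; it uses a textbook augmenting-path bipartite matching rather than any particular library routine) matches each block to a distinct source word position drawn from its source set HB and reports the resulting assignment, or None when the partition is not realizable.

def realizable_order(blocks, G):
    # blocks: list of blocks (sets of output positions) of the partition.
    # G[j]: set of source word bit indices hardwired to the MUX at output j.
    H = [set.intersection(*(G[j] for j in B)) for B in blocks]   # source sets H_B
    match = {}                      # source bit index -> block index

    def augment(b, seen):
        # try to match block b, re-routing earlier matches along augmenting paths
        for s in H[b]:
            if s in seen:
                continue
            seen.add(s)
            if s not in match or augment(match[s], seen):
                match[s] = b
                return True
        return False

    for b in range(len(blocks)):
        if not augment(b, set()):
            return None             # fewer than k edges can be matched: not realizable
    assignment = [None] * len(blocks)
    for s, b in match.items():
        assignment[b] = s           # block b is driven by source word bit s
    return assignment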
Constructing Realizable Partitions:
Here we outline a strategy that invokes Procedure Part_Gen and Procedure Realizable to help produce a set of realizable partitions on the existing hardwired connections. We need to include all sets of S in the set of partitions, while keeping the number of partitions small. Key to doing this successfully is to order the elements of S “appropriately.”
Let B be a block of a partition π. Clearly, as B increases in size, HB tends to decrease in size. In fact, it is possible for HB to be empty, in which case the partition is clearly not realizable. If hardwired connections were random, a good strategy would be to construct partitions whose blocks have roughly the same size. This could be a guiding principle for the algorithm. If the hardwired connections follow some pattern, then that information could be used to develop a heuristic to select partitions with small blocks.
In general, determining an order that results in a realizable z-partition is not easy. In fact it is possible for the partition πS induced by a single subset S to be unrealizable.
We outline a strategy to construct realizable partitions for a given set of subsets. The strategy has three phases.
In the first phase we examine different orders for the elements of the given set S (that is, we consider different {right arrow over (S)}), then call Procedure Part_Gen collecting as many large partitions (with “nearly” z-blocks) as possible. Between each call to Procedure Part_Gen, we remove the subsets accounted for so far from S. The orders considered in this phase may be based on some knowledge of the subsets to be generated.
The second phase is based on the observation that a partition with fewer blocks has a higher likelihood of being realizable. In this phase we repeat the processing in the first phase, this time calling Procedure Part_Gen with different values for the second parameter that limits the number of blocks in a partition. That is, we try to construct a partition with many blocks, but will settle for one with few blocks, if necessary.
The third phase is needed for those subsets SiεS for which πSi itself is not realizable. The third phase splits these subsets Si further with the aim of generating the elements of Si a few at a time. This is similar to the approach followed by bit-slice mapping units (described later). In the extreme case, if Si is generated one element at a time, the strategy uses the same method currently followed in one-hot decoders.
The above approach will generate a set of realizable partitions. How small this set will be will depend on S and the amount of resources (time, memory etc.) that can be devoted to the algorithm. Although generation of an optimal number of realizable partitions is likely an intractable problem, many practical algorithms and the subsets they require exhibit a lot of structure, which makes them amenable to more analytical approaches (as illustrated in Section 4.3).
We now discuss approaches to hardwire a mapping unit and to configure a configurable mapping unit.
Configuring a Configurable Mapping Unit:
Consider now a set of 2y partitions πk (where 0≦k<2y), each of which is realizable on a set of hardwired connections. By the definition of realizability, we have ordered πk into an ordered partition {right arrow over (πk)}. Let ijk be the source word position associated with output j in some block Bε{right arrow over (πk)}. This implies that source word bit ijkεGj; that is, source word bit ijk is connected to some input (say input γjk) of the MUX corresponding to output j.
The configuration LUT consists of 2y words, each ny bits long. Denote the kth word by the n-tuple wk,0,wk,1, . . . , wk,n−1, where for any 0≦j<n, we have 0≦wk,j<2y. Configure the LUT so that wk,j=γjk. This will ensure that whenever line k of the LUT (or partition {right arrow over (πk)}) is addressed, it will activate input γjk of the MUX (or bit ijk of the source word) as required.
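As a sketch only (ours, following the block-to-source-bit convention used earlier), the configuration words can be assembled directly from the realizable ordered partitions and the hardwired connections:

def configure_lut(ordered_partitions, hardwired):
    # ordered_partitions[k] is the ordered partition pi_k: a list of blocks in which
    # block i takes its value from source word bit i.
    # hardwired[j] lists the source-bit indices wired to the inputs of MUX j.
    # Returns lut[k][j] = gamma_j^k, the control value for MUX j stored in LUT word k.
    n = len(hardwired)
    lut = []
    for partition in ordered_partitions:
        word = [0] * n
        for i, block in enumerate(partition):
            for j in block:
                # which input of MUX j carries source bit i; a ValueError here
                # signals that the ordering is not realizable on this hardwiring
                word[j] = hardwired[j].index(i)
        lut.append(word)
    return lut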
Hardwiring a Mapping Unit:
Here we offer approaches to hardwiring a mapping unit (at manufacture) in a manner that makes some classes of partitions realizable.
For 0≦l<2y and 0≦j<n, let ml,j represent input l of multiplexer j. The aim is to assign each of these multiplexer inputs to one of the z source word bits S0,S1, . . . , Sz−1.
Map input ml,j to bit Sq, where q=(l+2yj)(mod z). We call this mapping “overlapped mapping.” For example, if y=2, z=5 and n=16, then the sequence of source word bit indices is as follows:
With overlapped mapping, a set of q consecutive multiplexers have 2y−(q−1) common source word bits. If Q is a block of an unordered partition, then any of these common source word bits form HQ and can be used to assign the order of the block as indicated earlier.
As an example with n=16 and z=5, consider the partition π={B0,B1,B2,B3,B4}={{0,1,2,3},{4},{5,6,7},{8,9,10,11},{12,13,14,15}}. Here HB0={0}, HB1={1,2,3,4}, HB2={0,1}, HB3={2} and HB4={3}, so assigning source word bits 0, 4, 1, 2 and 3 to blocks B0, B1, B2, B3 and B4, respectively, gives a realizable order.
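The source sets above follow mechanically from the overlapped-mapping rule; the following short Python check (ours) reproduces them:

y, z, n = 2, 5, 16
# Overlapped mapping: input l of MUX j is wired to source bit (l + (2**y)*j) mod z.
G = [{(l + (2 ** y) * j) % z for l in range(2 ** y)} for j in range(n)]

blocks = [{0, 1, 2, 3}, {4}, {5, 6, 7}, {8, 9, 10, 11}, {12, 13, 14, 15}]
H = [set.intersection(*(G[j] for j in B)) for B in blocks]
print(H)   # [{0}, {1, 2, 3, 4}, {0, 1}, {2}, {3}]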
When additional flexibility is needed from the hardwiring, one could use “post-permutation,” in which overlapped mapping is applied to a permutation of the output and then the outputs permuted back as required. We illustrate this technique below.
In the earlier example of the section entitled “Hardwiring a Mapping Unit,” suppose that π′=B′0,B′1,B′2,B′3,B′4 where B′0={0,1,2,3}, B′1={4}, B′2={5,6,7}, B′3={10,11,12,13} and B′4={8,9,14,15}. The corresponding sets are HB′0={0}, HB′1={1,2,3,4}, HB′2={0,1}, HB′3={0} and HB′4={2,3}; since B′0 and B′3 compete for the single source word bit 0, π′ is not directly realizable with the overlapped mapping. However, if the outputs are permuted so that the elements of B′3 and B′4 occupy the positions of B3 and B4 of the earlier example, the resulting partition is realizable, and the permuted outputs can then be routed back to their required positions.
This post permutation can be achieved by a butterfly network whose switches are configurable 2-input multiplexers. This network has an O(n log n) gate cost and O(log n) delay; that would not significantly alter the cost of the mapping unit in most cases. Also the network can be configured as needed using standard permutation-routing algorithms for the butterfly network. It may also be possible to use a butterfly network with fewer than the standard 1+log n stages, as permutations among proximate outputs may not be required. This would further reduce the cost of the butterfly network.
However, the butterfly network is a blocking network (that is, certain permutations cannot be achieved). In principle, other (more expensive) non-blocking permutation networks can be employed to overcome this problem.
It must be noted that we are mapping a realizable block B′ to a required block B, where B and B′ have the same number of elements. If B and B′ each contain m elements, then within these blocks the elements can be mapped in m! ways. Thus, there may be a lot of room for the designer to avoid blocking in networks such as the butterfly.
Although this method does not guarantee that the hardwiring would allow every partition to be realizable, many practical problems that exhibit regularity and structure tend to be more amenable to analytical approaches and individualized fine-tuning.
In previous sections we described the fixed and configurable mapping units. While configurable mapping units provide more flexibility, they are more expensive than fixed mapping units. If the application does not call for such flexibility, a fixed mapping unit may be preferable. Here we describe two hybrid mapping units that use elements of both the fixed and configurable mapping units and occupy a middle ground between the flexibility and cost of the fixed and configurable mapping units.
A fixed mapping unit MU (z,y,n,α) fans-out the y-bit selector address to all n MUXs (shown as signals B0,B1, . . . , Bn−1; here B0=B1= . . . =Bn−1). In contrast, these ny bits of MUX control come from the configuration LUT in a configurable mapping unit; here y selector address bits are used to address at most 2y LUT locations, each ny bits long. In this case the signals B0,B1, . . . , Bn−1 are completely independent of each other. We now describe two hybrid schemes.
Hybrid Mapping Unit 1:
Let Zn be the set of (indices of) MUXs in the mapping unit. Divide this set into two disjoint subsets F and R (in any convenient manner that may depend on the application area). Use the y-bit selector address (suitably fanned out) to directly control all MUXs whose indices are in F; that is, for all iεF, the value of Bi equals the value of the y bits input to the mapping unit as in a fixed mapping unit. The remaining MUXs (with indices in R=Zn−F) receive their control inputs from the control LUT as in a configurable mapping unit. If R contains l≦n elements, then each LUT word has a size of ly bits. The advantage of this approach is that the LUT need not be as large as in a configurable mapping unit.
Hybrid Mapping Unit 2:
As before, let Zn represent the set of MUXs in the mapping unit. For some integer 1≦l<n, partition Zn into l blocks; let this partition be {R0,R1, . . . , Rl−1}. (This partition has nothing to do with the partition of the outputs associated with the multicast from the source word bits.) Use a configuration LUT with wordsize ly. If a configuration word has the form w0,w1, . . . , wl−1, then each MUX with index iεRj receives control input wj. As before, this reduces the size of the configuration LUT.
The advantage of both hybrids is that they reduce the size of the LUT word to ly<ny. This reduces the cost of the LUT if its size is kept the same. Alternatively, this can also allow one to increase the number of words in the LUT for the same cost as in the configurable mapping unit. An implication of this is that the configuration LUT can now store more partitions (say 2y′ partitions for some y′>y) for the same cost as the configurable mapping unit. This would require y′ bits to be input to the configuration LUT. However, only y of these bits would be used with F and each MUX (regardless of whether it is in F or R) would still use y control bits and, consequently, we would still hardwire only 2y source word bits to each MUX. This is needed to keep the collective cost of the n MUXs the same as before.
The hybrid mapping units can be viewed as generalizations of the fixed and configurable mapping units. For the first hybrid, when F=Zn (or R=Ø), we have the fixed mapping unit and when R=Zn (or F=Ø) we have the configurable mapping unit. The second hybrid is a generalization of the configurable mapping unit; if l=n, then we have the standard configurable mapping unit. When l=1, all MUXs receive the same control signal as in the fixed mapping unit, but if a LUT of wordsize y is used, then the y control bits of the MUXs need not be the same as the y (or y′) bits input to the mapping unit.
A mapping unit MU (z,y,n,α) is universal if and only if it can, under configuration, produce any set of 2y log z independent subsets of Zn. It can be shown that a configurable mapping unit with z=2y is universal. This is because, when z≦2y, each bit of the source word can be input to every MUX. Consequently, any block B has HB={0, 1, . . . , z−1}. Thus a universal mapping unit MU (2y,y,n,α) with O(ny2y) gate cost and O(y+log n) delay exists.
It is not known whether this is the only universal mapping unit.
A bit-slice mapping unit generates just part of the output subset (represented by an n-bit word) at a time. It constructs a subset over α iterations, generating n/α bits in each iteration. This allows the mapping unit to exploit repeated patterns, such as those demonstrated in Table 6, representing two forms of reduction. Notice that to generate 8 words, each 16 bits long, only 6 words, each 4 bits long, need to be generated.
For example, the subset S corresponding to word 0001000100010001 can be constructed over 4 iterations using the bit pattern 0001. Overall, this allows the bit-slice mapping unit to decrease the required gate cost of its internal components in situations where an increased delay is tolerable.
A possible implementation of MU(z,y,n,α) is shown in the accompanying figure. An input shift register accepts a z-bit source word every α cycles and delivers a (z/α)-bit slice each cycle to an internal integral mapping unit MU(z/α,y,n/α,1). The (n/α)-bit output of the internal mapping unit is stored in another shift register, which parallelizes the (n/α)-bit words into one n-bit word. A mod-α counter orchestrates this conversion by triggering a write-in operation on the input shift register and a write-out on the output shift register every α cycles. This allows a new source word to be input into the bit-slice mapping unit and an n-bit output word to be written out every α cycles.
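A behavioral Python sketch (ours; the internal unit below is a stand-in function, not a claimed circuit) of one α-cycle round: the input register emits a (z/α)-bit slice per cycle, the internal unit maps each slice to an (n/α)-bit slice, and the output register assembles the n-bit word.

def bit_slice_round(source_bits, alpha, internal_mu):
    # source_bits: z-bit source word, listed least significant slice first.
    # internal_mu: maps a (z/alpha)-bit slice to an (n/alpha)-bit slice.
    z = len(source_bits)
    width = z // alpha                       # width of the input shift register
    out = []
    for cycle in range(alpha):               # parallel-to-serial on the input side
        slice_in = source_bits[cycle * width:(cycle + 1) * width]
        out.extend(internal_mu(slice_in))    # serial-to-parallel on the output side
    return out                               # the assembled n-bit output word

# Hypothetical internal unit that multicasts each slice bit to two output positions.
print(bit_slice_round([1, 0, 1, 1], alpha=2,
                      internal_mu=lambda s: [b for b in s for _ in (0, 1)]))
# [1, 1, 0, 0, 1, 1, 1, 1]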
Because the bit-slice mapping unit is a sequential circuit, we modify the definition of delay. For sequential circuits, we take the clock delay of the circuit to be the longer of (a) the longest path between any flip-flop output and any flip-flop input and (b) the longest path between any circuit input and output. Using this notion of delay, it can be shown that a bit-slice mapping unit MU(z,y,n,α) can be realized in a circuit with a gate cost of
and a delay of O(α(log log α+log n+y)), and the number of independent subsets is
and the maximum total number of subsets producible is Λ=2y (2z−2), provided
A point that needs attention is the matter of how partitions play out in the bit-slice mapping unit. For example, the subsets of Table 6 produced by a fixed mapping unit MU(z,y,n,α) with z=5, 2y=2 require two ordered partitions
{right arrow over (π)}1={15,14,13,11,10,9,7,6,5,3,2,1},{12,4},{8},{0}
and
{right arrow over (π)}2={15,14,13,12,11,10,9,8},{7,6,5,4},{3,2},{1},{0}
and four 5-bit source words (11111, 00111, 00011, 00001) to produce the n=16-bit outputs. In a bit-slice mapping unit with α=4, two ordered partitions π⃗′1={3,2},{1,0} and π⃗′2={3,2,1},{0} (on a smaller 4-element set) and just three 2-bit source words (00, 01, and 11) are needed to produce the repeated patterns 0011, 0001, 0000, and 1111. For the particular subsets of Zn shown in Table 6, the bit-slice mapping unit shows good savings. In determining whether or not a bit-slice mapping unit is suitable for a design, a variety of considerations must be taken into account.
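The claim above is easy to check in software. The following sketch applies the two small ordered partitions to the 2-bit source words 00, 01 and 11 and prints the resulting 4-bit patterns; the expansion function is an illustrative model of the mapping unit, not the disclosed circuit.

```python
# Expand a 2-bit source word into a 4-bit pattern using an ordered partition of {0,1,2,3}.
# The first block listed is driven by the most significant source bit, matching the
# ordering convention used in the text.

def expand(ordered_blocks, source_word):
    n = sum(len(b) for b in ordered_blocks)
    out = ["0"] * n
    for bit, block in zip(source_word, ordered_blocks):
        for element in block:
            out[n - 1 - element] = bit          # element k of Zn is bit k of the output
    return "".join(out)

pi1 = [{3, 2}, {1, 0}]
pi2 = [{3, 2, 1}, {0}]

print(expand(pi1, "01"))   # -> 0011
print(expand(pi2, "01"))   # -> 0001
print(expand(pi1, "00"))   # -> 0000 (either partition gives the all-zero pattern)
print(expand(pi1, "11"))   # -> 1111
```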
Overall, a mapping unit (fixed, configurable, integral, or bit-slice) MU(z, y, n, α) has the following parameters.
(a) delay of O(α(y+log n)),
(b) gate cost of
(c) number of independent subsets producible=
and
(d) maximum number of subsets producible=Λ=2^y(2^z−2), provided
A configurable decoder has the same basic functionality as a fixed decoder. An x-to-n configurable decoder accepts an x-bit input word and outputs one of up to 2^x output words, each n bits wide. Unlike fixed decoders, the output of a configurable decoder is not fixed at manufacture. With configuration, the n-bit outputs can be changed to a different pattern of bits, thus supplying a degree of flexibility not present in fixed decoders.
A 2^x×n look-up table or LUT may be considered as a type of x-to-n configurable decoder. A 2^x×n LUT also takes in an x-bit input word and outputs up to 2^x words, each n bits wide, where the n-bit words are determined by the contents of the LUT's memory array. Unfortunately, this “LUT decoder” is expensive, with a gate cost of O(2^x(x+n)). If this decoder were implemented on the same scale as a log n-to-n one-hot decoder, then x=log n. This results in a decoder that, while able to produce (after configuration) any of the 2^n subsets of Zn, has a gate cost of Θ(n^2). On the other hand, if the LUT decoder were restricted to the same asymptotic gate cost as the one-hot decoder (that is, Θ(n log n)), it would only be able to produce Θ(log n) subsets of Zn (being at most a log n×n LUT). Although the flexibility of the LUT decoder is desirable, its cost does not scale well and an alternative is needed.
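As a rough numerical illustration of this cost gap (using the gate-cost expressions above as given and taking all constants as 1), the following sketch compares the 2^x(x+n) cost of the LUT decoder at x=log n with the n log n cost of a one-hot decoder; the figures are order-of-magnitude only.

```python
from math import log2

# Rough comparison of the O(2^x (x + n)) LUT-decoder cost at x = log n
# against the Theta(n log n) cost of a fixed one-hot decoder.
# All constants are taken as 1, so these are order-of-magnitude figures only.

for n in (64, 1024, 16384):
    x = log2(n)
    lut_decoder = (2 ** x) * (x + n)   # = n * (log n + n), i.e. Theta(n^2)
    one_hot = n * log2(n)              # Theta(n log n)
    print(f"n={n:6d}  LUT decoder ~{lut_decoder:14.0f}  one-hot ~{one_hot:10.0f}")
```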
The configurable decoder described here is a circuit that uses a LUT (of a smaller order of cost), combined with a mapping unit. The mapping units we consider have the same order of cost as the LUT; this allows the cost of the LUT to be kept as small as that of a fixed decoder while allowing a large number of n-bit subsets to be produced within the same order of gate cost as fixed decoders. We call these “mapping-unit-based” configurable decoders (or MU-B decoders). They take the same forms as the mapping unit itself: integral or bit-slice, fixed or configurable. It should be noted that the MU-B decoder is always configurable, as even one using a fixed mapping unit employs a LUT.
By incorporating a “narrow” output LUT with a mapping unit that expands this narrow output into a wide n-bit output representing a subset of Zn, a device is obtained that is reduced in cost (compared to a LUT decoder) but has substantial flexibility.
As shown in the accompanying figure, the x-bit input addresses the LUT, and the LUT's z-bit output word serves as the source word for the mapping unit.
The flexibility of the MU-B decoder depends on the LUT and the value of z, the size of the source word. While a z larger than a low-degree polynomial in n does not yield significant benefits and increases the LUT cost, a small z (such as z=log n) severely limits the number of independent subsets that can be generated by the mapping unit. Without the LUT, z would have to be this small to address the pin limitation problem. Thus the role of the LUT is to start from a small number of input bits and expand them to z bits, trading the value of z off against the number of locations in the LUT. This provides room for constructing the MU-B decoder to particular specifications.
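A minimal end-to-end software model of the MU-B decoder just described follows: an x-bit input addresses a small LUT, and the LUT's z-bit word is expanded by the mapping unit into an n-bit subset. The table contents, partition and parameter values below are illustrative assumptions, not values taken from the specification.

```python
# Software model of a MU-B decoder: x-bit input -> 2^x by z LUT -> mapping unit -> n bits.

def mu_b_decode(address, lut, ordered_blocks, n):
    source_word = lut[address]                       # z-bit source word read from the LUT
    out = ["0"] * n
    for bit, block in zip(source_word, ordered_blocks):
        for element in block:
            out[n - 1 - element] = bit               # mapping unit expands z bits to n bits
    return "".join(out)

# Illustrative parameters: x = 2, z = 4, n = 8, one hardwired ordered partition (y = 0).
lut = ["0001", "0011", "0111", "1111"]               # 2^x = 4 source words
blocks = [{7, 6, 5, 4}, {3, 2}, {1}, {0}]            # a 4-block ordered partition of Z8

for address in range(4):
    print(address, mu_b_decode(address, lut, blocks, n=8))
# 0 -> 00000001, 1 -> 00000011, 2 -> 00001111, 3 -> 11111111
```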
The next example illustrates a MU-B decoder with a bit-slice mapping unit. Consider the sets S0={S00,S10,S20,S30} and S1={S01,S11,S21,S31} shown in Table 7. Let S=S0∪S1 and let z=5 and 2^y=2. It is easy to verify that the ordered partitions for sets S0, S1 are
π⃗0={15,13,11,9,7,5,3,1},{14,10,6,2},{12,4},{8},{0}
and
π⃗1={15,14,13,12,11,10,9,8},{7,6,5,4},{3,2},{1},{0},
respectively. Then a MUB(x,z,y,n,1) with a fixed mapping unit would require 16 multiplexers with 2 inputs each and a 5×5 LUT to hold the values of the source words (note that this is due to an intelligent ordering; in general the LUT could be as large as 10×5).
Now assume that α=log n=4. Then in each iteration, the decoder must produce the 4-bit output words from the source words shown in Table 8. For these words, three partitions are needed:
π⃗0bs={3,2},{1,0},
π⃗1bs={3,1},{2,0},
π⃗2bs={3,2,1},{0}.
Since the original fixed mapping unit had values of z=5 and 2^y=2, the number of inputs to each multiplexer in the internal mapping unit of the bit-slice mapping unit would increase by one (from 2 to 3). However, the number of multiplexers would decrease from n=16 to n/α=4. This would imply a corresponding reduction in cost. Regardless, the LUT must still supply a z-bit word to the bit-slice mapping unit (which in this case may increase to a 6-bit word because of rounding up).
Thus, the implementation depends on the allowable costs, the number of z-bit source words and the corresponding size of the LUT, and the subsets that must be produced. Further, the ordering of the partitions can determine not only the size of the LUT in the MU-B decoder (and thus also the values of its parameters), but also the subsets that can be produced.
It can be shown that: for any α≧1, a mapping-unit-based configurable decoder MUB(x,z,y,n,α) has a delay of O(x+log z+α(y+log n)) and a gate cost of
further, MUB (x,z,y,n,α) can produce at least
independent sets. Finally, it can be shown that if 2^x≦2^z−2, and
then a MUB(x, z, y, n, α) can be built that produces 2^(x+y) distinct subsets of Zn.
Finally, to properly compare a LUT decoder with MUB(x, z, y, n, α), the following can be shown:
Let P be a LUT decoder, and let C be the proposed mapping-unit-based configurable decoder, each producing subsets of Zn. If both decoders have a gate cost G, such that G=Ω(n) and G is polynomially bounded in n, then for a constant greater than 0, C can produce a factor of
more independent subsets than P, and can produce a factor of
more dependent subsets, for any 0≦ε<1.
The above results indicate that with comparable cost for the LUT decoder and the proposed mapping-unit-based configurable decoder, the MU-B decoder is more flexible, generating more subsets than the LUT decoder.
Many applications and algorithmic paradigms display standard patterns of resource use. Here we examine three cases: (1) Binary Reduction, (2) one-hot and (3) Ascend/Descend.
Binary Reduction (in General, a Total Order of Subsets):
Consider the binary tree reductions (or simply binary reductions) shown in the accompanying figure.
In discussing binary reduction, we consider a more general case involving a set S of totally ordered subsets. Let S={S0,S1, . . . , Sk−1} be a set of k subsets of Zn such that S0⊃S1⊃ . . . ⊃Sk−1; that is, the elements of S are totally ordered by the “proper superset of” relation. For each 0≦i<k, let πSi denote the partition of Zn induced by Si.
For binary reduction, k=1+log n=log 2n in the above notation and Slog n=Zn. The partition
π={S0, S1−S0, S2−S1, . . . , Slog n−Slog n−1}
has log 2n blocks.
Consider the two binary reductions of the accompanying figure. The first reduction pattern has subsets S00={0}, S10={0,4}, S20={0,2,4,6} and S30={0,1,2,3,4,5,6,7}. This results in the partition π0={{7,5,3,1},{6,2},{4},{0}}. Similarly, the second reduction pattern produces the partition π1={{7,6,5,4},{3,2},{1},{0}}.
A binary reduction corresponding to a partition π={S0, S1−S0, S2−S1, . . . , Slog n−Slog n−1} can be implemented on a MUB(log log 2n, log 2n, 1, n, α). A MUB(log log 2n, log 2n, y, n, α) can implement 2^y different binary reductions. Since corresponding subsets in different binary reductions still have the same number of elements, the same set of log 2n source words can be used for all reductions; different ordered partitions need to be used, however.
The reduction corresponding to the unordered partition π0={{7,5,3,1},{6,2},{4},{0}} can be ordered so that the blocks (in the order shown) correspond to source words bits 3, 2, 1, 0 (where 0 is the least significant bit or lsb). Thus, the output set (represented as an n-bit word with bit 0 as the lsb) produced by source word s3,s2,s1,s0 and the ordered partition is s3,s2,s3,s1,s3,s2,s3,s0. To produce the sets S00,S10,S20,S30 the source words are 0001, 0011, 0111, 1111, respectively. If we now order π1={{7,6,5,4},{3,2},{1},{0}} so that the blocks (in the order shown) correspond to source word bits 3,2,1,0, then it is easy to verify that the same source words 0001, 0011, 0111, 1111 produce sets S01,S11,S21,S31, respectively.
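This claim is straightforward to verify in software. The sketch below applies both ordered partitions to the shared source words 0001, 0011, 0111 and 1111 and prints the subsets they produce; the expansion function is an illustrative model of the mapping unit.

```python
# Verify that the same four source words produce both binary-reduction sequences
# when applied to the two ordered partitions of the text (n = 8).

def to_subset(ordered_blocks, source_word):
    members = set()
    for bit, block in zip(source_word, ordered_blocks):   # first block <-> msb of source word
        if bit == "1":
            members |= block
    return sorted(members)

pi0 = [{7, 5, 3, 1}, {6, 2}, {4}, {0}]    # first reduction pattern
pi1 = [{7, 6, 5, 4}, {3, 2}, {1}, {0}]    # second reduction pattern

for word in ("0001", "0011", "0111", "1111"):
    print(word, to_subset(pi0, word), to_subset(pi1, word))
# 0001 [0]             [0]
# 0011 [0, 4]          [0, 1]
# 0111 [0, 2, 4, 6]    [0, 1, 2, 3]
# 1111 [0, 1, ..., 7]  [0, 1, ..., 7]
```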
One-Hot Decoders:
A set of one-hot subsets is a set of subsets of Zn each represented by an n-bit output word with each output word having only one active bit (usually with a value of ‘1’), all other bits being inactive (usually ‘0’). (The ideas we present also apply to decoders using an active-low logic where a ‘0’ represents inclusion of an element of Zn in the subset and ‘1’ represents exclusion of the element from the subset.) Table 10 shows an example for active-high logic. The structure of the partition induced by a set of one-hot subsets is a particular case of a set of disjoint subsets, which we now describe. Let S={S0,S1, . . . , Sk−1} be a set of subsets of Zn that are pairwise disjoint; that is, for any 0≦i,j<k with i≠j, Si∩Sj=Ø. Let Sk=Zn−(S0∪S1∪ . . . ∪Sk−1). It can be shown that the partition induced by the sets in S is {S0,S1, . . . , Sk−1,Sk} (omitting any empty blocks).
Thus if the given set S has k disjoint subsets, then the partition induced by S has at most k+1 blocks. For the one-hot set of subsets, k=n and the induced partition is {{0},{1}, . . . , {n−1}}. Moreover, because the subsets are disjoint, the product of any k of the partitions πSi induced by these subsets results in a partition with at least k blocks. Thus, if we were to construct z-block partitions, we would need on the order of n/z such partitions to capture a set of one-hot subsets, since z is of substantially smaller order than n. This would make the gate cost of the MU-B decoder too high to be of practical value.
Thus, while the one-hot sets are easy to produce in a conventional fixed decoder, they present a difficult case for the MU-B decoder described so far. One method of producing the one-hot subsets in a MU-B decoder is to use a LUT with 2^x=n rows (or x=log n). A LUT contains a one-hot address decoder, and since a configurable decoder MUB(log n,z,y,n,α) contains an n×z LUT, a simple switch allowing the output of the LUT's address decoder to be the output of the configurable decoder automatically allows the configurable decoder to produce the one-hot subsets. Also, the parallel decoder described subsequently teaches a simple way to construct a one-hot decoder out of MU-B decoders.
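A small sketch of this bypass idea follows, assuming the decoder exposes a mode switch: when the switch is set, the output of the LUT's internal one-hot address decoder is routed directly to the decoder output; otherwise the normal LUT-plus-mapping-unit path is used. The structure and names are illustrative, not the disclosed circuit.

```python
# Sketch of a MUB(log n, z, y, n, alpha) with a bypass switch: the LUT's internal
# one-hot address decoder (n rows) can drive the output directly.

def one_hot(address, n):
    """The address decoder inside an n-row LUT is itself a log n-to-n one-hot decoder."""
    return [1 if k == address else 0 for k in range(n)]

def mu_b_with_bypass(address, n, bypass, normal_path):
    if bypass:
        return one_hot(address, n)          # reuse the LUT's address-decoder output
    return normal_path(address)             # usual LUT + mapping-unit path

# Example with n = 8 and a stand-in for the normal path.
print(mu_b_with_bypass(5, 8, bypass=True, normal_path=lambda a: [0] * 8))
# -> [0, 0, 0, 0, 0, 1, 0, 0]
```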
Ascend/Descend:
Communication patterns can also induce subsets. For example, if a node can either send or receive in a given communication, but not both simultaneously, then for the ASCEND/DESCEND communication patterns shown in the accompanying figure, the sets of sending and receiving nodes at each level of the algorithm induce subsets of Zn.
The subsets of the Ascend/Descend class of communications are more difficult for a mapping unit to produce than those of the binary reduction. This is because the product of all induced partitions of the 2 log n subsets of the Ascend/Descend class of communications results in an n-partition of Zn, as in the one-hot case; again, as z<<n, this cannot be represented by a single z-partition. However, the partitions induced by ASCEND/DESCEND subsets can be combined more effectively.
For the next discussion, we recognize that ASCEND/DESCEND subsets come in complementary pairs that induce the same partition. In fact, each level of the ASCEND/DESCEND algorithm has one complementary pair; that is, there is one induced partition per level of the algorithm. For the moment, we consider just a set of log n ASCEND/DESCEND sets (one per level). It is easy to show that the product of any k partitions induced by k of the ASCEND/DESCEND sets has 2^k blocks, each of size n/2^k.
Thus, one method of generating ASCEND/DESCEND subsets is to use ⌈(log n)/(log z)⌉ z-partitions, each with 2 log z source words (where z is a power of 2, say z=2^k).
For example, the partition for the first level of communications is π1={{7,5,3,1},{6,4,2,0}}. Taken for log z such levels, this results in a single z-partition that with 2 log z source words can produce 2 log z of the different 2 log n subsets. For example, consider z=4. Then, log z=2, which implies that two levels can be represented by a single partition. If a partition represents levels one and two, then this results in the partition π={{7,3},{6,2},{5,1},{4,0}}.
Taken for all 2 log n subsets, this results in a total of ⌈(log n)/(log z)⌉ such partitions, and a total of 2 log z source words. Table 11 illustrates a possible ordering of the partitions and source words for the ASCEND/DESCEND sets shown in the accompanying figure. The two partitions (for n=8 and z=4) are
{7, 3}, {6, 2}, {5, 1}, {4, 0}
and
{7, 6, 5, 4}, {3, 2, 1, 0}.
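The following sketch generates the 2 log n = 6 ASCEND/DESCEND subsets of Z8 from the two partitions just listed, one complementary pair per level, with the first partition covering levels 0 and 1 and the second covering level 2. The specific source-word assignment is an illustrative choice consistent with the ordering convention used earlier, not a value taken from Table 11.

```python
# Generate the six ASCEND/DESCEND subsets of Z8 from two ordered partitions:
# partition A covers levels 0 and 1, partition B covers level 2.

def expand(ordered_blocks, source_word):
    members = set()
    for bit, block in zip(source_word, ordered_blocks):   # first block <-> msb of source word
        if bit == "1":
            members |= block
    return sorted(members)

part_a = [{7, 3}, {6, 2}, {5, 1}, {4, 0}]   # levels 0 and 1
part_b = [{7, 6, 5, 4}, {3, 2, 1, 0}]       # level 2

level0 = expand(part_a, "1010"), expand(part_a, "0101")   # odd nodes / even nodes
level1 = expand(part_a, "1100"), expand(part_a, "0011")
level2 = expand(part_b, "10"), expand(part_b, "01")

for level, pair in enumerate((level0, level1, level2)):
    print("level", level, pair)
# level 0 ([1, 3, 5, 7], [0, 2, 4, 6])
# level 1 ([2, 3, 6, 7], [0, 1, 4, 5])
# level 2 ([4, 5, 6, 7], [0, 1, 2, 3])
```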
Decoders can be structured in a parallel configuration utilizing a merge operation (such as an associative Boolean operation) to combine the outputs of two or more decoders. A parallel embodiment using MU-B decoders will be denoted MUB(x,z,y,n,α,P), where the parameter P denotes the number of configurable decoders connected in parallel. Although we present examples in which a parallel configurable decoder uses multiple instances of configurable decoders of the same size and type, they could, in principle, all be different.
A parallel configurable decoder can produce sets of subsets of Zn not easily produced by the configurable decoders previously presented. The following example demonstrates the use of parallel decoders to produce the one-hot decoder.
Consider two sets S0 and S1 of subsets of Zn, constructed as follows. Assume that an integer m divides n, or n=km for some integer k. Then Zn={0, 1, . . . , m−1, m, m+1, . . . , 2m−1, 2m, . . . , im, . . . , (i+1)m−1, . . . , (k−1)m, . . . , km−1}. For 0≦i<m and 0≦j<k, let
qi,0={i+ml : 0≦l<k}
and let
qj,1={jm+l : 0≦l<m}.
Clearly, qi,0 and qj,1 are subsets of Zn. Table 12 illustrates the subsets for n=20 and m=4.
Let S0={qi,0 : 0≦i<m} and S1={qj,1 : 0≦j<k}. The sets S0 and S1 induce partitions π0={qi,0 : 0≦i<m} and π1={qj,1 : 0≦j<k}, respectively.
For a suitable value of z (at least max(m,k)), two z-partitions of Zn can generate these subsets. Put differently, each subset of S0 and S1 can be independently generated by different MU-B decoders, each using just one partition. Note that qi,0∩qj,1={jm+i}, and it can be shown that for each x∈Zn, there exist unique values 0≦i<m and 0≦j<k such that x∈qi,0∩qj,1; hence S={qi,0∩qj,1 : 0≦i<m and 0≦j<k} is the set of one-hot subsets.
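The construction of Table 12 is easy to reproduce in software. The sketch below builds qi,0 and qj,1 for n=20 and m=4 and checks that every pairwise intersection is the expected singleton {jm+i}; the variable names are illustrative.

```python
# Build the two families of subsets for n = 20, m = 4 (so k = n // m = 5) and check
# that q_{i,0} and q_{j,1} intersect in exactly the singleton {j*m + i}.

n, m = 20, 4
k = n // m

q0 = {i: {i + m * l for l in range(k)} for i in range(m)}   # q_{i,0}, i = 0..m-1
q1 = {j: {j * m + l for l in range(m)} for j in range(k)}   # q_{j,1}, j = 0..k-1

for i in range(m):
    for j in range(k):
        assert q0[i] & q1[j] == {j * m + i}                 # one-hot intersections

print(sorted(q0[1]))   # -> [1, 5, 9, 13, 17]
print(sorted(q1[2]))   # -> [8, 9, 10, 11]
```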
A simple method to generate the one-hot subsets using parallel decoders is shown in the accompanying figure.
If m=√n, then both m and n/m=√n form feasible values for the size of the input to a mapping unit. Note that a y-bit selector address is not needed, as only one ordered partition is used; that is, y=0. (However, the y-bit input would allow additional subsets to be generated from additional partitions.) Thus, for the two MU-B decoders, z0=z1=√n and y0=y1=0; also n0=n1=n. Both MU-B decoders use a single partition, hardwired into their respective mapping units, as shown in the accompanying figure.
The cost of each MU-B decoder is the cost of a √n×√n LUT together with its mapping unit, which is Θ(n). Clearly, increasing y0 and y1 to any constant will increase the number of subsets produced without altering the Θ(n) gate cost.
Two smaller (log √n)-to-√n one-hot decoders arranged as shown in this example will also produce a larger log n-to-n one-hot decoder with O(n) cost (this is elaborated upon further below). However, the MU-B decoder approach offers room for additional partitions and hence additional subsets (within the same asymptotic cost) and considerably higher flexibility.
If the application calls for just a fixed one-hot decoder, a MU-B decoder could be much too expensive. Here the ideas presented for a parallel MU-B decoder are adapted to a fixed one-hot decoder. Let D0 and D1 be two instances of a (½ log n)-to-√n one-hot decoder (see the accompanying figure), used to generate the subsets qi,0 and qj,1, respectively. These outputs are the same as those illustrated in Table 12. Therefore, the log n-to-n one-hot decoder outputs τk (where 0≦k<n) can be obtained as τk=qi,0 AND qj,1, with k=jm+i.
Overall, this implementation of a one-hot decoder has O(log n) delay and O(n) gate cost. Compared to the conventional implementation of a one-hot decoder exemplified in the accompanying figure, this construction reduces the gate cost from Θ(n log n) to O(n).
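A software model of this construction follows, assuming n is a perfect square: two (½ log n)-to-√n one-hot decoders decode the high and low halves of the address, and each output bit of the larger decoder is the AND of one output from each. The function names are illustrative.

```python
# Build a log n-to-n one-hot decoder from two (1/2 log n)-to-sqrt(n) one-hot decoders.
# Output k of the large decoder is (output j of D1) AND (output i of D0), where k = j*m + i.

def small_one_hot(address, m):
    return [1 if t == address else 0 for t in range(m)]

def large_one_hot(address, n):
    m = int(n ** 0.5)                     # assumes n is a perfect square
    i, j = address % m, address // m      # low and high halves of the address
    d0 = small_one_hot(i, m)              # D0 decodes the low half
    d1 = small_one_hot(j, m)              # D1 decodes the high half
    return [d1[t // m] & d0[t % m] for t in range(n)]

out = large_one_hot(11, 16)
print(out.index(1), sum(out))             # -> 11 1  (exactly one active output, at position 11)
```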
In general, a P-element parallel configurable decoder MUB(x,z,y,n,α,P) is shown in the accompanying figure. It consists of P constituent MU-B decoders MUB0, MUB1, . . . , MUBP−1, where MUBi has parameters xi, zi, yi, ni, αi, whose outputs are combined by a merge unit.
Two decoders, say MUBi and MUBj, may use the same input bit(s) or share some common input bit(s) for their LUTs. Therefore, xi≦x and Σ_{i=0}^{P−1} xi≧x, as each input bit is assumed to be used at least once. Similarly, yi≦y and Σ_{i=0}^{P−1} yi≧y.
The merge unit could perform functions ranging from set operations (where ni=n for all i) to simply rearranging bits (when Σ_{i=0}^{P−1} ni=n). The (optional) control allows the merge unit to select from a range of options.
Clearly, each MUBi can produce its own independent set of ni-bit outputs. The manner in which these outputs combine depends on the merge unit. For example, let each MUBi produce an n-bit output (that is, a subset of Zn) and let Si be the independent set of subsets produced by MUBi. Let the merge operation be ∘, an associative set operation with identity S∘. Intersection, Union, and Ex-OR represent such operations with Zn, Ø, and Ø, respectively, as identities. If each MUBi produces a set of subsets Si that includes S∘, then the whole parallel MU-B decoder produces an independent set that includes ∪_{i=0}^{P−1} Si.
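A small sketch of this merge behaviour follows, using Union as the associative operation ∘ (identity Ø): because each constituent decoder can also produce the identity, any subset producible by a single MUBi is also producible by the parallel decoder as a whole. The constituent decoder stand-ins below are illustrative, not configurations from the text.

```python
# Parallel MU-B decoder with a Union merge unit: each constituent decoder contributes
# a subset of Zn, and the merge unit unions them together.  Since every constituent
# can output the identity (the empty set), each constituent's own sets remain producible.

from functools import reduce
from itertools import product

def merge_union(outputs):
    return reduce(lambda a, b: a | b, outputs, set())

mub0_sets = [set(), {0, 2, 4, 6}, {1, 3, 5, 7}]    # sets producible by MUB0 (includes identity)
mub1_sets = [set(), {0, 1, 2, 3}]                  # sets producible by MUB1 (includes identity)

producible = {frozenset(merge_union(pair)) for pair in product(mub0_sets, mub1_sets)}
print(len(producible))                             # -> 6 distinct subsets from 3 x 2 constituents
print(frozenset({0, 2, 4, 6}) in producible)       # -> True: MUB0's own set survives the merge
```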
Let MUBi have a delay of Di and a gate cost of Gi. If DM and GM are the delay and gate cost of the merge unit, then the delay D and gate cost G of the parallel MU-B decoder MUB(x,z,y,n,α,P) are D=max{Di : 0≦i<P}+DM and G=Σ_{i=0}^{P−1} Gi+GM.
If the merge unit uses simple associative set operations (such as Union, Intersection, Ex-OR) that correspond to bit-wise logical operations, then DM=O(log P) and GM=O(nP). Since x+y≦n, the overall cost and delay for this structure follow directly from these expressions.
The other variants of the MU-B decoder include a serial MU-B decoder and one based on a recursive bit-slice mapping unit. These variants are not preferred, as they do not provide any additional benefit over the designs described above based on a stand-alone mapping-unit-based decoder.
A serial MU-B decoder is shown in the accompanying figure.
Besides FPGAs and other reconfigurable computing platforms, applications of the MU-B decoder include sensor networks and external power controllers. Typical sensor networks consist of a collection of small sensor nodes (motes) that communicate to a base station through a distributed wireless network. Because the range of individual nodes is small, outlying nodes must relay their data to the base station through closer nodes. A large amount of power is expended during the receiving and transmission of data. Because of this, data must be compressed or encoded in some fashion so as to conserve power. This situation is similar to the pin limitation problem, wherein a large amount of data must be compressed in some fashion to pass through a small number of I/O pins. A decoder-based solution to the pin limitation problem could easily be applied to sensor networks, as the decoder itself would require no significant changes to the architecture of the sensor and would act as a method of compression for the data. A configurable decoder (and a reverse encoder) can serve to reduce the number of bits transmitted between sensor nodes without requiring a drastic redesign of the sensor nodes.
Power management and low-power operation have become driving factors in many applications (for instance, the design of embedded systems). An external power controller can reduce the clock frequency of a chip such that the overall power consumed by the chip is reduced. Used indiscriminately, this method can unnecessarily hurt the performance of the chip, as not all parts of the chip may require a reduction in power. A “smart” power controller could select portions of a chip for reductions in power, reducing the performance of only those portions that are not necessary for the chip's current execution. Thus, the overall power draw of the chip would be reduced without drastically affecting the performance. However, this ability is hampered by the large number of I/O pins that would be necessary for such addressing. A decoder-based solution that would allow efficient addressing of portions of a chip through a small number of I/O pins would directly address this problem. As the configurable decoder works to select a subset, this selection can be used by a smart agent that observes data from a collection of chips and issues commands to selectively power-down portions of these chips. A sharp focused selection (such as that afforded by the configurable decoder) could be useful in this environment.
The present application is a division of U.S. patent application Ser. No. 12/310,217, entitled “A Configurable Decoder with Applications in FPGAs”, filed Mar. 17, 2010, which is a 35 U.S.C. §371 application of International Patent Application No. PCT/US07/18406, filed Aug. 20, 2007, which claims the benefit of U.S. Provisional Application No. 60/838,651, filed Aug. 18, 2006. The contents of each of the foregoing applications are incorporated herein by reference hereto.
This invention was made with government support under grant number CCR-0310916 awarded by the National Science Foundation. The government has certain rights in the invention.
Provisional application:
Number | Date | Country
---|---|---
60/838,651 | Aug 2006 | US

Related U.S. application data:
Relation | Number | Date | Country
---|---|---|---
Parent | 12/310,217 | Mar 2010 | US
Child | 14/478,856 | | US