This invention relates to a binary logic circuit for determining a ratio x/d, particularly for the case in which x is an unsigned variable integer and d is a positive integer constant of the form 2^n±1.
It is a common requirement in digital circuits that hardware is provided for calculating a ratio x/d
for some input x, where d is some constant known at design time. Such calculations are frequently performed and it is important to be able to perform them as quickly as possible in digital logic so as to not introduce delay into the critical path of the circuit.
Binary logic circuits for calculating a ratio x/d are well known. For example, circuit design is often performed using tools which generate circuit designs at the register-transfer level (RTL) from libraries of logic units which would typically include a logic unit for calculating a ratio x/d. Such standard logic units will rarely represent the most efficient logic for calculating x/d in terms of circuit area consumed or the amount of delay introduced into the critical path.
Conventional logic for calculating a ratio x/d
typically operates in one of two ways. A first approach is to evaluate the ratio according to a process of long division. This approach can be relatively efficient in terms of silicon area consumption but requires w−n+1 sequential operations which introduce considerable latency, where w is the bit length of x. A second approach is to evaluate the ratio by multiplying the input variable x by a reciprocal:
Thus the division of variable x by 2^n−1 may be performed using conventional binary multiplier logic arranged to multiply the variable x by a constant c evaluated at design time. This approach can offer low latency but requires a large silicon area.
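By way of illustration only, the reciprocal approach can be sketched in software for the case of a 16-bit unsigned input and a divisor of 3. The constant 43691 (the ceiling of 2^17/3), the shift of 17 and the function name are assumptions chosen for this sketch and are not taken from the description above.

```python
def div3_by_reciprocal(x: int) -> int:
    """Floor-divide a 16-bit unsigned x by 3 using one multiply and one shift."""
    assert 0 <= x < (1 << 16)
    c = 43691  # assumed design-time constant: ceil(2**17 / 3)
    return (x * c) >> 17


if __name__ == "__main__":
    # Exhaustive check against ordinary integer division.
    assert all(div3_by_reciprocal(x) == x // 3 for x in range(1 << 16))
```

The multiplier needed to form x*c is what makes this approach costly in area when implemented as a circuit.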
According to a first aspect of the present invention there is provided a binary logic circuit for determining the ratio x/d in accordance with a rounding scheme, where x is a variable integer input of bit length w and d is a fixed positive integer of the form 2^n±1, the binary logic circuit being configured to form the ratio as a plurality of bit slices, the bit slices collectively representing the ratio, wherein the binary logic circuit is configured to generate each bit slice according to a first modulo operation for calculating mod(2^n±1) of a respective bit selection of the input x and in dependence on a check for a carry bit, wherein the binary logic circuit is configured to, responsive to the check, selectively combine a carry bit with the result of the first modulo operation.
The binary logic circuit may be configured to generate each bit slice i of the ratio by performing the first modulo operation x[w−1:n*(i+1)]mod(2^n−1), where i lies in the range 0 to
The binary logic circuit may be configured to, for each bit slice i, perform the check for a carry bit by:
The binary logic circuit may be configured to not combine a carry bit with the result of the first modulo operation in the event that the relevant condition is not satisfied.
For a given bit slice i, the check for the carry bit may use the result of the first modulo operation for mod(2^n±1) of the respective bit selection of the input x.
The binary logic circuit may comprise a plurality of modulo logic units each configured to perform a first modulo operation on a different respective bit selection of the input x so as to generate a set of modulo outputs.
The modulo outputs from at least one of the modulo logic units may be used to generate more than one bit slice of the ratio.
A plurality of the modulo logic units may be configured to operate in parallel.
A majority of the modulo logic units may be configured to operate in parallel.
The binary logic circuit may comprise combination logic configured to combine the set of modulo outputs so as to generate the bit slices of the ratio.
The combination logic may be an adder tree.
The combination logic may be configured to, for each bit slice i, perform the check for a carry bit.
The modulo outputs may be d-bit one-hot encodings.
The binary logic circuit may comprise an adder tree configured to determine the result of one or more of the first modulo operations by combining the results of first modulo operations on shorter bit selections from x to form the results of first modulo operations on longer bit selections from x, the binary logic circuit not including logic to evaluate those first modulo operations on longer bit selections from x.
The logic elements of the adder tree may comprise only AND and OR gates.
In the case d=2^n−1, the binary logic circuit may comprise a plurality of full adders each configured to perform, for a given bit slice i, the first modulo operation x[w−1:n*(i+1)]mod(2^n−1) and each full adder comprising:
The plurality of full adders may be arranged in a logic tree configured so as to generate each bit slice i of the ratio.
The reduction logic may be configured to interpret the bit selection of x as a sum of n-bit rows x′, each row representing n consecutive bits of the bit selection of x such that each bit of the bit selection of x contributes to only one row and all of the bits of x are allocated to a row, and the reduction logic is configured to reduce the sum of such n-bit rows x′ in a series of reduction steps so as to generate the sum of the first n-bit integer β and the second n-bit integer γ.
Each reduction step may comprise summing a plurality of the n-bit rows of x′ so as to generate a sum of one or more fewer n-bit rows.
The reduction logic may be configured to, on a reduction step generating a carry bit for a row at binary position n+1, use the carry bit as the least significant bit of the row.
The reduction logic may comprise one or more reduction cells each configured to sum a plurality of the n-bit rows of x′ so as to generate a sum of one or more fewer n-bit rows.
The reduction logic may comprise a plurality of reduction cells and the plurality of reduction cells may be configured to operate in parallel on the rows of x′ at each reduction step.
The length of the bit selection from input x for bit slice i may be vi and the reduction logic may comprise at least
reduction cells each operating on a different set of three rows of x′ such that, at each reduction step, the number of rows is reduced by approximately a third.
The reduction logic may comprise a plurality of reduction stages coupled together in series, each reduction stage comprising one or more reduction cells configured to operate in parallel so as to perform a reduction step.
The reduction logic may comprise a number of reduction stages equal to the number of reduction steps required to reduce the sum of n-bit rows x′ to the sum of n-bit integers β and γ.
The reduction logic may be configured to iteratively operate the one or more reduction cells over the rows of x′ until two rows remain which represent n-bit integers β and γ.
The binary logic circuit may further comprise:
The exception logic may be configured to form a determination result of 1 if all of the bits of the bit selection of x are 1 and a determination result of 0 if not all of the bits of the bit selection of x are 1, and the output logic comprising a XOR gate configured to receive the addition output and determination result as its inputs so as to form as its output the result of the first modulo operation.
The addition logic may comprise a compound adder configured to concurrently form a first sum β+γ and a second sum β+γ+1, and to provide the sums to a multiplexer configured to select between the first and second sums in dependence on whether the second sum generates a carry bit; the addition output of the multiplexer being the second sum if a carry bit is generated and the first sum if a carry bit is not generated.
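The selection performed by such a compound adder can be sketched in software as follows. This is a minimal model under the assumption that β and γ are both n-bit values; the function name is hypothetical. In the single case where both inputs are all ones the sketch returns 2^n−1 rather than 0, which is the case left to the exception logic.

```python
def compound_add(beta: int, gamma: int, n: int) -> int:
    """Model of a compound adder followed by a multiplexer for mod (2**n - 1)."""
    mask = (1 << n) - 1
    first = beta + gamma        # first sum
    second = beta + gamma + 1   # second sum
    carry = second >> n         # carry out of the n-bit second sum
    # Select the second sum if it generates a carry, otherwise the first sum.
    return (second if carry else first) & mask


if __name__ == "__main__":
    n = 3
    d = (1 << n) - 1
    for b in range(1 << n):
        for g in range(1 << n):
            if b == d and g == d:
                assert compound_add(b, g, n) == d  # exceptional all-ones case
            else:
                assert compound_add(b, g, n) == (b + g) % d
```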
The addition logic may comprise an adder configured to calculate the sum of the first and second binary values and 1, and the addition logic being configured to provide the n least significant bits of the sum right-shifted by n as the addition output.
In the case d=2^n+1, the binary logic circuit may comprise groups of one or more full adders, each group configured to perform, for a given bit slice i, the first modulo operation x[m−1:i*n]mod(2^n+1) where
each full adder of a group comprising:
The binary logic circuit may be configured to generate bit slices of length n.
According to a second aspect there is provided a method for determining the ratio x/d in a binary logic circuit in accordance with a rounding scheme, where x is a variable integer input of bit length w and d is a fixed positive integer of the form 2^n±1, the method comprising:
The result of each first modulo operation may be a d-bit one-hot encoding.
The result of one or more of the first modulo operations may be determined by combining the results of first modulo operations on shorter bit selections from x to form the results of first modulo operations on longer bit selections from x.
The performing a first modulo operation may comprise:
The binary logic circuit may be embodied in hardware on an integrated circuit. There may be provided a method of manufacturing, at an integrated circuit manufacturing system, the binary logic circuit. There may be provided an integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, configures the system to manufacture the binary logic circuit. There may be provided a non-transitory computer readable storage medium having stored thereon a computer readable description of an integrated circuit that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture the binary logic circuit.
There may be provided an integrated circuit manufacturing system comprising:
There may be provided computer program code for performing methods as described herein. There may be provided non-transitory computer readable storage medium having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform the methods as described herein.
The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:
The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The present invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.
Examples described herein provide an improved binary logic circuit for calculating a ratio x/d
where x is a variable input and d is a fixed integer divisor. In the examples described herein, x is an unsigned variable integer input of known length w bits, d is a positive integer divisor of the form 2^n±1, and div is the output, which has q bits. It will be appreciated that the principles disclosed herein are not limited to the particular examples described herein and can be extended using techniques known in the art of binary logic circuit design to, for example, signed inputs x, rounding schemes other than round to negative infinity, and divisors of related forms. For example, division by a divisor of the form d=2^p(2^n±1) may be readily accommodated by right-shifting x by p before performing division by 2^n±1 according to the principles described herein.
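The right-shift treatment of divisors of the form 2^p(2^n±1) can be checked numerically. In the sketch below the function name and the `sign` parameter (+1 or −1) are hypothetical, and Python's integer division merely stands in for a circuit dividing by 2^n±1.

```python
def divide_by_shifted_divisor(x: int, p: int, n: int, sign: int) -> int:
    """Floor of x / (2**p * (2**n + sign)) for unsigned x, with sign = +1 or -1."""
    odd_part = (1 << n) + sign
    # Right-shift by p first, then divide by the odd factor 2**n + sign.
    return (x >> p) // odd_part


if __name__ == "__main__":
    for p in range(4):
        for n in (2, 3):
            for sign in (+1, -1):
                d = (1 << p) * ((1 << n) + sign)
                assert all(divide_by_shifted_divisor(x, p, n, sign) == x // d
                           for x in range(1 << 12))
```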
A binary logic circuit 500 for evaluating a ratio
is shown in
Data store 501 may comprise a register for each bit of the input variable, as is illustrated in
The binary logic circuit 500 in the present example is configured to calculate:
div=⌊x/d⌋ (2)
The floor of the ratio x/d is calculated since rounding to negative infinity is being used.
The modulo output 514 of this division operation is therefore given by:
modulo=x−d*div (3)
Such that 0≤modulo≤d−1.
The present invention recognises that the division operation of equation (2) can be efficiently evaluated in a piecewise manner by calculating bit slices of length n of the ratio div from selections of bits of the input variable x. For a given n and known bit length w of the binary input x, the shortest bit length of the binary ratio div which can represent all possible outputs may be expressed as q:
Bits n*i to q−1 of output ratio div may be expressed as follows in equations (5) below, where i is selected so as to generate the q bits of div. The maximum value of the index
For example, for a 16 bit binary input x (corresponding to w=16) and a divisor of 3 (corresponding to n=2 for a divisor of the form d=2^n−1), q is 15 and i will take the range [0, 7].
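The figures in this example can be reproduced with a short calculation. The sketch below derives q directly from its definition as the shortest length able to represent every possible output, rather than from any closed-form expression; the function names are hypothetical.

```python
def output_width(w: int, d: int) -> int:
    """Smallest bit length able to hold floor(x/d) for every w-bit unsigned x."""
    largest_quotient = ((1 << w) - 1) // d
    return largest_quotient.bit_length()


def slice_count(w: int, d: int, n: int) -> int:
    """Number of n-bit slices needed to cover the q output bits."""
    q = output_width(w, d)
    return -(-q // n)  # ceiling division


if __name__ == "__main__":
    print(output_width(16, 3))    # q for w=16, d=3
    print(slice_count(16, 3, 2))  # number of values taken by the index i
```

For w=16, d=3 and n=2 this gives q=15 and eight slices, so that i runs over [0, 7].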
Taking modulo 2^n to yield the least significant n bits gives an expression for the ith n bit slice of the output div:
For the case d=2^n−1, equation (6) simplifies to the conditional equation:
div[n*(i+1)−1:n*i]=x[w−1:n*(i+1)]mod(2^n−1)+?1:0 (7)
Where the element +?1:0 indicates that a carry bit is added when the following condition is true and otherwise no carry bit (or zero) is added:
x[w−1:n*(i+1)]mod(2^n−1)+x[n*(i+1)−1:n*i]≥2^n−1 (8)
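The slice-by-slice evaluation for a divisor of the form 2^n−1 can be modelled in software as follows. This is a minimal sketch and not a description of the circuit itself: the function name is hypothetical, the slices are generated in a loop rather than in parallel, and Python's % operator stands in for the modulo logic.

```python
def divide_by_2n_minus_1(x: int, w: int, n: int) -> int:
    """Build floor(x / (2**n - 1)) from n-bit slices for a w-bit unsigned x."""
    d = (1 << n) - 1
    mask = (1 << n) - 1
    div = 0
    i = 0
    while n * i < w:
        upper = x >> (n * (i + 1))           # x[w-1 : n*(i+1)]
        low = (x >> (n * i)) & mask          # x[n*(i+1)-1 : n*i]
        r = upper % d                        # first modulo operation
        carry = 1 if r + low >= d else 0     # check for a carry bit
        div |= ((r + carry) & mask) << (n * i)  # bit slice i of div
        i += 1
    return div


if __name__ == "__main__":
    assert all(divide_by_2n_minus_1(x, 16, 2) == x // 3 for x in range(1 << 16))
```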
For the case d=2^n+1, equation (6) simplifies to the conditional equation:
div[n*(i+1)−1:n*i]=x[w−1:n*(i+1)]mod(2^n+1)−?1:0 (9)
Where the element −?1:0 indicates that a negative carry bit is added when the following condition is true and otherwise no negative carry bit (or zero) is added:
−x[w−1:n*(i+1)]mod(2^n+1)+x[n*(i+1)−1:n*i]<0 (10)
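A corresponding software model for a divisor of the form 2^n+1 is sketched below. It assumes that the negative carry applies when the modulo result exceeds the n-bit slice of x, that is, when the slice minus the modulo result is negative; this reading, the function name and the use of Python's % operator in place of the modulo logic are all assumptions of the sketch.

```python
def divide_by_2n_plus_1(x: int, w: int, n: int) -> int:
    """Build floor(x / (2**n + 1)) from n-bit slices for a w-bit unsigned x."""
    d = (1 << n) + 1
    mask = (1 << n) - 1
    div = 0
    i = 0
    while n * i < w:
        upper = x >> (n * (i + 1))           # x[w-1 : n*(i+1)]
        low = (x >> (n * i)) & mask          # x[n*(i+1)-1 : n*i]
        r = upper % d                        # first modulo operation
        borrow = 1 if low - r < 0 else 0     # assumed check for a negative carry
        div |= ((r - borrow) & mask) << (n * i)  # bit slice i of div
        i += 1
    return div


if __name__ == "__main__":
    assert all(divide_by_2n_plus_1(x, 16, 2) == x // 5 for x in range(1 << 16))
```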
In this manner, a potentially complex division operation may be reduced to, for each value of i, performing modulo arithmetic on bit selections from the input variable x. Each modulo calculation performed in accordance with equation 7 or 9 represents an n bit slice of the desired output div. The collective output from the set of modulo calculations over i therefore represents the complete q bit output div 513. The output 513 may be stored in one or more bit registers. Not all of the bits of output 513 need be available at the same time; for example, the n bit slices of div may be stored as and when they are generated by the slice logic until the complete q bit output div 513 is present at the registers.
The parameters d and w are known at design time such that the binary logic circuit may be configured as a fixed function hardware unit optimised to perform division by d. Any logic able to perform modulo arithmetic in accordance with equations 7 and 8 or 9 and 10 (as appropriate to the particular form of d) may be implemented at the slice logic 510.
The reduction of the division operation to a set of modulo calculations performed on portions of the input variable enables division operations to be performed at low latency while consuming limited area on a chip. The approach further offers substantial scope to optimise the binary logic circuit 500 to minimise latency and/or chip area consumption. Typically latency and chip area consumption are at least to some extent competing factors. The approach described herein allows a desired balance between them to be achieved.
The above approach enables substantial parallelization of a division operation. For example, multiple logic units may be provided at the slice logic so as to enable two or more (and possibly all) n bit slices of div to be generated in parallel. This enables very low latency to be achieved. The repetition in the modulo calculations performed in order to form the bit slices of div may alternatively or additionally be used to reduce the area consumed by the binary logic circuit. For example, instead of providing a logic unit for each instance of a given modulo calculation, fewer logic units (possibly one) can be provided than the number of instances of that calculation with the result of the calculation performed by a logic unit being used at multiple instances in the slice logic where that result is required.
The bit selections expressed in equations 7 to 10 above for each i are illustrated in
Continuing the example when n=2 for division by a divisor of the form d=2^n−1, if the length of the input variable x is w=16 bits then the length of the output div is, according to equation 4, q=15 bits. In this case, i takes the range [0, 7]. Thus the output div is formed in seven bit slices each of 2 bits and a single bit slice for the most significant bit div[14]. In this example binary logic circuit 500 is configured to perform division of a 16-bit input variable x by 3.
The slice logic may comprise modulo logic 511 and combination logic 512. The modulo logic is configured to perform modulo calculations of the form x[j:k]mod(2^n±1). The modulo logic 511 may comprise a set of logic units each configured to perform a modulo calculation; in some examples, each such unit may be configured to perform a different one of the possible modulo calculations given the range of values of j and k. The combination logic 512 is configured to combine the outputs from the modulo calculations according to equations 7 and 8 or 9 and 10 (as appropriate to the form of d) so as to generate the bit slices of div and hence the complete output 513. The combination logic 512 may perform the addition of a carry bit to each bit slice in accordance with equations 8 and 10 and using the outputs of the modulo calculations performed at modulo logic 511. The processing by the modulo logic and combination logic need not be sequential. For example, the combination logic may be in the form of a logic tree within which the modulo logic is arranged. In some embodiments, the modulo logic and combination logic may be one and the same unit.
Any suitable logic for calculating the modulo of x/d may be used. For example, modulo output 514 may be calculated at the slice logic 510 from output div 513 according to equation 3. However, this approach introduces additional latency and consumes additional chip area. It is preferable that the modulo of the division operation x/d is calculated concurrently with div at the slice logic. This could be achieved, for example, by configuring the slice logic to calculate x[w−1:0]mod(2^n±1)=modulo. A full adder configured in the manner described below could be used to calculate this result. More generally, any other suitable logic may be used, including, for example, logic configured using the one-hot encoding described below. It is advantageous if, whichever logic is used, the slice logic is configured to generate the modulo result at least partly using the results of one or more modulo operations x[j:k]mod(2^n±1) calculated in accordance with equations 7 and 9. For example, in the one-hot encoding example described below, the results of one or more modulo operations calculated using bits of x in the range x[w−1:n] could be combined with the result of x[n−1:0]mod(2^n±1) in the manner described for the one-hot example so as to yield the complete modulo=x[w−1:0]mod(2^n±1).
An example of the operations performed by the binary logic circuit 500 is illustrated in the flowchart of
The generation of a bit slice 1206 is repeated 1209 for each bit selection so as to generate the set of bit slices over i. This repetition need not be performed sequentially and the indicated repetition 1209 in
Note that the calculation performed at step 1202 need not be the same calculation in order to form each bit slice: the result of the modulo calculation for some bit selections from x may be determined based on the result of prior modulo calculations for other bit selections from x.
Exemplary implementations of the binary logic circuit 500 will now be described with respect to
One-Hot Modulo Encoding
It can be advantageous to use an encoding other than a simple binary encoding for the results of the modulo calculations performed by the slice logic 510. In this example, a one-hot encoding is used in which the result of each modulo calculation of the form x[j:k]mod(2^n±1) performed by the slice logic is encoded as a value d bits wide with only one high bit (i.e. one-hot) and all other bits zero. Configuring the slice logic 510 to generate one-hot encodings enables the results of modulo calculations to be combined in a particularly efficient logic tree so as to form the bit slices of output div. A one-hot encoding is particularly advantageous when n is small and can be employed for any odd value of d. This is because a one-hot encoding is d bits in length rather than the binary length of ⌊log_2(d−1)⌋+1 and hence the amount of logic required typically grows as d^2. n may be considered small when, for example, the divisor d=2^n±1 is 3, 5 or 7. In some implementations, n may also be considered small when d is 9, 11, 13, etc.
The number of bits required to represent a modulo calculation a mod b in binary form is ⌊log_2(b−1)⌋+1, where the output may take any value between 0 and b−1. For example, in the present case with a divisor d=2^n−1, a value of n=2 means that only ⌊log_2(2^2−1−1)⌋+1=2 bits are required to represent the output of the modulo operations in binary for the divisor d=3. A one-hot encoding encodes the outputs using d bits, in this case 3 bits. This follows because there are only 3 possible outputs of a mod 3 and hence only three one-hot encodings are required to represent those outputs:
All other values of a in the modulo calculation wrap around onto one of these three values. These values represent one choice of encoding—other encodings may be used where the possible one-hot representations are differently assigned to the possible outputs of the modulo operations. The modulo logic units may be configured to generate the relevant one-hot representations on performing a modulo operation. In other examples, the slice logic 510 may be provided with a look-up table (LUT) and configured to replace the binary outputs of the modulo logic units with the corresponding one-hot representations defined at the LUT.
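A one-hot encoding of a modulo result can be sketched as follows. The sketch assumes the assignment in which a result r sets bit r of the d-bit word (so that 0, 1 and 2 map to 001, 010 and 100 for d=3), which is only one of the possible assignments; the function names are hypothetical.

```python
def one_hot(value: int, d: int) -> str:
    """d-bit one-hot string for value mod d, with bit r set for result r (assumed)."""
    residue = value % d
    return format(1 << residue, f"0{d}b")


def binary_width(d: int) -> int:
    """Bits needed for a plain binary encoding of results 0..d-1, for d >= 2."""
    return (d - 1).bit_length()  # equals floor(log2(d-1)) + 1


if __name__ == "__main__":
    for a in range(6):
        print(a, one_hot(a, 3))  # results wrap around onto three encodings
    print(binary_width(3))
```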
Generally, given d-bit encodings of a and b and assuming d is odd, the ith bit of the one-hot encoding of (a*2^k+b)mod d can be calculated as follows. There are d^2 choices of values for (a, b) and for exactly d of these choices (a*2^k+b)mod d=i, for example the case (a, b)=(0, i). This is because there exists a value r such that (r*2^k)mod d=1.
The set of d values such that (a*2^k+b)mod d=i is (a, b)=(r*(i−j)mod d, j) for j=0, . . . , d−1. The ith bit of the one-hot encoding may then be given by:
And similarly for the 1st and 2nd bits of the 3-bit one-hot encoding.
Logic could be provided at the slice logic to generate these encodings for a given input, but preferably the one-hot encodings are written to a look up table so as to, on receiving the result of a modulo calculation, enable the logic to read out the one-hot encoding of that result.
An advantage of using a one-hot encoding is that the one-hot representations 810-813 of the outputs of the modulo logic units 805-808 may be efficiently combined at a logic tree 814. The logic tree 814 is an example of combination logic 512. This is because the use of a one-hot encoding allows the output of modulo operations on larger bit selections of x to be determined from combinations of the outputs of modulo operations on smaller bit selections of x. This follows from the fact that:
2^(n*i) mod(2^n±1)=(∓1)^i (11)
For example, x[f:g] mod (2^n±1) and x[g−1:h]mod (2^n±1) may be combined together to form x[f:h]mod (2^n±1) since according to equation 11:
(a*2^(n*i)+b)mod(2^n±1)=((∓1)^i*a+b)mod(2^n±1) (12)
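The way in which modulo results for adjacent bit selections combine can be checked numerically. In the sketch below the function name and the 12-bit test value are illustrative assumptions; the weight applied to the more significant part is computed generally as a power of two modulo d.

```python
def combine_residues(hi_mod: int, lo_mod: int, lo_width: int, d: int) -> int:
    """Residue mod d of a value whose upper part has residue hi_mod and whose
    lower lo_width bits have residue lo_mod."""
    weight = pow(2, lo_width, d)  # residue of the weight of the upper part
    return (hi_mod * weight + lo_mod) % d


if __name__ == "__main__":
    n = 2
    # The weights 2**(n*i) reduce to +1 for d = 2**n - 1 and to (-1)**i for d = 2**n + 1.
    for i in range(6):
        assert pow(2, n * i, (1 << n) - 1) == 1
        assert pow(2, n * i, (1 << n) + 1) == (-1) ** i % ((1 << n) + 1)
    x = 0b110101101111
    for d in (3, 5):
        hi, lo = x >> 4, x & 0xF
        assert combine_residues(hi % d, lo % d, 4, d) == x % d
```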
So following on from the example above where n=2,d=3, combinations of the outputs of mod 3 operations can be represented by one-hot encodings as follows:
Any suitable logic tree 814 may be used to combine the one-hot representations of the modulo outputs so as to form the output div. However, since only 1 bit is high for each representation of a modulo operation in the one-hot encoding scheme, it is advantageous to implement the logic tree using just AND and OR gates. This minimises complexity, latency, and the area consumed by the binary logic circuit.
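A combination of two one-hot encoded modulo results using only AND and OR operations can be sketched as follows, based on the selection of (a, b) pairs described earlier for the ith bit of the encoding of (a*2^k+b) mod d. The list representation (index r holds the bit for result r) and the function names are assumptions of this sketch.

```python
def to_one_hot(value: int, d: int) -> list:
    """One-hot list in which index r is 1 when value mod d equals r."""
    return [1 if r == value % d else 0 for r in range(d)]


def combine_one_hot(a_hot: list, b_hot: list, k: int, d: int) -> list:
    """One-hot encoding of (a * 2**k + b) mod d from the encodings of a and b."""
    # r is the value satisfying (r * 2**k) mod d == 1; it is fixed at design time.
    r = next(c for c in range(d) if (c * (1 << k)) % d == 1 % d)
    out = []
    for i in range(d):
        bit = 0
        for j in range(d):
            # AND one bit of each encoding, OR the d products together.
            bit |= a_hot[(r * (i - j)) % d] & b_hot[j]
        out.append(bit)
    return out


if __name__ == "__main__":
    d, k = 3, 2
    for a in range(d):
        for b in range(d):
            expected = to_one_hot(a * (1 << k) + b, d)
            assert combine_one_hot(to_one_hot(a, d), to_one_hot(b, d), k, d) == expected
```

Because the index r*(i−j) mod d is a constant for each (i, j), each output bit of a hardware implementation would reduce to d two-input AND terms feeding one OR.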
Furthermore, the structure of logic tree 814 may be readily optimised to minimise delay on the critical path or to minimise chip area consumed by the circuit. Given r inputs to the tree, in a first example the logic tree may be structured to have a depth of order log(r) with r*log(r) combinations in the tree. This structure minimises critical delay. In a second example, the tree may be structured to have a depth of order r with r combinations in the tree. This structure minimises circuit area. Intermediate structures between these exemplary possibilities exist which may provide a desired balance of latency and circuit area.
The logic tree 814 may have a structure analogous to, for example, a carry look-ahead adder (which would be a tree of depth log(r)), or a ripple carry adder (which would be a tree of depth r).
It will typically be necessary to decode the one-hot encoded slices of div so as to form a binary representation. This may be achieved by use of a suitable lookup table, e.g. logic which for each possible input provides a predefined output.
In one example, the decode from a one-hot modulo encoding of x may be calculated as follows. In this example, the input x has been sliced up into k slices in the following way:
x[i_(k−1)−1:i_(k−2)], x[i_(k−2)−1:i_(k−3)], . . . , x[i_1−1:i_0] where i_(k−1)=w and i_0=0
The output div may be calculated according to the same sliced structure. For example, div[i_j−1:i_(j−1)] can be calculated from x[w−1:i_j] mod d and x[i_j−1:i_(j−1)]. x[w−1:i_j] mod d can be calculated by using the one-hot encoding and combining the more significant bit slices x[i_(k−1)−1:i_(k−2)], . . . , x[i_(j+1)−1:i_j] in the manner described above.
The output div is given by:
which follows from a general application of equations 5 and 6 above because:
Using the one-hot encoding of x[w−1:ij] mod d and the bit slice x[ij−1:ij−1] as inputs either a look-up table or explicit logic function (feasible depending on the number of table inputs), which returns the values of 0 through to 2i
This can be done similarly for all bit slices, except the most significant div[ik−1:ik−2] which is solely a function of the bit slice x[ik−1:ik−2] and the modulo value can be assumed to be zero in the decoding lookup table.
If the bit slices are small enough, it is typically straightforward for someone skilled in the art to design logic by hand to perform this decode from one-hot to binary efficiently, and the decode (and also the initial one-hot encode) logic will be cheap. However, the trade-off is that the smaller the bit slices, the more of them there are to encode/decode and combine using the combination logic (trees with more initial nodes are needed).
It is possible to split up the input x into portions which differ in length by an amount different to n because, given that i is known at design time along with the one-hot encodings of a and b, the amount of logic required to calculate the one-hot encoding of (a*2^i+b)mod d is independent of the value of i and so does not need to be restricted to a multiple of n for d=2^n±1.
This enables the amount of combination logic to be reduced by increasing the width of the initially encoded binary portion of x, since this will mean there are fewer one-hot values to combine. However, this will be at the expense of a more complex one-hot encoding of the results of the modulo operations and consequently of the decoding of the slices of div (although the encode and decode steps can be performed using look-up tables).
Alternatively, the width of the binary portions of x may be decreased, but this means there are more one-hot encodings to combine and more modulo values to compute to calculate this increased number of bit slices of div—in other words, an increased amount of combination logic is required.
For example, in the case that d=3 the generated bit slices of the output div could be of width 1 bit rather than n=2 bits. This makes the initial one-hot encoding straightforward since only a NOT gate is required: i.e. 0 becomes 001, 1 becomes 010, and so generally x[i] becomes the three bits {0, x[i], NOT(x[i])}. The decode of the bit slices is similarly straightforward because equation 7 may be rewritten in this d=3 case as:
div[i]=(2*(x[w−1:i+1]mod 3)+x[i]≥3)?1:0
Suppose the one-hot encoding for x[w−1:i+1]mod 3 is e(i+1)[2:0]; then we have div[i]=e(i+1)[2] OR (e(i+1)[1] AND x[i]), which may be implemented using a simple AND-OR logic gate.
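The 1-bit slice case for d=3 can be modelled in software as follows. This is a sketch only: it walks the bits of x from most to least significant in a loop (corresponding to a linear rather than a tree arrangement of the combination logic), the list e holds the one-hot state with index r set when the bits above the current position are congruent to r modulo 3, and the function name is hypothetical.

```python
def divide_by_3_one_bit_slices(x: int, w: int) -> int:
    """floor(x / 3) for a w-bit unsigned x using a one-hot mod-3 state."""
    e = [1, 0, 0]  # no bits consumed yet, so the modulo value is 0
    div = 0
    for i in range(w - 1, -1, -1):
        xi = (x >> i) & 1
        nxi = 1 - xi
        # Output bit: e[2] OR (e[1] AND x[i]).
        div |= (e[2] | (e[1] & xi)) << i
        # New state encodes (2*r + x[i]) mod 3 using only AND and OR terms.
        e = [
            (e[0] & nxi) | (e[1] & xi),  # result 0
            (e[2] & nxi) | (e[0] & xi),  # result 1
            (e[1] & nxi) | (e[2] & xi),  # result 2
        ]
    return div


if __name__ == "__main__":
    assert all(divide_by_3_one_bit_slices(x, 16) == x // 3 for x in range(1 << 16))
```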
The cost of decreasing the width of the binary portions of x is that the combination logic tree is larger because it now has twice as many inputs. However, the encode/decode logic can be smaller. The optimum binary portion width in terms of the size and/or speed of the resulting circuit can be determined at design time through appropriate modelling.
Most generally the bit slices do not need to be all of the same length. This can be accommodated through appropriate modification of the combination logic at the logic tree 814.
In other examples, encoding schemes other than one-hot encoding may be used.
Signed Case
It will be appreciated that the one-hot binary logic circuit shown in
Where onehot(x[i]) expresses the one-hot signals {“001”,“010”,“100”} as the numbers {0,1,2} respectively such that x[i]=0,1 for i=0, . . . 6 and x[7]=0,1 (where the value that x[7] represents is −1*x[7] since it is the sign bit) and the difference between i=0, . . . 6 and i=7 is 0, −1 mod 3=0,2. Performing division by 3 using slices of length 1 yields a signed 7-bit div output:
div[i]=onehot(x[7:i +1]mod 3)[2] OR(onehot(x[7:i +1]mod 3)[1] AND x[i]) for i=0, . . . , 6
Note that the sign bit of the output div[6] logic simplifies to:
This is the sign bit of the input for a round to negative infinity (RNI) rounding. A negative input always gives a negative output.
In general, it is possible to “sign-extend” the input by a bit, so the sign-bit is now 1-bit more significant (e.g. x[n] rather than at x[n−1]). This results in the sign-bit no longer being part of the slicing operations and it can be handled separately. However, the modulo value of the most significant bit slice must still take into account the influence of the sign-bit on the modulo value: in other words, x[n−1:ik] mod d should still be encoded as though x[n−1] is the sign-bit.
The most significant bit-slice may therefore be evaluated according to a slightly altered equation:
Where x[n] here is interpreted as having a value of either −1 or 0 (rather than 1 or 0).
In this equation there is now a potentially non-zero mod term which isn't present in the unsigned equation:
The output sign-bit should always match the input sign bit for RNI rounding. Unsigned division could also be treated in this manner, but the sign bit x[n] is always trivially zero, so the mod term in the above equation never occurs.
Full adder reduction
A second exemplary implementation of the binary logic circuit is shown in
A full adder 100 for evaluating y=x mod(2^m−1) for a given value of m and an input value x is shown in
Reduction logic 101 operates on a binary input value x which in the present examples will be a bit selection x[j:k] from the input variable x shown in
While the range of x is [0,2^v−1], the range of x′ is [0,k*(2^m−1)] where k is the number of rows of x′ and at 402 is less than or equal to
Consider a simple example of a 12 bit number x=110101101111. This number may be expressed in the form x′ as a sum of consecutive m-bit portions of x as follows:
The one or more reduction cells of reduction logic 101 may be one or more full adders arranged to reduce the rows of x′ down to a sum of two rows of length m. A full adder receives two one-bit values and a carry bit as its inputs and outputs the sum of those bit values along with a carry bit. A full adder can therefore be used to sum the bits of a pair of rows of x′ so as to compress those two rows of m bits into a single row of m bits and a carry bit. As is known in the art, this can be achieved by using a cascade of m full adders or by using fewer than m full adders and iteratively operating one or more of those full adders on the output of previous full adders.
Other types of reduction cells could alternatively be used, such as half adders. It will be appreciated that there are a large number of possible adder designs which can be used to reduce a sum of a plurality of m-bit binary numbers to a sum of two m-bit binary numbers. Any suitable adder design could be used in the reduction logic to reduce the range of x in accordance with the principles described herein.
The reduction logic 101 may comprise one or more reduction cells. In general, any kind of reduction cell able to reduce a binary sum of s rows down to a binary sum of t rows (a s to t reduction cell) may be used. The one or more reduction cells are configured so as to provide a pair of rows x′ as the output of the reduction logic. Multiple reduction cells may be arranged in series or in parallel. In accordance with the teaching below, the reduction logic is configured to, following each reduction step, wrap-around carry bits at bit position m+1 to the first bit position.
The reduction logic 101 of the full adder operates until the rows of x′ have been reduced to two rows, at which point x′ lies in the range [0,2*(2^m−1)]. These two rows of length m of x′ are referred to as β and γ.
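The expression of x as a sum of m-bit rows can be illustrated with a short sketch. The helper name split_rows and the row length m=4 chosen for the 12-bit example value are assumptions made for illustration only; the sketch relies solely on the fact that 2^m is congruent to 1 modulo 2^m−1.

```python
def split_rows(x, v, m):
    """Split a v-bit unsigned integer into consecutive m-bit rows, least significant first.

    The final row may hold fewer than m significant bits; its empty bits are zero.
    """
    mask = (1 << m) - 1
    n_rows = -(-v // m)  # ceil(v / m)
    return [(x >> (k * m)) & mask for k in range(n_rows)]


x = 0b110101101111               # the 12-bit example value
rows = split_rows(x, v=12, m=4)  # m = 4 is an assumed row length
x_prime = sum(rows)              # x' as the sum of the rows

# Summing the rows preserves the residue modulo 2**m - 1.
assert x_prime % (2**4 - 1) == x % (2**4 - 1)
```

Under these assumptions the rows are 1111, 0110 and 1101, and both x and the row sum leave the same remainder on division by 15.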
An advantageous form of reduction cell 302 will now be described which provides high speed compression of x′. Each reduction cell comprises m full adders configured to operate on three rows of x′ each of length m. Each full adder operates on a column of the corresponding bits of each row so as to compress the three rows into two rows of length m. The operation of the reduction cell is illustrated schematically in
Prior to making use of the pair of output rows of a reduction cell, its carry bit 1009 which exists (logically at least) in the m+1 column/bit position is wrapped around to the first column/bit position. This is acceptable because 2^m mod(2^m−1)=2^0 and ensures that the rows of x′ remain aligned and of length m bits. The wrap-around of carry bits is described in more detail below with respect to
By operating a reduction cell comprising m full adders on the columns of a set of three rows of x′ in the manner shown in
rows of length m, plus potentially a row of length less than m. Empty bits in any rows of less than m can be set to 0.
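A minimal sketch of a single reduction cell is given below, modelling the m column-wise full adders with bitwise operations on integers. The function name reduction_cell is hypothetical and the model is purely logical; it is not intended to represent any particular hardware arrangement.

```python
def reduction_cell(r0, r1, r2, m):
    """Compress three m-bit rows into two m-bit rows using m column-wise full adders.

    Returns (sum_row, carry_row). The carry out of the top column is wrapped
    around to bit 0, which preserves the value modulo 2**m - 1.
    """
    mask = (1 << m) - 1
    sum_row = r0 ^ r1 ^ r2                      # sum bit of each column
    carry = (r0 & r1) | (r1 & r2) | (r0 & r2)   # carry bit of each column
    # Carries move up one column; the carry leaving column m-1 re-enters at column 0.
    carry_row = ((carry << 1) & mask) | (carry >> (m - 1))
    return sum_row, carry_row
```

For any three m-bit rows, the two output rows sum to a value congruent to r0+r1+r2 modulo 2^m−1, and both output rows remain m bits long.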
For a binary input of length v,
reduction cells may be provided so as to reduce the number of rows of x′ by around a third.
represents the initial number of rows of x′, which may include a row of length less than m. When v is an integer multiple of m, the number of reduction cells is
As the number of rows of x′ becomes smaller, the number of reduction cells also becomes smaller.
In order to reduce the number of rows of x′ down to two, a set of reduction cells at the reduction logic may be configured to operate iteratively on x′ until the number of rows of x′ reaches two. For example, reduction logic comprising
reduction cells may be configured to iteratively perform a series of reduction steps on the rows of x′, with fewer and fewer reduction cells being required at each reduction step, until only two rows remain. This can be achieved through the use of sequential logic and a clock signal to schedule the outputs of the reduction cells for the previous reduction step into the inputs of the reduction cells for the next reduction step. However, such a configuration would typically allow only one reduction step (iteration) per clock cycle.
It is preferable that the reduction logic comprises multiple stages of reduction cells arranged in series with each stage of reduction cells receiving its input from the output of the previous stage. The reduction cells of each stage may be configured to operate in parallel. As many stages of reduction cells are provided as are required to reduce an input x down to a sum of binary values of length m in a single operation without iteration. This arrangement is shown for reduction logic 101 in
Each reduction cell 304 comprises a set of full adders as shown in
The first reduction stage 301 comprises
reduction cells 304 each having m full adders arranged to operate in parallel on a set of three rows in the manner shown in
A second reduction stage (e.g. 302) is arranged to operate on the output of the first reduction stage and comprises a number of reduction cells appropriate to the number of rows provided at the output of the first reduction stage. For example, if the number of output rows from the first stage is b then the second reduction stage comprises └b/3┘ reduction cells 304. A sufficient number of further reduction stages are arranged in series in this manner until the output of a final reduction stage 303 includes only two rows. The final reduction stage 303 comprises a single reduction cell 304 which is configured to operate on the three output rows of the preceding reduction stage.
In this exemplary configuration, the total number of full adders present in the reduction logic will be
full adders. It will be appreciated that where a row has fewer than m bits, some of the inputs to the full adders will be zero. Such full adders could be considered to be half adders in which case there will be
full adders and (−v)mod m half adders. The configuration described represents reduction logic having the minimum number of reduction stages.
Reduction logic configured in this manner with a series of reduction stages each comprising one or more reduction cells operating in parallel on the rows of x′ would typically be able to perform the compression of x down to two rows of x′ of length m in a single clock cycle of the digital platform on which the reduction logic is running. The use of serial reduction stages therefore offers a high speed configuration for reducing an input x to a sum of two rows β+γ which satisfy:
x mod(2^m−1)=(β+γ) mod(2^m−1) (15)
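The staged reduction can be sketched as a loop in which each iteration stands for one reduction stage of 3:2 cells operating in parallel. This is an illustrative software model only: the function name reduce_to_two_rows is hypothetical, and the policy of passing any rows not consumed by a cell through to the next stage is one possible arrangement among several.

```python
def reduce_to_two_rows(x, v, m):
    """Reduce a v-bit input to two m-bit rows (beta, gamma) in successive stages.

    Returns (beta, gamma, stages). Each stage applies floor(rows / 3) cells;
    rows not consumed by a cell pass through to the next stage unchanged.
    """
    mask = (1 << m) - 1
    rows = [(x >> k) & mask for k in range(0, v, m)]
    while len(rows) < 2:
        rows.append(0)
    stages = 0
    while len(rows) > 2:
        nxt = []
        full = len(rows) - len(rows) % 3   # rows that can be grouped in threes
        for k in range(0, full, 3):
            a, b, c = rows[k:k + 3]
            carry = (a & b) | (b & c) | (a & c)
            nxt.append(a ^ b ^ c)                                   # sum row
            nxt.append(((carry << 1) & mask) | (carry >> (m - 1)))  # wrapped carry row
        nxt.extend(rows[full:])
        rows = nxt
        stages += 1
    return rows[0], rows[1], stages
```

Whatever the arrangement of cells, the two rows returned satisfy equation 15, since every cell preserves the sum of its rows modulo 2^m−1.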
As an example, consider an input x of length v=48 for the case m=5. For the preferred case, the first stage of the reduction logic comprises
reduction cells for operation on the initial set of 10 rows of x′, leaving a short row of 3 bits unallocated to a reduction cell. Each reduction cell operates in the manner illustrated in
It will be appreciated that full adders may be arranged in reduction cells in various other configurations. Because each reduction cell consumes three rows, the number of rows that a stage can accept as inputs is an integer multiple of 3, so it is not always possible for a reduction stage to operate on all of the available rows. There are typically multiple ways of arranging the full adders within the reduction logic, whilst still achieving the same number of reduction stages. This freedom allows designers to, for example, optimise the reduction logic so as to minimise its area/delay/power when processed into a physical logic circuit.
Many other configurations of reduction logic are possible for compressing an input x down to two rows of length m. The reduction logic could comprise reduction cells other than full adders, such as ripple-carry adders which can be used to reduce two rows down to one row. However, it is preferable not to use ripple-carry adders configured to add pairs of rows in parallel implementations because the carry propagation of ripple-carry adders results in relatively slow performance compared to other types of reduction cell.
The output of the first reduction step performed by the reduction logic on x′ is illustrated at 406 in
The reduction performed by the first reduction step generates carry bits 404 and 405. As described above, any carry bits generated at the mth bit position by a reduction step (e.g. 404, 405) are wrapped-around to the first, least significant bit position (e.g. 407, 408) as shown at 406 in
In the example described above, each reduction step reduces the number of rows of x′ by around a third. In other examples in which other types or arrangements of reduction cell are used, the number of rows may be differently reduced at each reduction step—for example, arrangements of reduction cells may be used which reduce 4 rows to 3, or 7 rows to 3. Such arrangements may generate more than one carry bit which is to be wrapped-around to empty least significant bit positions in accordance with the principles described above.
In the case that v mod m≠0, then in the initial expression of x′ 402 there will always exist a row with a 0 bit for every possible input value of x. If a 0 is one of the three inputs to a full adder, then one of the two outputs must also be a 0, since only if each input is 1 is each output 1. Hence at least one bit of one of the rows of x′ will be 0 after every reduction step performed by reduction logic 101. Since x′ lies in the range [0,2*(2^m−1)], it follows that only in the case when v mod m=0 and x=2^v−1 (i.e. all v input bits are 1) does x′ attain its maximum value of 2*(2^m−1) in which all the bits in the rows of x′ remain 1. This point is relevant to the discussion below in which optional exception logic 102 (used in the case that v mod m=0) is provided in order to reduce the critical path delay at addition logic 104.
It is to be noted that
The usefulness of expressing a v-bit input x as a sum x′ of two m-bit numbers β and γ for the purpose of calculating y=x mod(2^m−1) will now be demonstrated.
A representation of a binary sum for calculating y=x mod(2m−1) is shown in
The significance of this calculation will now be explained. Note that columns 601 and 602 are merely schematic and need not represent independent sums.
In the case that v mod m=0 and x=2^v−1 (all of the digits of x are 1), the value x′=β+γ=2^(m+1)−2. This case may be handled separately at exception logic in the manner described below. For all inputs of x when v mod m≠0, and for all inputs of x when v mod m=0 except the above noted case when x=2^v−1, the value x′=β+γ lies in the range [0,2^(m+1)−3]. Consider a first part of that range in which (β+γ)ϵ[0,2^m−2]. It follows from this possible range of values of β+γ that:
β+γ=((β+γ) mod(2^m−1))=(x mod(2^m−1)) (16)
In other words, y is in this case equivalent to the sum β+γ. This is because the sum β+γ+1 in column 601 does not generate a carry bit since 0≤β+γ+1<2^m. The output 603 of the binary sum shown in
Now consider a second part of the range of x′ in which (β+γ)ϵ[2^m−1,2^(m+1)−3]. In this case the sum β+γ+1 in column 601 does generate a carry bit in the (m+1)th column because 2^m≤β+γ+1<2*2^m. It follows that:
2^m−1≤β+γ<2*(2^m−1) (17)
and so:
((β+γ+1) mod 2^m)=(β+γ+1)−2^m=(β+γ)−(2^m−1)=((β+γ) mod(2^m−1)) (18)
For the complete range (β+γ)ϵ[0,2^(m+1)−3] we have that:
(β+γ) mod(2^m−1)=(β+γ+1) mod 2^m if β+γ+1≥2^m (19)
and otherwise:
(β+γ) mod(2^m−1)=(β+γ) mod 2^m (20)
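Equations 19 and 20 amount to forming both β+γ and β+γ+1 and using the carry out of the latter to select between them. A minimal sketch of that selection follows; the function name end_around_add is hypothetical and the model ignores how the two sums are shared in hardware.

```python
def end_around_add(beta, gamma, m):
    """Return (beta + gamma) mod (2**m - 1) for beta + gamma in [0, 2**(m+1) - 3]."""
    mask = (1 << m) - 1
    s0 = beta + gamma        # sum without the extra 1 (equation 20)
    s1 = beta + gamma + 1    # sum with the extra 1 (equation 19)
    carry = s1 >> m          # carry out of the m-bit sum beta + gamma + 1
    return (s1 if carry else s0) & mask
```

For the excluded input in which β and γ are both all ones, this sketch returns 2^m−1 rather than 0, which illustrates why that single case is treated separately.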
It will be appreciated from the above that the sum shown in
In other words, the output y=x mod(2^m−1) is given by the bit selection 603 equivalent to taking m bits of the result of the sum shown in
The sum and bit selection shown in
In the exemplary full adder shown in
The adder 105 in
In the case when v mod m=0 and x=2^v−1, β+γ=2^(m+1)−2, which lies outside the range [0,2^(m+1)−3]. In this case a multiplier array configured to calculate the sum shown in
For example, returning to the exemplary full adders shown in
An example configuration of the exception logic 102 for use in the case when v mod m=0 is shown in
It will be appreciated that in order to represent an input binary integer x as a sum of rows of m-bit binary integers x′ it is not necessary to physically re-order the bits of the input binary integer. Full adders configured to operate on the rows of x′ may logically interpret the bits of x as being represented as a sum of rows of m-bit binary integers x′ and process them as such without any physical reconfiguration of x (e.g. actually splitting up x into m-bit rows in hardware is not required). This is generally true for the input and output values of the elements of full adders described herein: any binary values may be physically manifest in any form; the teaching herein shall be understood to explain the logical operation of full adders and is not intended to limit the possible physical representations of binary values in which binary values are stored, cached or otherwise represented (e.g. at registers or memory of a binary circuit).
A full adder provides a low latency and area-efficient solution for calculating y=x mod(2m−1) in binary logic. It is therefore advantageous to make use of full adders 905 to perform modulo operations of the form x[j,k]mod(2n±1) in the binary logic circuit 900 in
In order to form the output div 513, there are
different modulo values to calculate.
In some implementations, a plurality of full adders 905 may be provided at the slice logic 510 such that a full adder exists for each different modulo operation that is to be performed. This enables the modulo operations to be performed at least partly in parallel at the slice logic. Since there will typically be some repetition of modulo operations, it is advantageous however to at design time configure the binary logic circuit so as to include a shared full adder for each different modulo operation and to make use of the outputs from those full adders at the points required in the full adder tree 905. This approach still allows modulo operations to be performed in parallel but avoids duplication of logic and hence saves on chip area. The tree of full adders is configured such that the outputs of the modulo operations performed by the full adders are combined in accordance with equations 7 and 8 or 9 and 10 to form the bit slices of the output 513.
In other implementations not shown in the figures, a single full adder may be provided at the slice logic 510 to sequentially perform the required modulo operations and generate the bit slices of the output div 513 slice-by-slice. In such implementations, the slice logic may comprise state logic to process the results of the modulo operations into bit slices of output 513 and to cause the full adder to sequentially receive the correct bit selections of input x.
The structure of the tree of full adders 905 may be optimised to minimise delay on the critical path and/or to minimise chip area consumed by the circuit. This may be achieved, for example, through appropriate selection of the number and interconnection of full adders so as to control factors such as the degree of parallelisation and sharing of the full adders at the tree 905. As for the one-hot encoding example, the full adder tree could be configured to have a depth which is logarithmic or linear with respect to its number of inputs, or somewhere between the two.
The use of full adders has a further advantage in that many of the full adder stages (see 301-303 in
Each full adder row may be identical to each other full adder row in the logic tree. Each full adder in a full adder row may be identical to each other full adder in that row, apart from the full adder operating on the most significant bit whose carry bit wraps around to the least significant output.
Full adder logic trees which are linear in depth tend to be smaller but can suffer from higher latency than logarithmic circuit designs. Logic trees may have a hybrid structure that is intermediate in depth between linear and logarithmic depth trees.
The reduction cells of a full adder may be arranged in a tree having a depth of linear order as follows. A reduction cell is provided to reduce the most significant 3 n-bit rows of the relevant bit portion of x to 2 n-bit rows (with the carry bit wrapping around in the manner described above). The next most significant n-bit row of the bit portion of x may then be combined with these 2 n-bit rows to produce 2 new n-bit rows. These 2 rows are then combined with the next most significant n-bit row of the bit portion of x, and so on until the least significant row has entered a full adder producing the final 2 n-bit rows. The sum of the final 2 n-bit rows is congruent, modulo 2^n−1, to the result of the modulo operation of the form x[j,k] mod(2^n−1) which the full adder is configured to perform.
It follows that, for r rows, (r−2)*n full adders are required to produce all of the required modulo values in order to calculate all the slices of div. The (r−2)*n full adders remove (r−2)*n bits, reducing the initial r*n bits of the input (r n-bit rows) down to 2 n-bit rows. This arrangement of full adders is analogous to the carry tree used in a ‘ripple carry adder’.
Possible logical arrangements of the reduction cells of a full adder having a depth of order log(r) are shown in
Each full adder is marked in the figure by the number of rows of n-bits in the input bit portion. Each of the inputs to the full adders therefore differs by n-bits in accordance with equation 7. Each node in the figure represents an n-bit row, with the rows carrying the most significant bits at the top and the least significant bits at the bottom of the vertical representations of the rows. The most significant rows may be reduced first. Each full adder example in
In
The result of the modulo operation x[j,k] mod(2^n−1) is the sum of the outputs of the full adders each configured to reduce a bit selection of x for a given i in equation 7. In other words, if the longest bit selection is of length r, then the reductions of the bit selections ranging from length r down to 1 are summed together to give the output of the modulo operation. For example, if n=2 and the longest bit selection from x is 22 bits, then the greatest number of rows to be reduced at a full adder is 22/2=11 and the output of the modulo operation may be calculated by summing the outputs of the 11 full adders shown in the example of
Many of the reduction cells belonging to different full adders can be shared, with the total number of reduction cells being required to produce all the necessary modulo signals to calculate the division slices being of the order of r*log(r). The reduction cells which can be shared between different full adders because they operate on the same set of input bits of x are shaded in
The full adder tree approach to performing division according to the principles described herein of reducing the division operation to a set of modulo calculations may be further extended to division of the form:
for integers p, r and odd integer q (unrelated to the p, q, r used above). This is because there will exist integers a and b where:
a*q=2^b−1
such that:
where z=(a*p)*x+(a*r). Replacing x with z and n with b in equations 7 and 8 or 9 and 10 above therefore enables more complex division to be similarly reduced to a set of modulo operations and hence benefit from analogous improvements in speed and chip area consumption. Using the principles described herein in such a general case is especially useful in saving chip area in comparison to conventional multiply-add schemes which tend to produce arrays having excessive repetition of logic operations.
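The existence of a and b can be checked numerically. The sketch below assumes that the division in question has the form (p*x+r)/q with rounding towards negative infinity, which is an assumption about the expressions not reproduced above; the function names are hypothetical. It finds the smallest b for which 2^b−1 is a multiple of q and confirms that dividing z by 2^b−1 gives the same result as dividing p*x+r by q.

```python
def odd_divisor_as_mersenne(q):
    """Return (a, b) with a * q == 2**b - 1 for a positive odd integer q."""
    if q <= 0 or q % 2 == 0:
        raise ValueError("q must be a positive odd integer")
    b, power = 1, 2 % q
    while power != 1 % q:          # search for the smallest b with 2**b = 1 (mod q)
        power = (power * 2) % q
        b += 1
    return (2**b - 1) // q, b


def general_divide(x, p, r, q):
    """Floor of (p*x + r) / q, computed as z / (2**b - 1) with z = a*p*x + a*r."""
    a, b = odd_divisor_as_mersenne(q)
    z = a * p * x + a * r
    return z // (2**b - 1)         # stands in for a divide-by-(2**b - 1) circuit
```

For example, q=21 gives a=3 and b=6, since 3*21=63=2^6−1.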
It will be appreciated that the full adder tree approach may be extended to divisors of the form d=2^n+1, as well as to division operations on signed inputs x.
In the case d=2^n−1 when the input x is signed, it is sufficient to account for the negative weight of the most significant bit. This can be achieved by calculating (−2^(w−1) mod(2^n−1))=2^n−1−(2^(w−1) mod(2^n−1))=2^n−2^((w−1) mod n)−1 rather than (2^(w−1) mod(2^n−1))=2^((w−1) mod n), where −2^(w−1) is the value of the sign bit, which is the most significant bit in the signed input x. This is equivalent to left-appending the sign bit to the most significant n-bit row (e.g. if the sign bit is 0, fill the remaining significant bits in the n-bit row with zeros; if the sign bit is 1, fill the remaining significant bits with ones) and, subsequent to the reduction operation on the group of most significant rows, decrementing the output of that reduction by 1 to form a modified output for use in subsequent reduction stages. This decrement increases the delay/area compared to the unsigned case but is an efficient way of handling the signed case and no exception logic is required, even for the case w mod n=0. This approach ensures that the appropriate sign bit is provided as the most significant bit of the output div.
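The residue of the sign-bit weight can be checked with a short sketch. It assumes n is at least 2 and uses the non-negative residue convention of Python's % operator; the function names are hypothetical and the sketch models only the arithmetic identity, not the row-filling and decrement described above.

```python
def sign_bit_residue(w, n):
    """Residue of the sign-bit weight -2**(w-1) modulo 2**n - 1, for n >= 2."""
    return (1 << n) - 1 - (1 << ((w - 1) % n))


def signed_mod(bits, w, n):
    """x mod (2**n - 1) for a w-bit two's complement x given as its raw bit pattern."""
    d = (1 << n) - 1
    low = bits & ((1 << (w - 1)) - 1)   # bits below the sign bit carry positive weight
    sign = (bits >> (w - 1)) & 1
    return (low + sign * sign_bit_residue(w, n)) % d
```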
The decrementing of each row by the sign bit may be deferred to the modulo slice or div slice calculation stage. This approach can introduce the overhead that the decrement must be performed for each slice and hence the slice logic may become more complex. However, the timing characteristics of the slice logic may potentially be improved since the decrement could be merged in with the addition(s) required to calculate each slice of div.
An alternative approach to handling signed inputs may be to treat the input x as unsigned and to negate the value of the sign bit from the calculated div. For example, the value of the sign-bit
may be negated from div and (if present) 2w mod d may be subtracted from the modulo output.
For unsigned x for the case w=64 and d=3,
For unsigned x for the case w=64 and d=255,
For unsigned x for the case w=48 and d=21,
d=2^n+1 case
A variant of the full adder reduction explained above with reference to
and modulo[n:0]=x mod(2^n+1).
For simplicity in the following explanation, m is arranged to be a multiple of n with m=t*n and t>2. Thus
This can be achieved by appending ‘0’s to the left of x appropriately. It will be appreciated that in practice a binary logic circuit need not append such zeros and may be configured to deal with bit strings of differing lengths in any suitable manner. The cases in which t=1,2 may be trivially derived from the more general case described below.
The following approach may be taken to performing division by d=2n+1 at slice logic 510. Note that the values set out below are logical values and neither the values themselves nor the described arrangements of the values need physically exist at the full adders of a suitably adapted full adder tree.
The principle is to calculate all values of the following form for i=0, . . . , t−1:
partial_mod(m−1:i*n)=x[m−1:i*n] mod(2^n+1)
These values represent the output of modulo calculations performed on part of x and which can be combined to form the output div and modulo values. These values may be calculated at groups of one or more full adders in a full adder tree 905 at slice logic 510 of a binary logic circuit. Each full adder could be, for example, a full adder 100 or 200 as described above with respect to
For the case i=t−1:
partial_mod(m−1:(t−1)*n)=x[m−1:(t−1)*n]
since
x[m−1:(t−1)*n]ϵ[0,2^n−1]
and so
x[m−1:(t−1)*n]=x[m−1:(t−1)*n] mod(2^n+1) trivially.
For the case i=t−2:
This subtraction, modulo 2^n+1, is equal to partial_mod(m−1:(t−2)*n). The subtraction value can be anywhere in the range [−2^n+1,2^n−1]. If the subtraction is 0 or positive, then:
x[(t−1)*n−1:(t−2)*n]−x[m−1:(t−1)*n]=partial_mod(m−1:(t−2)*n).
If the subtraction is negative then:
x[(t−1)*n−1:(t−2)*n]−x[m−1:(t−1)*n]+(2^n+1)=partial_mod(m−1:(t−2)*n).
Both x[(t−1)*n−1:(t−2)*n]−x[m−1:(t−1)*n] and x[(t−1)*n−1:(t−2)*n]−x[m−1:(t−1)*n]+(2^n+1) can be advantageously calculated concurrently according to the principles of a compound adder as described above with reference to
For 0≤i<(t−2):
These calculations can be done efficiently using a full adder tree 905 configured to reduce rows of bits of x at independent ‘signed’ full adders down to 2 n-bit rows. The configuration and ‘signing’ of full adders is described below.
In a similar manner to that shown in
x[m−1:(t−1)*n],x[(t−1)*n−1:(t−2)*n], . . . ,x[2*n−1:n],x[n−1:0]
Each of the t groups of bits logically represents a row in a full adder of the full adder tree 905. Each full adder receives three rows for reduction down to two rows. In the present example, the rows are arranged in order of the significance of the bits comprised in each row—e.g. one can think of the row comprising the most significant bit values at the top, followed by the second most significant and so on until the least significant. This is the same arrangement as described in relation to
A “signed” full adder described herein operates on 3 adjacent rows so as to reduce those rows down to 2 rows in the general manner shown in
It is advantageous to fix the sign of the most significant row of bits of x across each of the partial_mod calculations performed on the bit selections x[m−1:i*n]. This straightforwardly allows the outputs of the full adders to be shared because the same reductions are being performed in respect of different partial_mod calculations.
The set of rows of alternating sign representing x can be reduced to 2 rows of different signs (the final pair of rows are always positive and negative) using n independent ‘signed’ full adders. A simple modification of the logic of the full adder tree 905 is required to perform the logical negation of each carry bit.
The alternating sum modulo (2^n+1) evaluated by a full adder is equal to partial_mod(m−1:i*n).
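The alternating-sign property follows from 2^n being congruent to −1 modulo 2^n+1, and can be illustrated in software as follows. For simplicity this sketch fixes the positive sign at the least significant row of each bit selection, which is an assumption of the sketch and differs from the arrangement in which the sign of the most significant row is fixed; the two differ only by an overall negation that is absorbed by the final modulo. The function name partial_mod mirrors the notation used above.

```python
def partial_mod(x, m, i, n):
    """x[m-1 : i*n] mod (2**n + 1), computed as an alternating sum of n-bit rows."""
    d = (1 << n) + 1
    mask = (1 << n) - 1
    sel = (x & ((1 << m) - 1)) >> (i * n)   # bit selection x[m-1 : i*n]
    total, sign = 0, 1
    while sel:
        total += sign * (sel & mask)        # rows alternate in sign
        sign = -sign
        sel >>= n
    return total % d                        # bring the signed total into [0, 2**n]
```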
As each ‘signed’ full adder operates on the bit columns (see
sum=a XOR b XOR c
But the carry bit is given by:
carry=(a AND b) OR (b AND not(c)) OR (not(c) AND a)
This differs from a non-signed full adder in that the input c is logically negated in the carry output.
Thus a full adder having logic to add together the values of n bit columns can be used to convert 3 rows with sign +,−,+ to a pair of bit rows of sign +,− with a positive carry-bit in the nth column and a gap in the positive row in the 0th column, or to convert 3 rows with signs −,+,− to a pair of bit rows of sign −,+ with a negative carry in the nth column and a gap in the negative row in the 0th column.
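The behaviour of a single 'signed' full adder bit can be checked exhaustively. The sketch below assumes that c denotes the input bit belonging to the row whose sign differs from the other two; under that assumption the sum bit takes the sign of c and the carry bit takes the opposite sign, with twice the weight. The function name is hypothetical.

```python
def signed_full_adder(a, b, c):
    """One-bit 'signed' full adder; c is assumed to be the input of differing sign.

    Returns (sum_bit, carry_bit) such that a + b - c == 2*carry_bit - sum_bit.
    """
    sum_bit = a ^ b ^ c
    not_c = 1 - c
    carry_bit = (a & b) | (b & not_c) | (not_c & a)   # c is negated in the carry only
    return sum_bit, carry_bit


# Exhaustive check of the identity over all eight input combinations.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, k = signed_full_adder(a, b, c)
            assert a + b - c == 2 * k - s
```

Negating both sides of the identity gives the case of two negative rows and one positive row, in which the carry is negative and the sum is positive.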
The structure of the ‘signed’ full adder tree may be identical to the d=2^n−1 case, where it is possible to have a logarithmic depth tree with more nodes or a linear depth tree with fewer nodes. For example, the logarithmic depth tree structure shown in
The values of partial_mod(m−1:i*n) for 0≤i<t can be used to calculate (t−1) n-bit division slices of div[(i+1)*n−1:i*n] in the following way:
From above,
and so:
This final ternary statement can be calculated by performing the following sum(i) value (where the ‘&’ symbol stands for concatenation of binary numbers) and then right-shifting the result by n:
This works because (x[(i+1)*n−1:i*n]−partial_mod(m−1:(i+1)*n))ϵ[−2^n,2^n−1] can be represented by a signed (n+1)-bit number. The sign bit is in the nth column (counting the least significant column as 0), so if the value is negative, the sign bit will cause a decrement of partial_mod(m−1:(i+1)*n) in the top n bits, and if the value is positive or zero, the sign bit will have no effect on the top n bits. Thus, right-shifting the bottom n bits away leaves partial_mod(m−1:(i+1)*n)−1 or partial_mod(m−1:(i+1)*n), depending on the sign of (x[(i+1)*n−1:i*n]−partial_mod(m−1:(i+1)*n)), as required.
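The shift-based selection can be sketched as follows. The exact form of sum(i) is not reproduced above, so the form used here (the partial_mod value concatenated with n zeros, plus the signed difference) is an assumption chosen to match the description; the function name div_slice is hypothetical.

```python
def div_slice(x_slice, pm_next, n):
    """n-bit slice of div from an n-bit slice of x and the partial_mod of the bits above it.

    x_slice is x[(i+1)*n-1 : i*n]; pm_next is partial_mod(m-1 : (i+1)*n).
    """
    diff = x_slice - pm_next           # lies in [-2**n, 2**n - 1]
    total = (pm_next << n) + diff      # pm_next & "0...0" (n zeros), plus the signed difference
    return (total >> n) & ((1 << n) - 1)
```

The result equals pm_next−1 when the difference is negative and pm_next otherwise, so no separate comparison is needed.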
Let a[n−1:0] be the final ‘signed’ full adder row reduced positive row and b[n−1:0] be the negative row. We have:
Since both (a−b)ϵ[−2^n+1,2^n−1] and (a−b−1)ϵ[−2^n,2^n−2] can be represented by (n+1)-bit signed numbers, a similar compound adder setup to that explained above with reference to
In this manner, and with limited modification in order to introduce the signing of rows in the full adders, perform logical negation of carry bits and form the above sum, division by 2^n+1 may be performed at the slice logic of a binary logic circuit as described herein with reference to the figures.
A particular example will now be given of the calculations which may be performed by a binary logic circuit comprising a compound full adder tree configured to perform division by d=2^3+1=9 in the manner described above. In this example, x is an unsigned 16-bit integer (x[15:0]) and the outputs of the binary logic circuit are the unsigned integers
Let the input be x=“1001010100110111”=38199. We then expect the outputs div[12:0]=“1000010010100”=4244 and modulo[3:0]=“0011”=3.
Firstly, append 2 constant ‘0’s to the most significant bits of x to make it 18 (divisible by 3) rather than 16 bits in length. This gives us t=18/3=6 rows, which are, starting with the most significant:
x[17:15]=“001”
x[14:12]=“001”
x[11:9]=“010”
x[8:6]=“100”
x[5:3]=“110”
x[2:0]=“111”
Note that a binary logic circuit configured to operate on an input of length 16 may or may not append 2 zeros as has been done here to simplify the example. Typically, and as will be appreciated by a person skilled in the art of binary logic circuit design, the hardware circuit would be designed so as not to require such additional inputs, in order to minimise circuit area and complexity.
Firstly, when i=5 or 4 a full adder is not required since no reduction is necessary.
The case i=5 is the first three bits of x:
partial_mod(17:15)=x[17:15]=“001”
Since modulo 2^3+1 values in general lie in the range [0,8], for simplicity this value may be expressed as a 4-bit value as “0001” for consistency with the other partial_mod values.
Addition logic may be provided at the slice logic to perform the case i=4:
Since “001”−“001”≥0 then no correctional addition of 2^3+1 is required so partial_mod(17:12)=“001”−“001”=“0000”.
‘Signed’ full adders may be used to evaluate partial_mod for i=3,2,1,0:
When i=3:
partial_mod(17:9)=(−“010”+“101”−1)≥0?(−“010”+“101”−1):(−“010”+“101”−1+“1001”)=“0010”
Which as a full adder reduction of 3 rows down to 2 rows can be expressed as:
Which requires one full adder and one stage. Note the addition of −1 because t−i is odd. Note also that the most significant row is allocated a positive sign and this sign is maintained for this row for subsequent i so as to allow subsequent partial_mod calculations to use the outputs of full adders arranged to calculate partial_mod for lower i.
When i=2:
Which requires two full adders and two stages. Each arrow represents a reduction by a full adder of the signed rows. Rows which are not subject to the reduction pass onto the next stage as a remainder row. Note the signs of the two output rows are swapped because t−i is even.
When i=1:
Which requires three full adders and three stages. Note the addition of −1 because t−i is odd.
When i=0:
partial_mod(17:0)=(−“010”+“101”)≥0?(−“010”+“101”):(−“010”+“101”+“1001”)=“0011”=3=modulo[3:0]
as expected for this example input.
Which requires four full adders and three stages. Note the signs of the two output rows are swapped because t−i is even.
The total number of full adders indicated above is 10. However, there are only 6 unique sets of input rows and it is therefore possible to configure the full adder tree within which the full adders are comprised to share the results between full adders such that only 6 full adders are required. This is advantageous since it minimises circuit area and complexity.
Since t=6, an even number, t−i is even when i is even. Thus the two output rows for the cases i=0 and 2 should have their signs swapped (rule 4 above). The output rows for the cases i=1 and 3 require a minus 1 correction when calculating the partial_mod values below (rule 5 above).
The calculation of the (t−1)=5 div slices can now be performed given these values of partial_mod:
Appropriately concatenating all of these together we have:
div[14:12]&div[11:9]&div[8:6]&div[5:3]&div[2:0]=“001000010010100”
Due to the length of the input x being 16, the length of the div output will be 13 after division by 2^3+1. The top two bits of this concatenation, which were guaranteed to be zeros due to the initial appending of two zeros to the top of x to make its length divisible by 3, can therefore be removed. Note that the initial appending of zeros to x is to aid ease of explanation and it is preferred that a binary logic circuit does not waste additional logic on their presence.
Doing this, we finally get:
div[12:0]=“1000010010100”=4244
as expected for this example input.
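The worked example can be reproduced with a short software model. In this sketch the partial_mod values are obtained directly with the % operator, which stands in for the 'signed' full adder tree; the function name and the use of Python integers are assumptions made for illustration and say nothing about the hardware arrangement.

```python
def divide_by_2n_plus_1(x, m, n):
    """Return (div, modulo) for x / (2**n + 1), where m = t*n is the padded bit length of x."""
    d = (1 << n) + 1
    mask = (1 << n) - 1
    t = m // n
    # partial_mod[i] = x[m-1 : i*n] mod d (computed directly here, for checking only)
    partial_mod = [(x >> (i * n)) % d for i in range(t)]
    div = 0
    for i in range(t - 1):
        x_slice = (x >> (i * n)) & mask
        pm = partial_mod[i + 1]
        slice_val = pm - 1 if x_slice < pm else pm   # decrement when the difference is negative
        div |= slice_val << (i * n)                  # place the slice at div[(i+1)*n-1 : i*n]
    return div, partial_mod[0]


div, modulo = divide_by_2n_plus_1(0b1001010100110111, m=18, n=3)
assert (div, modulo) == (4244, 3)
```

The five slices formed for this input are 001, 000, 010, 010 and 100, from most to least significant, matching the concatenation given above.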
Typically, integrated circuits are initially designed using software (e.g. Synopsys(R) Design Compiler) that generates a logical abstraction of the desired integrated circuit. Such an abstraction is generally termed register-transfer level or RTL. Once the logical operation of the integrated circuit has been defined, this can be used by synthesis software (e.g. Synopsys(R) IC Compiler) to create representations of the physical integrated circuit. Such representations can be defined in high level hardware description languages, for example Verilog or VHDL and, ultimately, according to a gate-level description of the integrated circuit. Where logic for calculating a division operation
is required, design software may be configured to use logic configured according to the principles described herein. This could be achieved, for example, by introducing into the integrated circuit design register transfer level (RTL) code defining a binary logic circuit according to any of the examples described herein and shown in the figures.
The binary logic circuits illustrated in the figures are shown as comprising a number of functional blocks. This is schematic only and is not intended to define a strict division between different logic elements of such entities. Each functional block may be provided in any suitable manner. It is to be understood that intermediate values described herein as being formed by components of a binary logic circuit need not be physically generated by the binary logic circuit at any point and may merely represent logical values which conveniently describe the processing performed by the binary logic circuit between its input and output.
Generally, any of the functions, methods, techniques or components described above can be implemented in software, firmware, hardware (e.g., fixed logic circuitry), or any combination thereof. The terms “module,” “functionality,” “component”, “element”, “unit”, “block” and “logic” may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block or logic represents program code that performs the specified tasks when executed on a processor. The algorithms and methods described herein could be performed by one or more processors executing code that causes the processor(s) to perform the algorithms/methods. Examples of a computer-readable storage medium include a random-access memory (RAM), read-only memory (ROM), an optical disc, flash memory, hard disk memory, and other memory devices that may use magnetic, optical, and other techniques to store instructions or other data and that can be accessed by a machine.
The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for processors, including code expressed in a machine language, an interpreted language or a scripting language. Executable code includes binary code, machine code, bytecode, code defining an integrated circuit (such as a hardware description language or netlist), and code expressed in a programming language such as C, Java or OpenCL. Executable code may be, for example, any kind of software, firmware, script, module or library which, when suitably executed, processed, interpreted, compiled, or executed at a virtual machine or other software environment, causes a processor of the computer system at which the executable code is supported to perform the tasks specified by the code.
A processor, computer, or computer system may be any kind of device, machine or dedicated circuit, or collection or portion thereof, with processing capability such that it can execute instructions. A processor may be any kind of general purpose or dedicated processor, such as a CPU, GPU, System-on-chip, state machine, media processor, an application-specific integrated circuit (ASIC), a programmable logic array, a field-programmable gate array (FPGA), or the like. A computer or computer system may comprise one or more processors.
It is also intended to encompass software which defines a configuration of hardware as described herein, such as HDL (hardware description language) software, as is used for designing integrated circuits, or for configuring programmable chips, to carry out desired functions. That is, there may be provided a computer readable storage medium having encoded thereon computer readable program code in the form of an integrated circuit definition dataset that when processed in an integrated circuit manufacturing system configures the system to manufacture a binary logic circuit configured to perform any of the methods described herein, or to manufacture a binary logic circuit comprising any apparatus described herein. An integrated circuit definition dataset may be, for example, an integrated circuit description.
An integrated circuit definition dataset may be in the form of computer code, for example as a netlist, code for configuring a programmable chip, as a hardware description language defining an integrated circuit at any level, including as register transfer level (RTL) code, as high-level circuit representations such as Verilog or VHDL, and as low-level circuit representations such as OASIS (RTM) and GDSII. Higher level representations which logically define an integrated circuit (such as RTL) may be processed at a computer system configured for generating a manufacturing definition of an integrated circuit in the context of a software environment comprising definitions of circuit elements and rules for combining those elements in order to generate the manufacturing definition of an integrated circuit so defined by the representation. As is typically the case with software executing at a computer system so as to define a machine, one or more intermediate user steps (e.g. providing commands, variables etc.) may be required in order for a computer system configured for generating a manufacturing definition of an integrated circuit to execute code defining an integrated circuit so as to generate the manufacturing definition of that integrated circuit.
An example of processing an integrated circuit definition dataset at an integrated circuit manufacturing system so as to configure the system to manufacture a binary logic circuit will now be described with respect to
The layout processing system 1104 is configured to receive and process the IC definition dataset to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art, and for example may involve synthesising RTL code to determine a gate level representation of a circuit to be generated, e.g. in terms of logical components (e.g. NAND, NOR, AND, OR, MUX and FLIP-FLOP components). A circuit layout can be determined from the gate level representation of the circuit by determining positional information for the logical components. This may be done automatically or with user involvement in order to optimise the circuit layout. When the layout processing system 1104 has determined the circuit layout it may output a circuit layout definition to the IC generation system 1106. A circuit layout definition may be, for example, a circuit layout description.
The IC generation system 1106 generates an IC according to the circuit layout definition, as is known in the art. For example, the IC generation system 1106 may implement a semiconductor device fabrication process to generate the IC, which may involve a multiple-step sequence of photo lithographic and chemical processing steps during which electronic circuits are gradually created on a wafer made of semiconducting material. The circuit layout definition may be in the form of a mask which can be used in a lithographic process for generating an IC according to the circuit definition. Alternatively, the circuit layout definition provided to the IC generation system 1106 may be in the form of computer-readable code which the IC generation system 1106 can use to form a suitable mask for use in generating an IC.
The different processes performed by the IC manufacturing system 1102 may be implemented all in one location, e.g. by one party. Alternatively, the IC manufacturing system 1102 may be a distributed system such that some of the processes may be performed at different locations, and may be performed by different parties. For example, some of the stages of: (i) synthesising RTL code representing the IC definition dataset to form a gate level representation of a circuit to be generated, (ii) generating a circuit layout based on the gate level representation, (iii) forming a mask in accordance with the circuit layout, and (iv) fabricating an integrated circuit using the mask, may be performed in different locations and/or by different parties.
In other examples, processing of the integrated circuit definition dataset at an integrated circuit manufacturing system may configure the system to manufacture a binary logic circuit without the IC definition dataset being processed so as to determine a circuit layout. For instance, an integrated circuit definition dataset may define the configuration of a reconfigurable processor, such as an FPGA, and the processing of that dataset may configure an IC manufacturing system to generate a reconfigurable processor having that defined configuration (e.g. by loading configuration data to the FPGA).
In some embodiments, an integrated circuit manufacturing definition dataset, when processed in an integrated circuit manufacturing system, may cause an integrated circuit manufacturing system to generate a device as described herein. For example, the configuration of an integrated circuit manufacturing system in the manner described above with respect to
In some examples, an integrated circuit definition dataset could include software which runs on hardware defined at the dataset or in combination with hardware defined at the dataset. In the example shown in
The implementation of concepts set forth in this application in devices, apparatus, modules, and/or systems (as well as in methods implemented herein) may give rise to performance improvements when compared with known implementations. The performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During manufacture of such devices, apparatus, modules, and systems (e.g. in integrated circuits) performance improvements can be traded-off against the physical implementation, thereby improving the method of manufacture. For example, a performance improvement may be traded against layout area, thereby matching the performance of a known implementation but using less silicon. This may be done, for example, by reusing functional blocks in a serialised fashion or sharing functional blocks between elements of the devices, apparatus, modules and/or systems. Conversely, concepts set forth in this application that give rise to improvements in the physical implementation of the devices, apparatus, modules, and systems (such as reduced silicon area) may be traded for improved performance. This may be done, for example, by manufacturing multiple instances of a module within a predefined area budget.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
1618286.7 | Oct 2016 | GB | national |
Number | Date | Country |
---|---|---|
20180121166 A1 | May 2018 | US |