The present technique relates to the field of data processing.
A processor may have logic circuitry for implementing various arithmetic or logical operations. One arithmetic operation to be supported by a processor may be a multiplication operation. While the multiplication operation itself is arithmetically well-defined, there are design choices to be made in how to implement hardware circuit logic for performing that operation within a processor. Design decisions made by the circuit designer may have an impact on processing performance and/or energy efficiency.
At least some examples of the present technique provide multiplication circuitry comprising:
a plurality of adder arrays each to add a respective set of partial products to generate a respective product representing value representing a result of multiplication of a respective pair of portions of bits selected from a first operand and a second operand, the plurality of adder arrays comprising separate instances of hardware circuitry, the plurality of adder arrays having at least two separate enable control signals for independently controlling whether at least two subsets of the adder arrays are enabled or disabled;
shared Booth encoding circuitry, shared between the plurality of adder arrays, to Booth encode the first operand to generate a plurality of partial product selection indicators, each partial product selection indicator corresponding to a Booth encoding of a respective Booth digit of the first operand; and
partial product selection circuitry to select, based on the second operand and the plurality of partial product selection indicators, the partial products to be added by the plurality of adder arrays; in which:
at least two of the adder arrays are configured to operate on respective partial products selected by the partial product selection circuitry based on a same partial product selection indicator generated by the shared Booth encoding circuitry based on the Booth encoding of a same Booth digit of the first operand.
At least some examples of the present technique provide an apparatus comprising: processing circuitry to perform data processing in response to instructions, the processing circuitry comprising the multiplication circuitry described above.
At least some examples of the present technique provide a method comprising:
using shared Booth encoding circuitry shared between a plurality of adder arrays, Booth encoding a first operand to generate a plurality of sets of partial product selection indicators, each partial product selection indicator corresponding to a Booth encoding of a respective Booth digit of the first operand, wherein the plurality of adder arrays comprise separate instances of hardware circuitry, the plurality of adder arrays having at least two separate enable control signals for independently controlling whether at least two subsets of the adder arrays are enabled or disabled;
selecting, based on a second operand and the plurality of partial product selection indicators, respective sets of partial products to be added by a plurality of adder arrays; and
adding the partial products using enabled adder arrays of the plurality of adder arrays to generate respective product representing values each representing a result of multiplication of a respective pair of portions of bits selected from the first operand and the second operand,
wherein at least two of the adder arrays are configured to operate on respective partial products selected based on a same partial product selection indicator generated by the shared Booth encoding circuitry based on the Booth encoding of a same Booth digit of the first operand.
At least some examples of the present technique provide a non-transitory computer-readable medium to store computer-readable code for fabrication of multiplication circuitry comprising:
a plurality of adder arrays each to add a respective set of partial products to generate a respective product representing value representing a result of multiplication of a respective pair of portions of bits selected from a first operand and a second operand, the plurality of adder arrays comprising separate instances of hardware circuitry, the plurality of adder arrays having at least two separate enable control signals for independently controlling whether at least two subsets of the adder arrays are enabled or disabled;
shared Booth encoding circuitry, shared between the plurality of adder arrays, to Booth encode the first operand to generate a plurality of partial product selection indicators, each partial product selection indicator corresponding to a Booth encoding of a respective Booth digit of the first operand; and
partial product selection circuitry to select, based on the second operand and the plurality of partial product selection indicators, the partial products to be added by the plurality of adder arrays; in which:
at least two of the adder arrays are configured to operate on respective partial products selected by the partial product selection circuitry based on a same partial product selection indicator generated by the shared Booth encoding circuitry based on the Booth encoding of a same Booth digit of the first operand.
At least some examples provide a system comprising: the multiplication circuitry described above, implemented in at least one packaged chip; at least one system component; and a board, wherein the at least one packaged chip and the at least one system component are assembled on the board.
At least some examples provide a chip-containing product comprising the system described above assembled on a further board with at least one other product component.
Further aspects, features and advantages of the present technique will be apparent from the following description of examples, which is to be read in conjunction with the accompanying drawings.
Multiplication circuitry comprises two or more adder arrays each to add a respective set of partial products to generate a respective product representing value representing a result of multiplication of a respective pair of portions of bits selected from a first operand and a second operand. Each adder array comprises a separate instance of hardware circuitry. The adder arrays have at least two separate enable control signals for independently controlling whether at least two subsets of adder arrays are enabled or disabled. Providing respective subsets of adder arrays each with independent enable/disable control, with each adder array being capable of providing a respective product representing value representing the result of multiplication of a respective pair of portions of bits from the first and second operands, can be useful to support multiplication operations on portions of the first operand and second operand corresponding to different data types, for example.
The partial products added by each adder array are selected based on the second operand and based on partial product selection indicators generated by Booth encoding circuitry. Booth encoding is a technique used in multiplication circuits to reduce the number of partial products which need to be added to produce a multiplication result. Booth encoding is based on treating the first operand as a series of overlapping Booth digits (each Booth digit comprising a certain number of bits of the first operand) and analyzing the bit pattern of each Booth digit to identify positions at which runs of successive 1s start/end in the first operand. For a given Booth digit of the first operand, a corresponding partial product selection indicator is generated which represents the Booth encoding of the Booth digit. The partial product selection indicator controls partial product selection circuitry to select a corresponding partial product to be added by one of the adder arrays. For example, a given partial product may be a selected multiple of a portion of the second operand, with the corresponding partial product selection indicator used to select between different multiples of that portion of the second operand.
Where a number of independently enabled/disabled subsets of adder arrays are provided for generating respective product representing values, a typical approach would be to provide each adder array with its own private instance of Booth encoding circuitry, not shared with the Booth encoding circuitry used by other arrays.
However, the inventor recognised that there is an opportunity to save circuit area and power consumption based on a recognition that the different adder arrays may be Booth encoding the same operand or the same subsections of that operand, so that it is possible to share the Booth encoding circuitry between the adder arrays. Hence, shared Booth encoding circuitry is provided, that is shared between the plurality of adder arrays, to Booth encode the first operand to generate the partial product selection indicators, each partial product selection indicator corresponding to a Booth encoding of a respective Booth digit of the first operand. At least two of the adder arrays are configured to operate on respective partial products selected by the partial product selection circuitry based on a same partial product selection indicator generated by the shared Booth encoding circuitry based on the Booth encoding of a same Booth digit of the first operand. By providing shared Booth encoding circuitry, overall circuit area and power consumption of the multiplication circuitry can be reduced. This technique can also help to reduce pressure on meeting circuit timings, due to not needing to retain as many bits for the partial product selection indicators compared to an implementation with private non-shared Booth encoders for each adder array.
Each adder array generates a respective product representing value, which indicates the numerical result of multiplication of a respective pair of portions of bits selected from a first operand and a second operand. Hence, one adder array may generate the product representing value representing the product of a first portion of the first operand and a corresponding first portion of the second operand. Another adder array may generate the product representing value representing the product of a second portion of the first operand (which could be overlapping or non-overlapping with the first portion) and a corresponding second portion of the second operand. Each product representing value may represent the numerical result of an A-bit*B-bit multiplication of the corresponding portions of bits taken from the first and second operands respectively (where A can be equal to B or A can be greater than or less than B, and the values of A and B can be the same for two or more adder arrays or could differ between one adder array and another). Each product representing value represents the numerical result of the A-bit*B-bit multiplication (so is not merely an intermediate step representing a partial sum of only some of the partial products corresponding to the A-bit*B-bit multiplication). However, the numerical result can be represented in different forms. For example, the product representing value could represent the numerical result in a carry-save format using a sum term and a carry term. For example, the adder array could comprise a carry-save-adder tree which does not propagate carries from one bit lane to another, but performs a series of 3:2 carry-save add reductions to reduce the set of partial products to a sum term and a carry term which, when added together by a further carry-propagate adder (not part of the adder arrays), may produce a number representing the numerical result of the A-bit*B-bit multiplication in binary representation.
Hence, it will be appreciated that the output of each adder array could represent the product representing value in a carry save form.
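By way of illustration only, the carry-save reduction described above can be modelled in software as follows. This is a minimal sketch and not a description of the actual circuit: the function names csa and reduce_carry_save are hypothetical, the operands are assumed to be non-negative integers, and the order in which terms are reduced is arbitrary rather than reflecting any particular tree layout.

```python
def csa(x, y, z):
    """3:2 carry-save add: reduce three addends to a sum term and a carry term."""
    sum_term = x ^ y ^ z                            # per-bit sum, no carry propagation
    carry_term = ((x & y) | (x & z) | (y & z)) << 1  # per-bit carry, weighted one place higher
    return sum_term, carry_term


def reduce_carry_save(partial_products):
    """Repeatedly apply 3:2 reductions until only a sum term and a carry term remain."""
    terms = list(partial_products)
    while len(terms) > 2:
        s, c = csa(terms[0], terms[1], terms[2])
        terms = terms[3:] + [s, c]                  # three terms in, two terms out
    while len(terms) < 2:
        terms.append(0)
    return terms[0], terms[1]


# Partial products of 13 * 0b1011 formed by simple shifting (13<<0, 13<<1, 13<<3).
sum_term, carry_term = reduce_carry_save([13 << 0, 13 << 1, 13 << 3])
# A separate carry-propagate add (outside the adder array) yields the binary product.
product = sum_term + carry_term
```

In this model the pair (sum_term, carry_term) plays the role of the product representing value, and the final addition corresponds to the further carry-propagate adder that is not part of the adder arrays.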
The multiplication circuitry is configured to support at least two data element size configurations for multiplication of one or more respective pairs of data elements selected from the first operand and the second operand, each data element size configuration corresponding to a different combination of data element sizes for the data elements selected from the first operand and the second operand. For example, the data element size configurations can correspond to different SIMD element sizes for a SIMD (single instruction, multiple data) multiplication operation. For example, the first and second operands could be vector operands and each data element could be a respective vector element. In another example, the first and second operands could be matrix operands and each data element could be a respective matrix element.
For example, each individual adder array may be sized appropriate for a specific data element size configuration, and selection of which adder arrays are enabled/disabled may depend on the current element size configuration in use. Hence, enable control circuitry may select, based on the current data element size configuration to be used for the multiplication operation, which of the adder arrays are to be enabled and which of the adder arrays are to be disabled. Use of at least two subsets of adder arrays with independent enable/disable control can provide greater energy efficiency for implementing multiplications with different data element size configurations, compared to an implementation which provides a single large adder array and relies on injecting 0s for some of the partial product bits when performing multiplications with a data element size smaller than the maximum supported size.
For the Booth encoding of a given Booth digit of the first operand: for at least one of the data element size configurations for which the given Booth digit spans a data element boundary, the shared Booth encoding circuitry may generate the Booth encoding of the given Booth digit based on a least significant bit of the given Booth digit being set to 0; and for at least one other of the data element size configurations for which the given Booth digit does not span a data element boundary, the shared Booth encoding circuitry may generate the Booth encoding of the given Booth digit based on the least significant bit of the given Booth digit being set to a value of a corresponding bit of the first operand. Hence, by providing, in the shared Booth encoding circuitry, selection circuitry to select whether the least significant bit of the given Booth digit is treated as 0 or the corresponding bit of the first operand, this allows the same Booth encoding circuitry to be shared between the different adder arrays, even when those adder arrays are based on data element size configurations which mean that the given Booth digit of the first operand will span the element boundary for one configuration but not for another.
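By way of illustration only, the following sketch models a single shared encoder serving two data element size configurations. Several details are assumptions made for the example and are not taken from the description: radix-4 encoding, a 16-bit first operand, signed elements, 8-bit and 16-bit element sizes, and the function names shared_booth_encode, to_signed and multiply_lanes. The selection indicators are modelled simply as the signed multiple (-2 to +2) to be selected.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def shared_booth_encode(first_operand, element_bits, operand_bits=16):
    """Radix-4 Booth encode every digit of the first operand once for all adder arrays.

    The least significant bit of a digit is treated as 0 when the digit spans
    a data element boundary in the current configuration; otherwise it takes
    the value of the corresponding bit of the first operand.
    """
    indicators = []
    for i in range(operand_bits // 2):
        spans_boundary = (2 * i) % element_bits == 0
        low = 0 if spans_boundary else (first_operand >> (2 * i - 1)) & 1
        mid = (first_operand >> (2 * i)) & 1
        high = (first_operand >> (2 * i + 1)) & 1
        indicators.append(-2 * high + mid + low)
    return indicators


def multiply_lanes(first_operand, second_operand, element_bits, operand_bits=16):
    """Form one product per element, with every lane using the shared indicators."""
    indicators = shared_booth_encode(first_operand, element_bits, operand_bits)
    digits_per_lane = element_bits // 2
    products = []
    for lane in range(operand_bits // element_bits):
        # Each lane selects multiples of its own portion of the second operand.
        r = to_signed(second_operand >> (lane * element_bits), element_bits)
        lane_indicators = indicators[lane * digits_per_lane:(lane + 1) * digits_per_lane]
        products.append(sum(m * r * 4 ** k for k, m in enumerate(lane_indicators)))
    return products


a, b = 0x12F3, 0x8005
one_16_bit_product = multiply_lanes(a, b, element_bits=16)
two_8_bit_products = multiply_lanes(a, b, element_bits=8)
```

In this model the two configurations produce different indicators only for the digit that straddles the boundary between bits 7 and 8 of the first operand; every other digit is encoded identically, which is what allows the encoder logic to be shared.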
The adder arrays may comprise at least:
For the sharing of the Booth encoder between adder arrays, the at least two adder arrays described above (those adder arrays for which the respective partial products are selected based on the same partial product selection indicator) may comprise an adder array of the first subset and an adder array of the second subset. This may reflect that in the first/second data element size configurations, the first/second subsets of adder arrays operate on the same subsections of the first operand (although those subsections may be logically divided into different data element configurations), so regardless of which data element configuration is active, the same hardware circuit logic can be used to analyze the bit pattern of a given Booth digit of the first operand and provide a corresponding partial product selection indicator used by the partial product selection circuitry to select respective partial products for both the adder array of the first subset and the adder array of the second subset.
Although the same partial product selection indicator is used for both an adder array of the first subset and an adder array of the second subset, the partial product selected for the adder array of the first subset based on that same partial product selection indicator may be dependent on a first subset of bits of the second operand when the multiplication is to be performed according to the first data element size configuration, and the partial product selected for the adder array of the second subset based on that same partial product selection indicator may be dependent on a second subset of bits of the second operand when the multiplication is to be performed according to the second data element size configuration. Hence, although the same partial product selection indicator is used for respective adder arrays in both the first subset and the second subset, the value of the corresponding partial products used by those respective adder arrays in the first/second subsets can still be different because those partial products may be based on different portions of the second operand.
Enable control circuitry may be configured to set one or more enable control signals for the first subset of adder arrays to disable the first subset of adder arrays, in response to current data element size configuration information specifying that the first data element size configuration is not needed for a given multiplication operation performed on the first operand and the second operand; and set one or more enable control signals for the second subset of adder arrays to disable the second subset of adder arrays, in response to the current data element size configuration information specifying that the second data element size configuration is not needed for the given multiplication operation. Hence, the first subset of adder arrays may be enabled when the first data element size configuration is required, and the second subset of adder arrays may be enabled when the second data element size configuration is required. If one of the first/second data element size configuration is not required, the corresponding one of the first/second subsets of adder arrays can be disabled to save power.
The enable control circuitry may support performing, in parallel, multiplication operations on the first and second operands using both the first data element size configuration and the second data element size configuration, by enabling both the first subset of adder arrays and the second subset of adder arrays using the enable control signals. Although often only one of the data element size configurations may be required at a given time, there can be some use cases where it is useful to process the same first and second operands with more than one different data element size configuration to generate corresponding sets of product results for two or more data element size configurations. For example, this can be useful in some machine learning or graphics processing applications. For example, two vector operands J and K could be processed to generate both a first vector result L providing a set of product representing values each corresponding to the result of multiplying corresponding pairs of 32-bit elements of J and K, and a second vector result M providing a set of product representing values each corresponding to the result of multiplying corresponding pairs of 16-bit elements of J and K. By implementing the multiplication circuitry using a number of subsets of adder arrays each corresponding to a given data element size configuration and each with independent enable/disable control, this can support such parallel multiplication operations performed on the same operands according to more than one data element size configuration.
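By way of illustration only, the selection of enable control signals from the current data element size configuration information could be modelled as below. The subset names, the choice of 8-bit, 16-bit and 32-bit configurations, and the function name enable_signals are assumptions made for the sketch; the description does not prescribe any particular encoding of the configuration information.

```python
# Hypothetical mapping from element size (in bits) to the subset of adder
# arrays designed for that data element size configuration.
SUBSET_FOR_ELEMENT_SIZE = {8: "subset_8", 16: "subset_16", 32: "subset_32"}


def enable_signals(required_element_sizes):
    """Return one enable control signal per subset of adder arrays.

    A subset is enabled only if its data element size configuration is needed
    for the current multiplication; all other subsets are disabled.
    """
    return {
        subset: size in required_element_sizes
        for size, subset in SUBSET_FOR_ELEMENT_SIZE.items()
    }


single_configuration = enable_signals({16})        # only the 16-bit subset is enabled
parallel_configurations = enable_signals({16, 32})  # both subsets operate on the same operands
```

Passing more than one element size models the parallel case, in which two subsets are enabled at once to produce results for the same operands under two configurations.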
While the examples discussed refer to a first subset of adder arrays and a second subset of adder arrays, it will be appreciated that the same techniques could be extended to three or more subsets of adder arrays. For example, the plurality of adder arrays may also comprise a third subset of adder arrays for which each respective product representing value represents the result of multiplication of a respective pair of data elements selected from the first operand and the second operand according to a third data element size configuration. For example, the first, second and third subsets of adder arrays could be used for 8-bit, 16-bit, 32-bit multiplications respectively.
The adder arrays can also be used in a cooperative mode where they cooperate to perform respective portions of a larger multiplication. Product adding circuitry may add, for a multiplication operation in which two or more of the adder arrays are to operate in a cooperative mode, the respective product representing values generated by two or more of the adder arrays to generate a product result value representing the multiplication of wider portions of bits of the first operand and the second operand than the portions of bits used to generate the respective product representing values for any one of said two or more of the adder arrays. For example, where respective adder arrays are provided to support at least two of 8-bit, 16-bit and 32-bit data element configurations respectively, those adder arrays may also be mapped to respective portions of the partial product additions required for a larger multiplication operation (e.g. with 64-bit elements). Implementing a larger multiplication operation using a number of independently enabled/disabled smaller adder arrays in this way can provide a more power-efficient way of implementing options for both a smaller-size and larger-size multiplication operation, compared to alternative approaches such as the injection of zeroes into a single monolithic large multiplier as discussed above.
For at least one adder array, the portion of the second operand used to form the partial products for that adder array is variable depending on whether the multiplication operation is to be performed in the cooperative mode. For example, the partial product selection circuitry may select the portion of the second operand used to form the partial products for that adder array as a first portion of the second operand when the multiplication operation is to be performed in a non-cooperative mode (with each adder array operating individually to produce an independent product representing value) and a second portion of the second operand when the multiplication is to be performed in a cooperative mode (with the adder arrays generating respective product representing values which can further be added together by the product adding circuitry to generate the product result value representing the result of the wider multiplication).
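By way of illustration only, the arithmetic behind the cooperative mode can be sketched for a 16-bit*16-bit multiplication assembled from 8-bit*8-bit products. The sketch assumes unsigned operands, and the assignment of the two cross terms to particular adder arrays is an assumption made for the example; the function name cooperative_multiply_16 is hypothetical.

```python
def cooperative_multiply_16(a, b):
    """Build an unsigned 16x16 product from four 8x8 product representing values."""
    a_lo, a_hi = a & 0xFF, (a >> 8) & 0xFF
    b_lo, b_hi = b & 0xFF, (b >> 8) & 0xFF
    # Each entry models the output of one adder array and the bit weight at
    # which the product adding circuitry adds it into the wider result.
    array_outputs = [
        (a_lo * b_lo, 0),    # same pairing as non-cooperative 8-bit lane 0
        (a_hi * b_hi, 16),   # same pairing as non-cooperative 8-bit lane 1
        (a_lo * b_hi, 8),    # cross term: a different portion of the second operand
        (a_hi * b_lo, 8),    # cross term: a different portion of the second operand
    ]
    return sum(value << shift for value, shift in array_outputs)


wide_product = cooperative_multiply_16(0xBEEF, 0x1234)
```

The two cross terms are not needed for independent 8-bit multiplications, which is why an adder array used in the cooperative mode may take a different portion of the second operand than it would in the non-cooperative mode.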
In some examples, the wider portions of bits of the first operand and the second operand comprise all magnitude-indicating bits of the first operand and the second operand. For example, where the first/second operands are vector operands based on signed/unsigned integer data types, the cooperative mode may provide full-operand-width multiplication of a number represented by all bits of the first operand and a number represented by all bits of the second operand. Also, for an example where the first/second operands are floating-point operands comprising bits representing a sign (positive or negative), an exponent and a fraction, the magnitude-indicating bits may be the bits of a significand represented by the fraction (so the bits of the first/second operands corresponding to the sign and exponent may not be considered magnitude-indicating bits of the first/second operands).
In other examples, the wider portions of bits of the first operand and the second operand need not comprise all magnitude-indicating bits of the first operand and the second operand.
For example, the cooperative mode can be used to cause adder arrays to work together to compute a multiplication result for a data type which does not correspond to the maximum supported size representable using the first/second operand, but which does not have adder arrays supporting that data type natively. In this case, it may be that only some of the adder arrays need to cooperate to provide the multiplication result for that data type, so some of the adder arrays may not be needed (and so can be disabled using the corresponding enable control signal) even in the cooperative mode, depending on the configuration option selected for a particular multiplication operation.
In one example, the adder arrays comprise:
The plurality of subsets of adder arrays may be the same subsets as the at least two subsets mentioned above that are provided with separate enable control signals.
Hence, in some examples, the support for a larger multiplication in the cooperative mode may require some additional partial products to be added which are not required for any of the data element size configurations supported natively by the plurality of subsets of adder arrays. Hence, a further adder array may be provided (which is disabled in the non-cooperative mode), to compute a further product representing value representing a result of a remaining part of the multiplication of the wider portions being multiplied in the cooperative mode. This further adder array can also share the same Booth encoding circuitry used for the other subsets of adder arrays that are used in the non-cooperative mode.
In some examples, the separate enable control for the adder arrays may be provided on a subset-by-subset basis, so that each subset of adder arrays corresponding to a given data element size configuration has its own independent enable/disable control, but it is not essential to provide finer grained control to independently enable/disable each individual adder array within a subset. For example, a first subset of adder arrays designed for multiplications based on a first data element size configuration (e.g. 8-bit*8-bit multiplications) could be collectively enabled/disabled based on a single shared enable control signal, with that enable control signal being separate from an enable control signal used to enable/disable a second subset of adder arrays designed for multiplications based on a second data element size configuration (e.g. 16-bit*16-bit multiplications).
However, in other examples each adder array could have its own independent enable control signal, allowing each adder array to be independently enabled/disabled by the enable control signal. This can be useful, for example, for allowing some adder arrays to be disabled (even when the currently selected data element size configuration is the data element size configuration associated with that adder array), for example if a given data element lane of the first/second operands is predicated to disable the effects in that lane. Also, providing independent enable/disable control for the respective adder arrays within a subset of adder arrays designed for a given data element size configuration can also be helpful to allow a set of adder arrays designed to support a certain maximum operand length for the first/second operands to also support multiplications performed on operands of smaller length. For example, while a set of adder arrays could be provided to support 64-bit operands, half of the adder arrays used for a given data element size configuration could be disabled when the multiplication is to be applied to 32-bit operands.
The separate enable control signals for each adder array can be implemented in a variety of ways. For example, power gating could be used to provide separate enable/disable control, by providing the ability to selectively isolate each adder array from a power supply node.
However, in one example, the separate enable control signals comprise separate clock signals. For example, the enable control circuitry described above may clamp the clock signal to a selected adder array to a fixed value when that selected adder array is to be disabled. Holding the clock signal to the selected adder array static may prevent the selected adder array from functioning and may conserve dynamic power. Controlling enabling/disabling of the adder arrays based on whether the corresponding clock signals are allowed to toggle or not can be a more area-efficient way of providing independent enable/disable control for each adder array, as it can require fewer additional logic gates compared to an alternative such as power gating.
An apparatus may comprise processing circuitry to perform data processing in response to instructions, the processing circuitry comprising the multiplication circuitry described above. For example, the processing circuitry could be a CPU (Central Processing Unit), GPU (Graphics Processing Unit) or other processing unit within a data processing system (e.g. a Neural Processing Unit provided for performing neural network processing or other machine learning operations).
The execute stage 16 includes a number of processing units, for executing different classes of processing operation. For example, the execution units may include an integer or fixed-point arithmetic/logic unit (ALU) 20 for performing arithmetic or logical operations on scalar operands or vector operands read from the register file 14; a floating point unit 22 for performing operations on floating-point values; a branch unit 24 for evaluating the outcome of branch operations and adjusting the program counter which represents the current point of execution accordingly; and a load/store unit 28 for performing load/store operations to access data in a memory system 8, 30, 32, 34. In this example the memory system includes a level one data cache 30, the level one instruction cache 8, a shared level two cache 32 and main system memory 34. It will be appreciated that this is just one example of a possible memory hierarchy and other arrangements of caches can be provided. The specific types of processing unit 20 to 28 shown in the execute stage 16 are just one example, and other implementations may have a different set of processing units or could include multiple instances of the same type of processing unit so that multiple micro-operations of the same type can be handled in parallel.
One example of an operation which may be supported by the processing circuitry 4 (e.g. within the ALU 20 or the floating point unit 22) is a multiplication operation. In some systems, a dedicated execution unit called a multiply-accumulate (MAC) unit may be provided to handle multiplications, including multiply-accumulate operations (where two operands are multiplied and the result is added to an accumulator value). Multiply-accumulate operations may be frequently used in digital signal processing algorithms, for example, so any techniques for improving energy efficiency and reducing pressure to meet circuit timings can be extremely helpful.
While examples below discuss a multiplication operation, this is intended to encompass multiply-add or multiply-accumulate operations, so even if a subsequent adder for adding the multiplication result to a third operand is not shown, such a subsequent adder could still be provided. It is also possible to provide standalone multiplication operations which produce a multiplication result without also adding the multiplication result to a third operand.
The multiplication circuitry 50 described below is based on a technique known as Booth multiplication, which is based on the principle that, when multiplying a first value (a multiplicand M) by a second value (a multiplier R) to obtain a multiplication result M*R, within the multiplicand M a string of consecutive binary 1s can effectively be replaced with a +1 at the bit position one place higher than the upper end of the string and a −1 at the bit position corresponding to the lower end of the string, which can help to reduce many of the partial products to zero and so make processor logic implementation more straightforward. This is analogous to 999 in decimal being equivalent to 1000-1. Hence, if considering a multiplication of 999*R, the “schoolbook” long multiplication approach would carry out a series of additions of partial products 900*R+90*R+9*R. With the Booth approach this could be reduced to 1000*R−1*R. Respective overlapping groups of bits (referred to as “Booth digits” below) of the multiplicand M can be analysed to look for patterns representing the start/end of runs of successive 1s, and this can be used to deduce each multiple of R to be selected as a respective partial product to be added to form the product result.
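By way of illustration only, the binary counterpart of the 999 = 1000-1 analogy can be checked numerically. The sketch below is an arithmetic model only; the function name run_of_ones_multiply and the example values are assumptions made for the illustration.

```python
def run_of_ones_multiply(r, low, high):
    """Multiply r by a value whose bits low..high (inclusive) are all 1.

    The run of 1s is replaced by +1 one place above its upper end and -1 at
    its lower end, so only two multiples of r are needed.
    """
    return (r << (high + 1)) - (r << low)


m = 0b01111100   # a run of five 1s occupying bits 2 to 6
r = 37

# Schoolbook approach: one partial product for every set bit of m.
schoolbook_terms = [r << k for k in range(8) if (m >> k) & 1]
schoolbook_result = sum(schoolbook_terms)

# Booth-style approach: one addition and one subtraction for the whole run.
booth_style_result = run_of_ones_multiply(r, low=2, high=6)
```

Here the schoolbook approach adds five non-zero partial products, whereas the run-based form needs only the two multiples at the ends of the run.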
Booth multiplication involves three stages:
The multiplicand M is logically partitioned into a series of overlapping Booth digits each corresponding to a subset of bits of the multiplicand M. For each Booth digit, the Booth encoder analyses the pattern of bits in that Booth digit and outputs, as a Booth encoding of that Booth digit, a partial product selection indicator which indicates which of a number of different multiples of the multiplier R should be selected as a corresponding partial product to be included in the set of partial products added to produce the multiplication result M*R. Different “radix” versions of the Booth encoding scheme can be provided, where the radix indicates how many bits of the multiplicand M are considered in each Booth digit. Neighbouring Booth digits overlap by 1 bit. The least significant Booth digit is padded with a fixed bit of 0b0 at the lower end. The most significant Booth digit is padded with at least one bit above the most significant bit of M. The padding bits correspond to a sign-extension of the multiplicand.
For example, for a Radix-4 Booth multiplication, each Booth digit comprises 3 bits, and neighbouring Booth digits overlap by 1 bit. For example, for an 8-bit multiplicand M having bits M [7:0], the Booth digits may comprise:
The Booth encoder implements hardware circuit logic that, based on the pattern of bits in a given Booth digit, determines which multiple of the multiplier R should be selected for a corresponding partial product. The rules for which multiple to select for a given pattern of bits in a Booth digit are based on whether a run of successive 1s starts or ends within that Booth digit. For example, if all bits are 0 or all bits are 1 within the Booth digit, the multiple to select is 0*R because there is no run of 1s starting or ending within the Booth digit. For Booth digits involving a mix of 0s and 1s, the multiple to select depends on the position where any transition from 0 to 1 or 1 to 0 occurs, so that the multiple implements the combined effect of (i) adding a +R multiple at a bit position one higher than the top bit of any run of successive 1s occurring within the Booth digit and (ii) adding a −R multiple at a bit position corresponding to the bottom bit of any run of successive 1s occurring within the Booth digit. However, as the Booth digits are assessed multiple bits at a time, for Radix-4 or higher-radix Booth encoding, higher multiples of R such as ±2*R are considered to account for the fact that the +R or −R multiple could be injected at different bit positions within the Booth digit. Some worked examples are discussed below for Radix-4 and Radix-8, but it will be appreciated that other radix values could be used. For each Booth digit of the multiplicand M, the Booth encoder outputs a partial product selection indicator indicating which of a set of candidate multiples of the multiplier R should be selected for a corresponding partial product.
The required multiples of the multiplier R are prepared (this can be done in parallel with the Booth encoding). For example, for Radix-4 operations, the multiples of R that could be selected for a given Booth digit are: +2*R, +R, 0, −R, −2*R. For Radix-8 the multiples extend from +4*R to −4*R. Hence, partial product selection circuitry may form the multiple values.
Forming the multiple values may include negation of R for forming the negative multiples, left shifting of R to form power-of-2 multiples such as +2*R and +4*R, and, if the radix is such that a non-power of 2 multiple such as +3*R is required, addition of other multiples (e.g. adding +2*R and +R to form the +3*R multiple).
From among the candidate multiples, for a given Booth digit partial product selection circuitry selects one of the multiples, based on the partial product selection indicator provided by the Booth encoder for that given Booth digit.
The partial products selected for each Booth digit are added together (with an appropriate alignment between the partial products for adjacent Booth digits to account for the relative magnitude of the partial product based on the position where the Booth digit was found within the multiplicand M).
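The three stages above can be sketched in software as follows. This is only an illustrative behavioural model, not the hardware circuit logic: the function names `booth_digits` and `booth_multiply` and the parameter `k` (the number of new multiplicand bits consumed per Booth digit, so `k=2` models Radix-4 and `k=3` models Radix-8) are hypothetical names chosen for this sketch, and the multiplicand is assumed to be a signed two's complement value.

```python
def booth_digits(m, width=8, k=2):
    """Stage 1: signed Booth digits of a `width`-bit two's complement m.

    k=2 gives Radix-4 (3-bit digits), k=3 gives Radix-8 (4-bit digits).
    Each returned digit is the multiple of R to select for that position.
    """
    n_digits = -(-(width + 1) // k)          # ceil((width + 1) / k)
    # bits[0] is the fixed 0 padding below M; Python's >> sign-extends m.
    bits = [0] + [(m >> i) & 1 for i in range(n_digits * k)]
    digits = []
    for d in range(n_digits):
        group = bits[d * k : d * k + k + 1]  # LSB first; overlaps neighbour by 1
        value = group[0] + sum(b << j for j, b in enumerate(group[1:k]))
        digits.append(value - (group[k] << (k - 1)))  # top bit has negative weight
    return digits


def booth_multiply(m, r, width=8, k=2):
    digits = booth_digits(m, width, k)                          # stage 1: encode
    half = 1 << (k - 1)
    multiples = {c: c * r for c in range(-half, half + 1)}      # stage 2: form multiples
    selected = [multiples[d] for d in digits]                   # stage 2: select
    return sum(pp << (k * i) for i, pp in enumerate(selected))  # stage 3: aligned add


print(booth_multiply(-5, 7))
```

In this model the alignment between adjacent partial products is the left shift by `k * i` bit positions, and the multiple formation runs independently of the digit encoding, mirroring the observation that the two can proceed in parallel.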
To illustrate Booth multiplication, consider a multiplication M*R where M and R are both 8-bit values and the decimal values corresponding to M and R are M=56 and R=47:
With a traditional “schoolbook” long multiplication method, this can be converted into a series of partial products as follows (where PPi is +R if the corresponding bit i of M is 1 and is 0 if the corresponding bit i of M is 0):
With the schoolbook approach, the run of successive 1s in M causes three partial products PP3, PP4, PP5 to include +R multiples. With the Booth approach, the same result could have been achieved by adding +R for PP6 and adding −R for PP3, but with Radix-2 (which would correspond to Booth digits each comprising 2 bits), this would not enable the number of partial products to be reduced. Hence, most practical Booth implementations use Radix-4 or higher.
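The equivalence between the schoolbook sum and the Booth substitution for this example can be checked with a few lines of arithmetic. The variable names below are illustrative only.

```python
M, R = 56, 47   # the 8-bit example values used in the text

# Schoolbook: PP_i is +R when bit i of M is 1, otherwise 0.
schoolbook = [R if (M >> i) & 1 else 0 for i in range(8)]
nonzero = [i for i, pp in enumerate(schoolbook) if pp]
schoolbook_result = sum(pp << i for i, pp in enumerate(schoolbook))

# Booth observation: the run of 1s at bits 3..5 equals 2**6 - 2**3,
# so +R at bit position 6 and -R at bit position 3 give the same product.
booth_result = (R << 6) - (R << 3)

print(nonzero, schoolbook_result, booth_result)
```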
For Radix-4, the Booth digits are selected based on the bits of M as explained earlier, and the encoding rules are as follows:
Using the same example of M=56 and R=47, the multiple values available for selection for each partial product are:
Hence, the Booth digits selected from the multiplicand M, and the corresponding partial products selected for each Booth digit according to the encoding rules shown above are as follows:
BD0 = M1, M0, M−1 = 00(0) -> PP0 = 0
BD1 = M3, M2, M1 = 100 -> PP1 = −2R = 10100010
BD2 = M5, M4, M3 = 111 -> PP2 = 0
BD3 = M7, M6, M5 = 001 -> PP3 = +R = 00101111
BD4 = SE, SE, M7 = 000 -> PP4 = 0
Adding partial products (shifted by 2 each time to account for relative alignment of each Booth digit) gives:
Hence, the same numeric result can now be achieved while adding only 5 partial products, rather than 9.
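The Radix-4 worked example can be reproduced with a lookup table. The table below holds the conventional Radix-4 Booth rules, which are assumed here to correspond to the encoding rules referred to above; `RADIX4_RULES` and `radix4_trace` are hypothetical names used only for this sketch.

```python
# Conventional Radix-4 Booth rules: 3-bit digit pattern -> multiple of R.
RADIX4_RULES = {
    "000": 0, "001": +1, "010": +1, "011": +2,
    "100": -2, "101": -1, "110": -1, "111": 0,
}


def radix4_trace(m, r, width=8):
    """Return (pattern, multiple, partial product) for each Booth digit of m."""
    def bit(i):
        # Bit -1 is the fixed 0 padding; bits above width-1 are sign extension.
        return 0 if i < 0 else (m >> i) & 1

    rows = []
    for d in range(width // 2 + 1):
        pattern = "".join(str(bit(i)) for i in (2 * d + 1, 2 * d, 2 * d - 1))
        multiple = RADIX4_RULES[pattern]
        rows.append((pattern, multiple, multiple * r))
    return rows


rows = radix4_trace(56, 47)
result = sum(pp << (2 * d) for d, (_, _, pp) in enumerate(rows))
for d, (pattern, multiple, pp) in enumerate(rows):
    print(f"BD{d}={pattern} -> {multiple:+d}*R, low 8 bits {pp & 0xFF:08b}")
print(result)
```

Only two of the five selected partial products are non-zero in this example, and their aligned sum equals 56*47.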
Similarly, for Radix-8, the Booth encoding rules are as follows:
Applying this to the same M=56 and R=47 example gives:
Now the result can be achieved in 3 partial products.
Hence, an approach using a higher radix can treat the multiplicand M as including fewer Booth digits and hence require fewer partial products to be added, but this is at the expense of increased complexity in having more options for the multiple selection (which will increase the circuit complexity of generating the multiple values and the multiplexers which select the multiple, as well as the complexity of the Booth encoder).
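The trade-off can be made concrete by counting, for a given radix, the partial products to be added and the candidate multiples to be generated. The sketch below counts Booth digits in the same way as the examples above (including the top digit that holds sign-extension bits); `radix_tradeoff` is a hypothetical helper name.

```python
import math


def radix_tradeoff(width, k):
    """Counts for a signed `width`-bit multiplicand with radix 2**k Booth digits."""
    half = 1 << (k - 1)
    return {
        "radix": 1 << k,
        "bits_per_digit": k + 1,
        "partial_products": math.ceil((width + 1) / k),
        "candidate_multiples": list(range(-half, half + 1)),
        # Odd multiples above 1 cannot be formed by shifting R alone,
        # so each needs an addition (e.g. 3*R = 2*R + R).
        "needs_adder": list(range(3, half + 1, 2)),
    }


for k in (2, 3):
    print(radix_tradeoff(8, k))
```

For an 8-bit multiplicand this gives 5 partial products chosen from 5 candidate multiples for Radix-4, and 3 partial products chosen from 9 candidate multiples (one of which, 3*R, requires an adder) for Radix-8.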
As shown in
For one of the data element size configurations, the first and second operands may be treated as single data elements to be multiplied together, but for other data element size configurations each of the first and second operands may be logically divided into multiple independent data elements and the product result to be generated may be a vector or matrix comprising a number of result data elements each corresponding to the product of a corresponding pair of data elements of the first and second operands. For a given multiply operation, the data element size configuration information may depend on an immediate operand or register operand of an instruction executed by the processing circuitry 4, and/or based on element size mode information stored within the system register of the processing circuitry 4. The data element size configuration information may vary from one multiplication operation to another.
The shared Booth encoding circuitry 52 Booth encodes the first operand src_a, to generate a set of partial product selection indicators 62 which each correspond to a Booth encoding of a respective Booth digit of the first operand. The Booth encoding is generated based on the bit patterns of the corresponding Booth digit, according to the encoding rules shown for radix-4 or radix-8 above (or alternatively, if higher radix is used for the Booth encoding, according to similar rules for that higher radix).
The partial product selection circuitry 54 selects, based on the second operand src_b and the partial product selection indicators 62, the sets of partial products to be added by each of the adder arrays 56. For example, a first set of partial products “pps 0” is selected for adder array 56-0, a second set of partial products “pps 1” is selected for adder array 56-1, and so on. The partial product selection may also depend on the data element size configuration information (e.g. different portions of the second operand src_b may be used to select the partial products, depending on whether the data element size configuration information indicates use of the cooperative mode or non-cooperative mode described further below). For each partial product, the partial product is a selected multiple of a corresponding portion of the second operand src_b, where that multiple can range from +2*R to −2*R for radix-4 and from +4*R to −4*R for radix-8 (where R is the value corresponding to the selected portion of bits of the second operand src_b that is relevant for a given adder array 56). Different adder arrays 56 may have their partial products selected based on different portions of the second operand src_b. As well as selecting the partial products, the partial product selection circuitry 54 may also include circuitry for generating the multiple values available for selection as the partial products, e.g. including shifting circuitry, negation circuitry, and/or adding circuitry to generate the required +2*R to −2*R or +4*R to −4*R multiples for each portion of the second operand src_b. The circuitry for generating the multiple values based on src_b may operate in parallel with the Booth encoding circuitry 52 generating the partial product selection indicators 62 based on the first operand src_a.
Each adder array 56 receives its set of partial products, and when enabled based on a corresponding enable control signal 64 provided by the enable control circuitry 60, adds its partial products to generate a corresponding product representing value 66 which represents a result of multiplication of a respective pair of portions of bits selected from the first operand src_a and the second operand src_b. A given product representing value 66 represents the numeric result M*R of the product of M (a number represented by a selected portion of bits of the first operand src_a) and R (a number represented by a selected portion of bits of the second operand src_b). For each adder array 56, the portions selected from the first and second operands to represent M and R may be different. To speed up addition of partial products, each adder array 56 may be implemented as a carry-save-adder tree which performs a series of carry-save additions (not carry-propagate additions), which reduces processing time by allowing parallel processing of additions in different bit lanes because there is no dependence of the addition in one bit lane on carries generated in lower bit lanes. Therefore, the product representing value 66 for a given adder array may be represented in a carry-save representation using a sum term and a carry term. To generate a binary result for M*R in a two's complement representation, this may require a further addition of the sum term and the carry term using a carry-propagate adder (not shown in
The respective adder arrays 56 are sized to handle different data element configurations within the first and second operands. For example, one subset of adder arrays 56 may implement the additions of partial products for respective pairwise multiplications of pairs of 8-bit data elements within the first and second operands. A second subset of adder arrays 56 may implement the partial product additions for respective pairwise multiplications of pairs of 16-bit data elements within the first and second operands. A third subset of adder arrays 56 may implement the partial product additions for respective pairwise multiplications of pairs of 32-bit data elements within the first and second operands. It will be appreciated that this is just one example of different data element configurations that can be implemented. However, it can be useful to provide separate distinct adder arrays sized appropriately for each data element size configuration, rather than implementing all the data element sizes using a single larger adder array, as this can be more energy efficient because it allows the adder arrays 56 corresponding to data element size configurations not required for a given multiplication operation to be disabled to save power.
In this example, each adder array 56 has its own independent enable control signal 64 which is set independently by the enable control circuitry 60 to independently control whether each adder array is currently enabled or disabled. For example, each enable control signal 64 may be a clock signal used to clock components of the adder array 56, so the enable control circuitry 60 may disable a given adder array by clamping the corresponding clock signal to a fixed value. By preventing the clock signal from toggling, the adder array can be disabled and dynamic power is saved. Other implementations may use a different form of enable control, such as power gating where the enable control signal 64 controls enabling/disabling of the adder array 56 by turning on/off a power gate which controls whether the adder array 56 is coupled to or isolated from a power supply node.
Other examples may control the independent enable/disable of the adder arrays at a coarser granularity, on a subset by subset basis. For example, the first, second and third subsets of adder arrays 56 described above could each be provided with an independent enable control signal 64, but adder arrays within the same subset could be enabled/disabled collectively based on the same enable control signal. However, in practice, having independent enable/disable control for each adder array 56 as shown in the example of
The respective adder arrays 56 can also be used cooperatively to implement a larger multiplication, such as the multiplication of wider portions of bits of src_a and src_b which comprise all magnitude-indicating bits of the first and second operands. When the data element size configuration information indicates that the cooperative mode is to be used, the respective product representing values 66 produced by at least a subset of adder arrays (e.g. all of the adder arrays) are added together by the cooperative mode adder 58, to produce a product result indicating the numeric value corresponding to the product of wider portions of src_a and src_b than the portions considered by any individual adder array. For example, in the implementation discussed in more detail below, the cooperative mode implements a 64-bit*64-bit multiplication, by adding the product representing values 66 generated by the various adder arrays 56 which in a non-cooperative mode would handle smaller 8-bit, 16-bit or 32-bit multiplications, as well as a further product representing value generated by a further adder array 56 which handles spare bits of the 64-bit multiplication that are not covered by the other adder arrays 56.
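The arithmetic identity that makes cooperative operation possible can be illustrated as follows. This sketch assumes unsigned operands and a uniform tiling into 8-bit chunks, which is an illustrative choice only and is not the mapping of partial products onto the adder arrays 56; the function names are hypothetical.

```python
def chunks(x, n_chunks=8, chunk_bits=8):
    """Split x into n_chunks unsigned chunks, least significant first."""
    mask = (1 << chunk_bits) - 1
    return [(x >> (chunk_bits * i)) & mask for i in range(n_chunks)]


def cooperative_product(a, b, n_chunks=8, chunk_bits=8):
    """Wide product formed by adding aligned sub-products of every chunk pair."""
    total = 0
    for i, ai in enumerate(chunks(a, n_chunks, chunk_bits)):
        for j, bj in enumerate(chunks(b, n_chunks, chunk_bits)):
            total += (ai * bj) << (chunk_bits * (i + j))
    return total


def elementwise_products(a, b, n_chunks=8, chunk_bits=8):
    """Non-cooperative view: only matching chunk pairs are multiplied."""
    return [x * y for x, y in zip(chunks(a, n_chunks, chunk_bits),
                                  chunks(b, n_chunks, chunk_bits))]


a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
print(cooperative_product(a, b) == a * b)
print(elementwise_products(a, b)[:2])
```

The sub-products with matching chunk indices are the ones that a non-cooperative configuration reports as independent element results, whereas the cooperative result additionally needs the cross terms and a final addition across all sub-products.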
Hence, a larger adder array for a multiplier is constructed from a number of smaller (sub) arrays 56. Some of the smaller subarrays are sized (in terms of power and area) for multiplying smaller data types in a non-cooperative mode. The subarrays 56 can also be used cooperatively to construct larger logical arrays (e.g. for the largest data type).
In the examples discussed below, it is recognised that there is an opportunity to save area by realizing that the subarrays 56 depend on Booth encoding of the same operand src_a or the same subsections of that operand src_a. They do this regardless of whether they are running native sized multiplications in the non-cooperative mode or working collectively to compute the larger datatype in the cooperative mode. An area and timing pressure reducing solution is to pull Booth encoders out of each subarray and implement one set of Booth encoders (collectively referred to as shared Booth encoding circuitry 52) shared between all the adder arrays 56, so that at least two of the adder arrays each receive a partial product selected based on the same partial product selection indicator 62 corresponding to a Booth encoding of a same Booth digit generated by the same set of hardware circuit logic in the shared Booth encoding circuitry.
For comparison,
When the data element size configuration to be used is 8-bit, 16-bit or 32-bit, the corresponding subset of adder arrays is enabled and each adder array within that subset receives partial products selected depending on a corresponding pair of 8/16/32-bit data elements within the first and second operands src_a, src_b. The results of each adder array can be assembled into a vector result (e.g. by result assembly circuitry 58 which in
It is possible to operate the adder arrays so that the subsets of adder arrays for more than one of the data element size configurations are enabled in parallel, to produce (based on the same source operands src_a, src_b) a first vector result corresponding to one size configuration (e.g. 32-bit) and a second vector result corresponding to a second size configuration (e.g. 16-bit). This may require the result assembly circuitry to be duplicated to allow for output of multiple independent results in the same cycle.
In the 64-bit cooperative configuration, all the adder arrays are enabled, and the respective product representing values 66 generated by the adder arrays are further added by cooperative mode adder 58 to produce a 64-bit multiplication result. Optionally, for a multiply-accumulate operation, a further adder 72 may add the multiplication result (or multiplication results for each data element lane) to corresponding elements of a third operand (still in carry save form to speed up the further adder 72 compared to carry propagate additions).
Optionally, the carry and save terms output by the result assembly circuitry, cooperative mode adder 58 or further adder 72 can be added by a carry propagate adder to produce a result in 2's complement representation, but this is not essential as often the multiply operation may be one of a series of multiply-accumulate operations and so it may be more efficient to retain the result in carry-save form to allow the further adder 72 to perform a faster addition to a previous accumulation result also in carry-save form (with the carry-propagate operation for converting to 2's complement being deferred until after the final accumulation is performed).
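The benefit of staying in carry-save form can be illustrated with a small behavioural model. The helper names below are hypothetical, and `compress` reduces the partial products with a simple chain of 3:2 compressions for brevity, whereas a hardware adder array would typically arrange the same compressors as a tree.

```python
def carry_save_add(x, y, z):
    """3:2 compression: three inputs -> (sum, carry), with no carry ripple."""
    s = x ^ y ^ z                               # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1      # per-bit carry, one place higher
    return s, c


def compress(values):
    """Reduce a list of partial products to a carry-save (sum, carry) pair."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s, c


def accumulate(acc, product):
    """Add a carry-save product into a carry-save accumulator."""
    s, c = carry_save_add(acc[0], acc[1], product[0])
    return carry_save_add(s, c, product[1])


acc = (0, 0)
for partial_products in ([47 << 6, -(47 << 3)], [3, 5, 7]):
    acc = accumulate(acc, compress(partial_products))

# Single carry-propagate addition, deferred until after the final accumulation.
final = acc[0] + acc[1]
print(final)
```

Each bit of the sum and carry terms depends only on the same bit position of the inputs, which is why the per-step additions avoid carry propagation and only the last step pays for it.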
In contrast,
As the Booth encoders are shared between adder arrays, this means that more than one adder array 56 receives a partial product selected using the same partial product selection indicator generated by the same Booth encoder (same piece of hardware circuit logic). It is recognised that all subarrays 56 should receive partial products based on Booth encodings of the same sections or subsections of one input operand src_a, and so there can be one set of global Booth encoders across the 64-bit input operand. These shared global Booth encoders feed each of the 15 subarray multipliers. When Booth encoders are shared between adder arrays 56 supporting different data element size configurations, the Booth digits that span a data element boundary in one configuration but are at an intermediate point within a data element for another configuration may require different input bits for different element size configurations. Hence, for example, for the 8-bit boundaries within the first operand, there are special Booth encoders that have selection logic to select whether a lower bit of the Booth digit is treated as 0 or is selected from src_a. This is described in more detail below.
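A behavioural sketch of shared Radix-4 Booth encoding with boundary selection is given below. It assumes signed two's complement data elements and models each encoder output as a signed multiple in the range −2 to +2; the function names and the way each element picks its slice of the shared digits are illustrative assumptions rather than a description of the circuit wiring.

```python
def shared_booth_encode(src_a, elem_bits):
    """Model 32 shared Radix-4 encoders (benc0..benc31) over a 64-bit src_a.

    Where a Booth digit starts at a data element boundary for the current
    element size, its bottom bit is forced to 0; otherwise the bottom bit
    is the src_a bit just below the digit.
    """
    digits = []
    for i in range(32):
        at_boundary = (2 * i) % elem_bits == 0
        low = 0 if at_boundary else (src_a >> (2 * i - 1)) & 1
        mid = (src_a >> (2 * i)) & 1
        top = (src_a >> (2 * i + 1)) & 1
        digits.append(low + mid - 2 * top)
    return digits


def signed_elements(x, elem_bits):
    """Split a 64-bit value into signed two's complement elements."""
    mask = (1 << elem_bits) - 1
    out = []
    for e in range(64 // elem_bits):
        v = (x >> (e * elem_bits)) & mask
        out.append(v - (1 << elem_bits) if v >> (elem_bits - 1) else v)
    return out


def multiply_elements(src_a, src_b, elem_bits):
    """Per-element signed products, all driven by the one shared digit list."""
    digits = shared_booth_encode(src_a, elem_bits)
    per_elem = elem_bits // 2
    results = []
    for e, r in enumerate(signed_elements(src_b, elem_bits)):
        mine = digits[e * per_elem:(e + 1) * per_elem]
        results.append(sum((d * r) << (2 * k) for k, d in enumerate(mine)))
    return results


a, b = 0x80FF7F0123456789, 0xFEDCBA98F6E5D4C3
print(multiply_elements(a, b, 32))
```

In this model the only encoders whose output can differ between the 8-bit configuration and the 64-bit configuration are those whose Booth digit starts at a multiple of 8 bits above bit 0, which is why only a few of the shared encoders need the extra selection logic.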
The sharing of Booth encoders between adder arrays 56 enables a great circuit area and power saving. In the approach shown in
In contrast, in
The large decrease in encoders also enables a flop stage to be economically split between Booth encoding 52, Booth multiplexing (partial product selection 54) and partial product compression (addition of partial products 56). Previously, 143 sets of Booth encoded values would need to be flopped to split the pipeline at this point; now only 32 need to be flopped (plus 8 single-bit flops for unsigned multiplication). This is a similar size to flopping the encoded 64-bit operand, which no longer needs to be encoded in a second flop stage due to the large number of flops required to do so. This is the timing pressure reduction this technique enables.
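The counts quoted above can be reproduced by a simple tally. The tally below rests on assumptions that are made here for illustration: that, without sharing, every adder array has one Radix-4 Booth encoder per two bits of the src_a portion it covers plus one unsigned encoder, and that the extra bits adder array spans all 64 bits of src_a. The dictionary and function names are hypothetical.

```python
# Assumed arrangement: name -> (number of arrays, src_a bits covered by each).
ARRAYS = {
    "8 bit mul": (8, 8),
    "16 bit mul": (4, 16),
    "32 bit mul": (2, 32),
    "extra bits mul": (1, 64),
}


def non_shared_encoder_sets():
    """One encoder per 2 bits plus one unsigned encoder, for every array."""
    return sum(n * (bits // 2 + 1) for n, bits in ARRAYS.values())


def shared_encoder_sets():
    """One encoder per 2 bits of src_a, plus one unsigned encoder per 8 bits."""
    return 64 // 2, 64 // 8


print(sum(n for n, _ in ARRAYS.values()))   # number of adder arrays
print(non_shared_encoder_sets())
print(shared_encoder_sets())
```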
The 8-bit, 16-bit and 32-bit configurations are examples of non-cooperative modes as each adder array produces an independent product representing value which (subject to assembly into a vector result value and possible conversion from carry-save form to 2's complement) does not require further additions with other product representing values to produce a multiplication result.
In contrast,
It will be appreciated that
In contrast, with the approach shown in
Hence, while
At the lowest Booth digit in each data element, the bottom bit of the Booth digit is 0. When considering the different element size configurations, the Booth digits encoded by Booth encoders benc4, benc8, benc12, benc20, benc24, benc28 span a data element boundary for at least one data element size configuration, but are at an intermediate position within an element for at least one other data element size configuration. Also, the Booth digit encoded by Booth encoder benc16 spans data element boundaries for each of the 8-bit, 16-bit and 32-bit configurations, but is at the intermediate position within the 64-bit values being multiplied when the cooperative 64-bit configuration is selected. Therefore, the Booth encoders marked with dashed lines have selection circuitry to select whether the bottom bit of that Booth digit is either a fixed value of 0b0 (used for data element size configurations where that Booth digit spans a data element boundary) or is the top bit of the next least significant 8-bit portion of src_a (used for data element size configurations where that Booth digit is at an intermediate position within a data element not spanning a data element boundary). The text under each 8-bit portion in
For each Booth encoder benc0 to benc31, the Booth encoder generates a partial product selection indicator to indicate which multiple of a corresponding portion of the second operand should be selected as a partial product, according to the Booth encoding rules for Radix 4:
If the multiplication operation is being performed on signed values represented in two's complement format, the top bit of each element will be a sign bit. Hence, for an example using Radix-4 assuming data element sizes having a power-of-2 number of bits, although nominally each data element would require one further Booth digit to be encoded corresponding to a sign extension of the top bit of that data element, that further Booth digit would always have a value of 000 or 111 (as the sign extension causes the top bit of the data element to be copied to the other two bits of the Booth digit), and so this would not require any further partial product to be added as the Booth encoding rules shown above give a multiple of 0 for both 000 and 111. Therefore, if only signed multiplications are to be supported, for the Radix-4 example it is sufficient to use the Booth encoders benc0 to benc31 shown in
However, if unsigned multiplications are to be supported for Radix-4, then the top bit of each element could be 0 or 1 and is padded with 0s (not sign extension bits) to form the top Booth digit in each data element. This means the possible values for the top Booth digit in each element are 000 or 001, and so as the 001 alternative requires an addition of a +1*R multiple, a further partial product may be needed. Therefore, to support unsigned multiplications with the smallest data element size configuration for the 8-bit example, each 8-bit portion of the first operand src_a is also provided with a further unsigned Booth encoder “umul”; hence there are 8 such unsigned Booth encoders, labelled umul4, umul8, umul12, umul16, umul20, umul24, umul28, umul32 respectively (the numbers used to label the unsigned Booth encoders umul4-umul28 being chosen to denote that the Booth digits encoded by these unsigned Booth encoders are of equivalent significance to the Booth digits encoded by Booth encoders benc4, benc8, benc12, benc16, benc20, benc24, benc28 respectively and that umul32 is encoding a Booth digit more significant than the Booth digit encoded by benc31). Each unsigned Booth encoder can have simpler hardware circuit logic than the other Booth encoders benc0-benc31, because it only needs to select from two options (0 or +1*R) based on whether the top bit of the corresponding 8-bit portion of src_a is 0 or 1 respectively.
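The role of the unsigned Booth encoder can be illustrated for a single 8-bit element. In the sketch below, the four ordinary Radix-4 digits are computed exactly as for a signed value, and one extra partial product of 0 or +1*R is added at the next digit position depending on the top bit; `signed_digits` and `unsigned_multiply` are hypothetical names, and both operands are assumed to be unsigned values that fit in `width` bits.

```python
def signed_digits(a, width=8):
    """Radix-4 Booth digits of `a`, interpreting its bits as two's complement."""
    def bit(i):
        return 0 if i < 0 else (a >> i) & 1
    return [bit(2 * k - 1) + bit(2 * k) - 2 * bit(2 * k + 1)
            for k in range(width // 2)]


def unsigned_multiply(a, r, width=8):
    """Unsigned a*r from the signed digits plus one 'umul' partial product."""
    pps = [(d * r) << (2 * k) for k, d in enumerate(signed_digits(a, width))]
    top_bit = (a >> (width - 1)) & 1
    umul_pp = (r << width) if top_bit else 0    # top digit is 000 or 001
    return sum(pps) + umul_pp


print(unsigned_multiply(200, 47))
```

The extra partial product compensates for the negative weight that the ordinary top digit gives to the most significant bit, which is why a two-way selection on that single bit is sufficient.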
On the other hand, if a higher radix (e.g. Radix-8) was used, or if Radix-4 was applied to data elements of a size other than an exact power of 2, then depending on the Radix and the data element size used, some Booth digits could be aligned relative to the element boundary so that the further Booth digit which represents a sign-extension of the top bits of the data element could include two or more bits of src_a, not just one bit of src_a as in the Radix-4 example of
For example, for adder array “8 bit mul 0”, the Booth encoders used are benc3, benc2, benc1, benc0 (and for an unsigned multiplication, umul4). It can be seen from
Note that the extra bit adder array 56 uses all the Booth encoders benc31 to benc0 (because it is mapped in
However, it can be seen that each of the other adder arrays can be considered to be part of a subset of adder arrays corresponding to a particular data element size configuration (e.g. the 8-bit adder array subset comprising 8 bit mul 0 to 8 bit mul 7, the 16-bit adder array subset comprising 16 bit mul 0 to 16 bit mul 3 and the 32-bit adder array subset comprising 32 bit mul 0 and 32 bit mul 1) and that within a given one of these subsets, the adder arrays of that subset collectively use all the Booth encoders benc31 to benc0 but any one adder array of that subset only uses some of those Booth encoders. Each adder array of the 8-bit, 16-bit and 32-bit subsets also uses one unsigned Booth encoder to support unsigned multiplication.
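The relationship between Booth encoder indices and the adder arrays of each subset can be expressed compactly, assuming (as in the Radix-4 example) that Booth encoder benc i encodes the Booth digit whose upper two bits are bits 2i+1 and 2i of src_a. The helper names below are hypothetical.

```python
def encoders_for(elem_bits, index):
    """benc indices used by adder array `index` of the subset for elem_bits."""
    per_array = elem_bits // 2            # one Radix-4 digit per 2 bits
    return list(range(index * per_array, (index + 1) * per_array))


def arrays_using(benc):
    """Index of the adder array in each subset that uses encoder `benc`."""
    bit = 2 * benc                        # lowest src_a bit covered by the digit
    return {size: bit // size for size in (8, 16, 32)}


print(encoders_for(8, 0))    # encoders feeding 8 bit mul 0
print(arrays_using(27))      # arrays fed by benc27, keyed by element size
```

Under this mapping, each encoder index appears exactly once within each subset, so every shared encoder output fans out to one adder array per data element size configuration.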
Note also the sharing of the Booth encoders between the subsets of the adder arrays. Any given one of the Booth encoders benc0 to benc31 is used by one adder array in each subset. For example, Booth encoder benc0 is used by 8 bit mul 0, 16 bit mul 0 and 32 bit mul 0 (as well as by the extra bits mul adder array). Booth encoder benc27 is used by 8 bit mul 6, 16 bit mul 3 and 32 bit mul 1 (as well as by the extra bits mul adder array). Hence, there are several adder arrays which each receive a respective partial product which is selected based on exactly the same partial product selection indicator generated by a same Booth encoder (i.e. a same piece of hardware circuit logic) of the shared Booth encoding circuitry. This greatly reduces the amount of hardware circuit area required to implement the Booth encoders compared to the non-shared approach shown in
Note that the specific mapping of the adder arrays shown in
It will be appreciated that the adder array diagrams in
At step 302, the partial product selection circuitry 54 selects, based on the second operand src_b and the partial product selection indicators 62 generated by the shared Booth encoding circuitry 52, the respective sets of partial products to be added by each adder array 56. At least two of the adder arrays 56 receive partial products selected based on the same partial product selection indicator based on Booth encoding of the same Booth digit of the first operand (e.g. as shown in
At step 304, any enabled adder arrays 56 (those adder arrays 56 for which enable control circuitry 60 has asserted the enable control signals 64 to cause the adder array to be enabled) add their respective partial products, to generate respective product representing values each representing a result of multiplication of a respective pair of portions of bits selected from the first and second operands src_a, src_b. Any adder arrays which are currently disabled (e.g. based on clamping a clock signal to a fixed value) do not add their partial products.
At step 306, if the current data element size configuration is based on a cooperative mode, the cooperative mode adder 58 adds the respective product representing values obtained by each adder array 56. In a non-cooperative mode (e.g. 8-bit, 16-bit or 32-bit configuration shown above), step 306 is not performed and instead the respective product representing values can be assembled into a result without further addition. In some implementations, the product representing values may be truncated to fit into a result of the same size as the input operands src_a, src_b. Alternatively, some variants may produce a result of double the size of the input operands, to preserve all bits of the product representing values.
At step 308, if performing a multiply-add operation, a further addition of the multiply result (either the result of the 64-bit multiplication produced at step 306 in the cooperative mode, or the assembled multiplication results in the non-cooperative mode configurations) and a third operand is performed, e.g. using vector adder 72 (the third operand can be a SIMD operand in the non-cooperative mode).
Concepts described herein may be embodied in computer-readable code for fabrication of an apparatus that embodies the described concepts. For example, the computer-readable code can be used at one or more stages of a semiconductor design and fabrication process, including an electronic design automation (EDA) stage, to fabricate an integrated circuit comprising the apparatus embodying the concepts. The above computer-readable code may additionally or alternatively enable the definition, modelling, simulation, verification and/or testing of an apparatus embodying the concepts described herein.
For example, the computer-readable code for fabrication of an apparatus embodying the concepts described herein can be embodied in code defining a hardware description language (HDL) representation of the concepts. For example, the code may define a register-transfer-level (RTL) abstraction of one or more logic circuits for defining an apparatus embodying the concepts. The code may define a HDL representation of the one or more logic circuits embodying the apparatus in Verilog, SystemVerilog, Chisel, or VHDL (Very High-Speed Integrated Circuit Hardware Description Language) as well as intermediate representations such as FIRRTL. Computer-readable code may provide definitions embodying the concept using system-level modelling languages such as SystemC and SystemVerilog or other behavioural representations of the concepts that can be interpreted by a computer to enable simulation, functional and/or formal verification, and testing of the concepts.
Additionally or alternatively, the computer-readable code may define a low-level description of integrated circuit components that embody concepts described herein, such as one or more netlists or integrated circuit layout definitions, including representations such as GDSII. The one or more netlists or other computer-readable representation of integrated circuit components may be generated by applying one or more logic synthesis processes to an RTL representation to generate definitions for use in fabrication of an apparatus embodying the invention. Alternatively or additionally, the one or more logic synthesis processes can generate from the computer-readable code a bitstream to be loaded into a field programmable gate array (FPGA) to configure the FPGA to embody the described concepts. The FPGA may be deployed for the purposes of verification and test of the concepts prior to fabrication in an integrated circuit or the FPGA may be deployed in a product directly.
The computer-readable code may comprise a mix of code representations for fabrication of an apparatus, for example including a mix of one or more of an RTL representation, a netlist representation, or another computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus embodying the invention. Alternatively or additionally, the concept may be defined in a combination of a computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus and computer-readable code defining instructions which are to be executed by the defined apparatus once fabricated.
Such computer-readable code can be disposed in any known transitory computer-readable medium (such as wired or wireless transmission of code over a network) or non-transitory computer-readable medium such as semiconductor, magnetic disk, or optical disc. An integrated circuit fabricated using the computer-readable code may comprise components such as one or more of a central processing unit, graphics processing unit, neural processing unit, digital signal processor or other components that individually or collectively embody the concept.
Concepts described herein may be embodied in a system comprising at least one packaged chip. The multiplication circuitry described earlier is implemented in the at least one packaged chip (either being implemented in one specific chip of the system, or distributed over more than one packaged chip). The at least one packaged chip is assembled on a board with at least one system component. A chip-containing product may comprise the system assembled on a further board with at least one other product component. The system or the chip-containing product may be assembled into a housing or onto a structural support (such as a frame or blade).
As shown in
In some examples, a collection of chiplets (i.e. small modular chips with particular functionality) may itself be referred to as a chip. A chiplet may be packaged individually in a semiconductor package and/or together with other chiplets into a multi-chiplet semiconductor package (e.g. using an interposer, or by using three-dimensional integration to provide a multi-layer chiplet product comprising two or more vertically stacked integrated circuit layers).
The one or more packaged chips 400 are assembled on a board 402 together with at least one system component 404. For example, the board may comprise a printed circuit board. The board substrate may be made of any of a variety of materials, e.g. plastic, glass, ceramic, or a flexible substrate material such as paper, plastic or textile material. The at least one system component 404 comprises one or more external components which are not part of the one or more packaged chips 400. For example, the at least one system component 404 could include any one or more of the following: another packaged chip (e.g. provided by a different manufacturer or produced on a different process node), an interface module, a resistor, a capacitor, an inductor, a transformer, a diode, a transistor and/or a sensor.
A chip-containing product 416 is manufactured comprising the system 406 (including the board 402, the one or more chips 400 and the at least one system component 404) and one or more product components 412. The product components 412 comprise one or more further components which are not part of the system 406. As a non-exhaustive list of examples, the one or more product components 412 could include a user input/output device such as a keypad, touch screen, microphone, loudspeaker, display screen, haptic device, etc.; a wireless communication transmitter/receiver; a sensor; an actuator for actuating mechanical motion; a thermal control device; a further packaged chip; an interface module; a resistor; a capacitor; an inductor; a transformer; a diode; and/or a transistor. The system 406 and one or more product components 412 may be assembled on to a further board 414.
The board 402 or the further board 414 may be provided on or within a device housing or other structural support (e.g. a frame or blade) to provide a product which can be handled by a user and/or is intended for operational use by a person or company.
The system 406 or the chip-containing product 416 may be at least one of: an end-user product, a machine, a medical device, a computing or telecommunications infrastructure product, or an automation control system. As a non-exhaustive list of examples, the chip-containing product could be any of the following: a telecommunications device, a mobile phone, a tablet, a laptop, a computer, a server (e.g. a rack server or blade server), an infrastructure device, networking equipment, a vehicle or other automotive product, industrial machinery, consumer device, smart card, credit card, smart glasses, avionics device, robotics device, camera, television, smart television, DVD player, set top box, wearable device, domestic appliance, smart meter, medical device, heating/lighting control device, sensor, and/or a control system for controlling public infrastructure equipment such as smart motorway or traffic lights.
Some examples are set out in the following clauses:
1. Multiplication circuitry comprising:
a plurality of adder arrays each to add a respective set of partial products to generate a respective product representing value representing a result of multiplication of a respective pair of portions of bits selected from a first operand and a second operand, the plurality of adder arrays comprising separate instances of hardware circuitry, the plurality of adder arrays having at least two separate enable control signals for independently controlling whether at least two subsets of the adder arrays are enabled or disabled;
shared Booth encoding circuitry, shared between the plurality of adder arrays, to Booth encode the first operand to generate a plurality of partial product selection indicators, each partial product selection indicator corresponding to a Booth encoding of a respective Booth digit of the first operand; and
partial product selection circuitry to select, based on the second operand and the plurality of partial product selection indicators, the partial products to be added by the plurality of adder arrays; in which:
at least two of the adder arrays are configured to operate on respective partial products selected by the partial product selection circuitry based on a same partial product selection indicator generated by the shared Booth encoding circuitry based on the Booth encoding of a same Booth digit of the first operand.
2. The multiplication circuitry according to clause 1, wherein the multiplication circuitry is configured to support at least two data element size configurations for multiplication of one or more respective pairs of data elements selected from the first operand and the second operand, each data element size configuration corresponding to a different combination of data element sizes for the data elements selected from the first operand and the second operand.
3. The multiplication circuitry according to clause 2, comprising enable control circuitry to select, based on the current data element size configuration to be used for the multiplication operation, which of the adder arrays are to be enabled and which of the adder arrays are to be disabled.
4. The multiplication circuitry according to any of clauses 2 and 3, in which, for the Booth encoding of a given Booth digit of the first operand:
the partial product selected for the adder array of the first subset based on said same partial product selection indicator is dependent on a first subset of bits of the second operand when the multiplication is to be performed according to the first data element size configuration; and
the partial product selected for the adder array of the second subset based on said same partial product selection indicator is dependent on a second subset of bits of the second operand when the multiplication is to be performed according to the second data element size configuration.
8. The multiplication circuitry according to any of clauses 5 to 7, comprising enable control circuitry to:
product adding circuitry to add, for a multiplication operation in which two or more of the adder arrays are to operate in a cooperative mode, the respective product representing values generated by two or more of the adder arrays to generate a product result value representing the multiplication of wider portions of bits of the first operand and the second operand than the portions of bits used to generate the respective product representing values for any one of said two or more of the adder arrays.
12. The multiplication circuitry according to clause 11, in which for at least one adder array, the portion of the second operand used to form the partial products for that adder array is variable depending on whether the multiplication operation is to be performed in the cooperative mode.
13. The multiplication circuitry according to any of clauses 11 and 12, in which said wider portions of bits of the first operand and the second operand comprise all magnitude-indicating bits of the first operand and the second operand.
14. The multiplication circuitry according to any of clauses 11 to 13, in which the plurality of adder arrays comprise:
processing circuitry to perform data processing in response to instructions;
the processing circuitry comprising the multiplication circuitry according to any of clauses 1 to 16.
18. A system comprising:
wherein the at least one packaged chip and the at least one system component are assembled on the board.
19. A chip-containing product comprising the system of clause 18 assembled on a further board with at least one other product component.
20. A method comprising:
using shared Booth encoding circuitry shared between a plurality of adder arrays, Booth encoding a first operand to generate a plurality of partial product selection indicators, each partial product selection indicator corresponding to a Booth encoding of a respective Booth digit of the first operand, wherein the plurality of adder arrays comprise separate instances of hardware circuitry, the plurality of adder arrays having at least two separate enable control signals for independently controlling whether at least two subsets of the adder arrays are enabled or disabled;
selecting, based on a second operand and the plurality of partial product selection indicators, respective sets of partial products to be added by the plurality of adder arrays; and
adding the partial products using enabled adder arrays of the plurality of adder arrays to generate respective product representing values each representing a result of multiplication of a respective pair of portions of bits selected from the first operand and the second operand,
wherein at least two of the adder arrays are configured to operate on respective partial products selected based on a same partial product selection indicator generated by the shared Booth encoding circuitry based on the Booth encoding of a same Booth digit of the first operand.
21. A computer-readable medium to store computer-readable code for fabrication of multiplication circuitry comprising:
a plurality of adder arrays each to add a respective set of partial products to generate a respective product representing value representing a result of multiplication of a respective pair of portions of bits selected from a first operand and a second operand, the plurality of adder arrays comprising separate instances of hardware circuitry, the plurality of adder arrays having at least two separate enable control signals for independently controlling whether at least two subsets of the adder arrays are enabled or disabled;
shared Booth encoding circuitry, shared between the plurality of adder arrays, to Booth encode the first operand to generate a plurality of partial product selection indicators, each partial product selection indicator corresponding to a Booth encoding of a respective Booth digit of the first operand; and
partial product selection circuitry to select, based on the second operand and the plurality of partial product selection indicators, the partial products to be added by the plurality of adder arrays; in which:
at least two of the adder arrays are configured to operate on respective partial products selected by the partial product selection circuitry based on a same partial product selection indicator generated by the shared Booth encoding circuitry based on the Booth encoding of a same Booth digit of the first operand.
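By way of illustration only, the arrangement set out in the clauses above can be sketched as a small behavioural model. The sketch below is not an implementation of the described circuitry: it assumes radix-4 Booth encoding, an 8-bit signed first operand, and a 16-bit signed second operand split into two 8-bit portions for the cooperative mode, and all function names (booth_radix4_digits, select_partial_products, adder_array, multiply, cooperative_product) are hypothetical names chosen for this example. It shows the first operand being Booth encoded once, the same Booth digits being used to select partial products for more than one adder array, each adder array having its own enable, and the respective products being added to form a wider product.

```python
def booth_radix4_digits(a: int, width: int) -> list[int]:
    """Radix-4 Booth digits (each in -2..2) of a signed `width`-bit value, LSB first."""
    assert width % 2 == 0, "this sketch assumes an even operand width"
    bits = a & ((1 << width) - 1)  # two's complement bit pattern of a
    digits = []
    prev = 0  # implicit zero bit to the right of bit 0
    for i in range(0, width, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def select_partial_products(digits: list[int], b_portion: int) -> list[int]:
    """Select one weighted partial product per Booth digit for one adder array."""
    return [(d * b_portion) << (2 * i) for i, d in enumerate(digits)]


def adder_array(partial_products: list[int], enable: bool):
    """Add the partial products; a disabled array produces no result (None)."""
    return sum(partial_products) if enable else None


def multiply(a: int, b_portions: list[int], enables: list[bool], width: int = 8):
    """Encode `a` once, then reuse the same digits for every adder array."""
    digits = booth_radix4_digits(a, width)  # shared Booth encoding
    results = []
    for b_portion, enable in zip(b_portions, enables):
        if enable:
            pps = select_partial_products(digits, b_portion)
            results.append(adder_array(pps, True))
        else:
            results.append(adder_array([], False))
    return results


def cooperative_product(a: int, b: int, width: int = 8) -> int:
    """Cooperative mode: two arrays handle the low and high portions of `b`."""
    lo = b & ((1 << width) - 1)  # low portion, treated as unsigned
    hi = b >> width              # high portion, carries the sign of b
    p_lo, p_hi = multiply(a, [lo, hi], [True, True], width)
    return p_lo + (p_hi << width)  # product adding step
```

In this model, multiply(-7, [3, 5], [True, False]) leaves the second adder array disabled while the first still produces its product from the shared digits, and cooperative_product(-7, 1000) combines the two respective products into the full-width result.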
In the present application, the words “configured to . . . ” are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a “configuration” means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. “Configured to” does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.
In the present application, lists of features preceded with the phrase “at least one of” mean that any one or more of those features can be provided either individually or in combination. For example, “at least one of: A, B and C” encompasses any of the following options: A alone (without B or C), B alone (without A or C), C alone (without A or B), A and B in combination (without C), A and C in combination (without B), B and C in combination (without A), or A, B and C in combination.
Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims.