The present disclosure relates to data processing.
In a data processing apparatus, multiplication operations can be processor and power intensive. In addition, they are often large, which itself leads to an increase in power consumption for the data processing apparatus. It is desirable to implement multiplication circuits in such a way that they are small and such that they are able to operate with a small amount of power.
Viewed from a first example configuration, there is provided a data processing apparatus comprising: an A×B multiplier array comprising a first plurality of logic gates clocked by a first clock signal, where A and B are both integers; a C×D multiplier array, separate from the A×B multiplier array, comprising a second plurality of logic gates clocked by a second clock signal, where C and D are both integers; and addition circuitry configured to perform an addition operation between a first at least partial product produced by the A×B multiplier array and a second at least partial product produced by the C×D multiplier array.
Viewed from a second example configuration, there is provided a data processing method comprising: providing a first clock signal to an A×B multiplier array comprising a first plurality of logic gates, where A and B are both integers; providing a second clock signal to a C×D multiplier array, separate from the A×B multiplier array, comprising a second plurality of logic gates, where C and D are both integers; and performing an addition operation between a first at least partial product produced by the A×B multiplier array and a second at least partial product produced by the C×D multiplier array.
Viewed from a third example configuration, there is provided a non-transitory computer-readable medium to store computer-readable code for fabrication of a data processing apparatus comprising: an A×B multiplier array comprising a first plurality of logic gates clocked by a first clock signal, where A and B are both integers; a C×D multiplier array, separate from the A×B multiplier array, comprising a second plurality of logic gates clocked by a second clock signal, where C and D are both integers; and addition circuitry configured to perform an addition operation between a first at least partial product produced by the A×B multiplier array and a second at least partial product produced by the C×D multiplier array.
The present invention will be described further, by way of example only, with reference to embodiments thereof as illustrated in the accompanying drawings, in which:
Before discussing the embodiments with reference to the accompanying figures, the following description of embodiments is provided.
In accordance with one example configuration there is provided a data processing apparatus comprising: an A×B multiplier array comprising a first plurality of logic gates clocked by a first clock signal, where A and B are both integers; a C×D multiplier array, separate from the A×B multiplier array, comprising a second plurality of logic gates clocked by a second clock signal, where C and D are both integers; storage circuitry to simultaneously store a first at least partial product produced by the A×B multiplier array and a second at least partial product produced by the C×D multiplier array; and addition circuitry to perform an addition operation on contents of the storage circuitry.
In the above examples, both an A×B multiplier array and a C×D multiplier array are provided. The A×B multiplier array comprises a set of logic gates (e.g. AND gates) that are arranged in order to produce the at least partial products for a multiplication of a first A bits of a first operand and B bits of a second operand. Similarly, the C×D multiplier array contains a different set of logic gates (e.g. AND gates that independent from the logic gates of the A×B multiplier array) and is able to perform a multiplication on C bits of a first operand and D bits of a second operand. The operands operated on by the A×B multiplier array and the C×D multiplier array could be completely different operands or could be the same operands, as will be discussed below. Each of the arrays receives a different clock signal. In this way, it is possible to effectively ‘freeze’ one of the arrays, preventing it from moving data around. This acts as an efficient way to enable or disable one of the circuits, since without the ‘tick’ of a clock signal, no output is produced by the relevant multiplier array and energy is not consumed as a result of switching. Furthermore, as opposed to a technique such as data gating (in which an enable/disable signal is asserted to the individual logic gates to indicate that they should or should not operate), this can be achieved without a further set of logic gates being added. This is important because although data gating might, for instance, help to reduce power consumption, it does so as the addition of a further set of logic gates—each of which will consume a small amount of power (e.g. due to leakage currents and so on). In these examples, the at least partial products produced by the A×B multiplier array and the C×D multiplier array can be simultaneously stored within the same storage circuitry and added together by the addition circuitry. Thus, it is possible for each of the multiplier arrays to be used in isolation or together in order to perform a multiplication operation. Consequently, circuit space is also saved as a consequence of not merely having a large number of independent multiplier arrays. Note that although the clock signal provided to each of the multiplier arrays is different (in at least some modes of operation), the clock signals could have the same source and could also have the same value (in some other modes of operation).
In some examples, in an A×B mode of operation, the first clock signal operates at a higher frequency than the second clock signal. A higher frequency clock signal produces the ‘tick’ signal at a higher rate than a lower frequency clock signal. Consequently, the speed at which data passes through transistors/logic gates of the A×B multiplier array is increased as compared to the C×D multiplier array.
In some examples, in the A×B mode of operation, the first clock signal is such that the contents of the storage circuitry on which the addition operation is performed by the addition operation includes the first at least partial product and excludes the second at least partial product. The clock signal provided to the A×B multiplier array is therefore sufficiently fast that when the addition operation is to be performed, the at least partial product(s) produced by the A×B multiplication will have completed. In contrast, the clock signal provided to the C×D multiplier array is so slow (or non-existent) such that no at least partial product(s) are produced. Consequently, the only addition to be performed by the addition circuitry is on the at least partial product(s) produced by the A×B multiplier array. Thus, in the A×B mode of operation, multiplication is performed on A bits of a first operand and B bits of a second operand, without any significant energy being consumed by the C×D multiplier array. It is therefore possible to perform an A×B multiplication without significant expenditure of energy on other circuits.
In some examples, in a C×D mode of operation, the second clock signal operates at a higher frequency than the first clock signal. A higher frequency clock signal produces the ‘tick’ signal at a higher rate than a lower frequency clock signal. Consequently, the speed at which data passes through transistors/logic gates of the C×D multiplier array is increased as compared to the A×B multiplier array.
In some examples, in the C×D mode of operation, the second clock signal is such that the contents of the storage circuitry on which the addition operation is performed by the addition operation excludes the first at least partial product and includes the second at least partial product. The clock signal provided to the C×D multiplier array is therefore sufficiently fast that when the addition operation is to be performed, the at least partial product(s) produced by the C×D multiplication will have completed. In contrast, the clock signal provided to the A×B multiplier array is so slow (or non-existent) such that no at least partial product(s) are produced. Consequently, the only addition to be performed by the addition circuitry is on the at least partial product(s) produced by the C×D multiplier array. Thus, in the C×D mode of operation, multiplication is performed on C bits of a first operand and D bits of a second operand, without any significant energy being consumed by the A×B multiplier array. It is therefore possible to perform a C×D multiplication without significant expenditure of energy on other circuits.
In some examples, in a combined mode of operation, the first clock signal and the second clock signal operate at frequencies such that the contents of the storage circuitry on which the addition operation is performed by the addition operation includes the first at least partial product and the second at least partial product. In this mode of operation, the at least partial products produced by each of the A×B multiplier and the C×D multiplier are added together as part of the addition operation due to the first clock signal and the second clock signal being comparable.
In some examples, in the combined mode of operation, the first clock signal and the second clock signal operate at a same frequency. In these examples, the first clock signal and the second clock signal could be the same signal (i.e. operating from a same source signal without the signal being modified differently for the A×B multiplier array or the C×D multiplier array).
In some examples, in the combined mode of operation, the A×B multiplier array and the C×D multiplier array cooperate to perform an M×N multiplication where M=A+C and N=B+D. The two multiplier arrays can therefore be ‘combined’ in order to perform a multiplication on larger operands than can be achieved individually on the A×B multiplier array or the C×D multiplier array. In particular, the M bits of a first operand can be split between the A×B multiplier array and the C×D multiplier array, together with the N bits of a second operand. By then operating each of the A×B multiplier array and the C×D multiplier array to produce at least partial products, and then combining the at least partial products in the addition operation, the result is the same as if a single M×N multiplier array had been used.
In some examples, in the combined mode of operation, A bits of a first operand are processed by the A×B multiplier array, B bits of a second operand are processed by the A×B multiplier array, C bits of the first operand are processed in the C×D multiplier array, and D bits of the second operand are processed in the C×D multiplier array. The two multiplier arrays are therefore ‘filled’ in order to effectively perform the M×N multiplication. Of course, it is also possible for a multiplication smaller than M×N to be performed using the same circuitry (e.g. a multiplication bigger than either of A×B or C×D, yet smaller than M×N) by padding bits in the M×N multiplication circuitry. However, this can result in wasted energy consumption as a consequence of some of the cells in the multiplication arrays being used to no useful effect. This can be alleviated by adding further multiplication arrays whose size more closely corresponds with the multiplications being performed.
In some examples, in the combined mode of operation, an upper A bits of a first operand and a lower B bits of a second operand are processed in the A×B multiplier array, and a lower C bits of the first operand and an upper D bits of the second operand are processed in the C×D multiplier array. With such a configuration, the resulting partial problems that are produced often fall within a smaller space than if the upper C bits of the first operand and lower D bits of the second operand are processed in the C×D multiplier array. This configuration therefore leads to a smaller storage circuitry being required in order to store the at least partial products that are produced by the multiplier arrays.
In some examples, the data processing apparatus comprises: an E×F multiplier array, separate from the A×B multiplier array and the C×D multiplier array, comprising a third plurality of logic gates clocked by a third clock signal, where E and F are both integers.
In some examples, the C×D multiplier array and the E×F multiplier array are multiplier array fragments and are non-square. In these examples, C and D differ from each other, and also E and F differ from each other, such that the multiplier arrays are non-square. These multiplier arrays can therefore be used to perform multiplications on differently sized operands.
In some examples, in the combined mode of operation, the A×B multiplier array and the C×D multiplier array and the E×F multiplier array cooperate to perform the M×N multiplication where M=A+C+E and N=B+D+F. Although the C×D multiplier array and the E×F multiplier array are non-square and can therefore be used to perform multiplications on differently sized operands, a more common use for the multiplier arrays is to expand the range of multiplication that can be achieved by another multiplier array (in this example, the A×B multiplier array). The A×B multiplier array can therefore be used to perform multiplication on one common size of operands. Meanwhile, larger multiplications can be performed (i.e. multiplication on larger operands can be achieved) by additionally using the C×D multiplier array and/or the E×F multiplier array. Bits of the two (large) operands are therefore split between the A×B multiplier array, the C×D multiplier array and the E×F multiplier array. The resulting at least partial products are then combined in the addition operation to produce a result of multiplying the two large operands.
In some examples, M and N are both 24. A 24×24 bit multiplication array is well-suited for performing multiplication on a single precision floating point number, whose mantissa is 23 bits (24 with the implicit whole number at the beginning).
In some examples, A and B are both 11. An 11×11 multiplication array is well-suited for performing multiplication on a half precision floating point number, whose mantissa is 10 bits (11 with the implicit whole number at the beginning). Consequently, such circuitry can efficiently perform multiplication for half precision floating point numbers. Meanwhile, by also using the C×D multiplier array, it is possible for larger multiplications to be performed. In some embodiments, the other multiplier arrays (C×D and/or E×F and/or others) are such that a 24×24 bit multiplication can also be efficiently performed, thereby enabling the efficient performance of both half and single precision floating point numbers. In one mode of operation, only the A×B multiplier array is used (by virtue of a higher frequency clock signal than is provided to the other multiplier arrays). In another mode of operation, the same clock signal can be provided to all of the multiplier arrays, and consequently, larger (e.g. 24×24 bit) multiplications can be performed. All of this can be achieved without having one 11×11 bit multiplier array and a separate 24×24 bit multiplier array, which would be wasteful of space. Similarly, this can be achieved without data gating, which would fail to reduce power consumption as much as the present techniques are able to achieve.
In some examples, the addition circuitry comprises first addition circuit for combining a first combination of at least partial products and a second addition circuit for combining a second combination of at least partial products in dependence on a selection signal. It may be that particular combinations of at least partial products are more efficiently added together. Consequently, different addition circuits can be provided to improve efficiency depending on the mode of operation in which the circuitry is operating. For example, if some multiplier arrays can be used to perform an integer dot product operation and other (or additional) multiplier arrays are also or alternatively used to perform floating-point related multiplications (e.g. of mantissas) then separate adder circuits could be provided to perform the integer dot product and to perform the floating point multiplication.
Particular embodiments will now be described with reference to the figures.
The execute stage 16 includes a number of processing units, for executing different classes of processing operation. For example the execution units may include an arithmetic/logic unit (ALU) 20 for performing arithmetic or logical operations; a floating-point unit (FPU) 22 for performing operations on floating-point values, a branch unit 24 for evaluating the outcome of branch operations and adjusting the program counter which represents the current point of execution accordingly; and a load/store unit 28 for performing load/store operations to access data in a memory system 8, 30, 32, 34. In this example the memory system include a level one data cache 30, the level one instruction cache 8, a shared level two cache 32 and main system memory 34. It will be appreciated that this is just one example of a possible memory hierarchy and other arrangements of caches can be provided. The specific types of processing unit 20 to 28 shown in the execute stage 16 are just one example, and other implementations may have a different set of processing units or could include multiple instances of the same type of processing unit so that multiple micro-operations of the same type can be handled in parallel. It will be appreciated that
In a data processing apparatus, multiplication proceeds via a number of stages, which closely resembles the concept of ‘long multiplication’ similar to the process by which multiplication of numbers can be achieved by hand. For example, consider the multiplication of a first operand a[3:0] with a second operand b[3:0] using the 4×4 multiplier array 200 shown in
These partial products are then added together to form the final product, which is the result of the multiplication. Given the number of additions that occur, the addition can be performed efficiently by means of a carry-save-adder (CSA) network (which might operate on each logical column individually). CSA networks are particularly efficient when chained together since they output addition results as pairs of (carry, save) words (e.g. they output a redundant representation). A final non-carry-save adder is then provided in order to add the carry and save words together to produce a final product for the multiplication, which is equal to s[3:0]*b[3:0].
The CSA network may form part of the multiplier array 200 or may be separate.
As the skilled person will appreciate, although
As explained above, different combinations of the multiplier arrays make it possible to perform multiplications of different sizes (e.g. where the operands have different numbers of bits). For the multiplier arrays that are unused, it is possible to insert ‘null’ data (e.g. with leading zeros in the case of a positive number) or to pad the input operands so that those multiplier circuits have no effect on the outcome. This, however, causes unnecessary processing to be performed in order to produce output bits that have no effect on the final result. This therefore increases power consumption of the data processing apparatus. Another possibility is to provide an ‘enable’ signal to each multiplier array so that multiplier arrays can be individually enabled or disabled. In particular, each AND operation illustrated in
The present technique provides a different clock signal to different multiplier arrays 310, 320, 330, 340. Typically, each logic gate in such a data processing apparatus receives a clock signal in order to control the input and output of data. This prevents or inhibits incoming data to a logic gate from overwriting outgoing data of the logic gate. By using different clock signals for each of the multiplier arrays 310, 320, 330, 340, it is possible to ‘freeze’ some of the multiplier arrays. Meanwhile, since the logic gates already receive a clock signal, this using or freezing can be achieved without the addition of extra logic gates. Thus, processing of individual multiplier arrays can be inhibited (saving power) without a significant increase in the number of logic gates (which consume circuit space and power).
Each of the multiplier arrays 410, 420 produces a set of partial products as previously described. These are passed to CSA networks 430, 440 and then to an (optional) regular adder 450, 460 to produce a set of intermediate products. These intermediate products are then added together by addition circuitry 470 (which may comprise further CSAs and non-CSAs). The regular adders 450, 460 are optional in the sense that the intermediate products could be kept in redundant representation and combined as part of the addition circuitry 470. Furthermore, as previously discussed, the CSAs are optional and all of the partial products could be directly added together by the addition circuitry 470.
By controlling the clock signal provided to each of the multiplier arrays 410, 420, the multiplier arrays 410, 420 can be individually rendered inert, thereby dramatically reducing their power consumption without significant increase to the circuit logic. In this way, the first multiplier array 410 can be used to perform a multiplication of one size, or the second multiplier array 420 can be used to perform a multiplication of a different size, or both multiplier arrays 410, 420 can be used together to perform a multiplication of a third (combined) size. Where only a single multiplier array 410, 420 is used, the addition circuitry 470 can be bypassed, since the intermediate products represent the final product.
By virtue of this configuration, it is possible to activate only the smallest multiplier array(s) that are needed to complete a particular multiplication. Consequently, the entire set of multiplier arrays 410, 420 need not always be activated thereby saving power. At the same time, there is less chance of individual cells of the multiplier arrays 410, 420 going unused—which would be the case if only a single large multiplier array was provided. Additionally, the multiplier arrays 410, 420 can be made to cooperate and therefore circuit space (and power) is not wasted by having one multiplier array for each individual possible multiplication that might occur. In addition, by activating or deactivating the multiplier arrays 410, 420 by using clock signals, it is not necessary to add significant additional logic to the circuitry and so power consumption is not increased in order to implement a technique that can reduce power consumption.
In the above example, it is assumed that the addition circuitry 470 is clocked according to the higher frequency of either the first and second clock signals. Other techniques, that cause the addition circuitry to operate with whichever intermediate products have been provided, may also be possible.
Other configurations of multiplier arrays are possible.
Thus, this circuitry can be used to efficiently perform multiplications on 8-bit integers to produce integer dot products (IDPs), single-precision floating point numbers, and half-precision floating point numbers. The actual arrangement of the circuits shown in
One of the 8×8 multiplier arrays 660 is a sub-multipliers of one of the 11×11 bit multipliers 650. This provides an example of data-gating. In particular, this is a single 11×11 multiplier 650 that can be forced to act as an 8×8 multiplier array by means of a control signal to some of the cells in the 11×11 multiplier array 650. Although this could be less efficient when carried out for the entire set of multiplier arrays, when it is performed in such a limited manner (to cause an 11×11 multiplier array 650 to act as an 8×8 multiplier array 660) the number of extra logic gates required, and the number of logic gates that would actually be deactivated is relatively small and so the cost is also comparatively small.
The storage of the resulting intermediate products is shown at the bottom of
The outputs are then compressed and combined (e.g. added) in separate adder circuits 810, 820. One addition circuit 820 is used for performing the integer dot product operation using the four 8×8 multipliers (36 bits are provided, in order to enable an 8×8 multiply-accumulate operation to proceed on four 8×8 multiplications). This circuitry 820 is activated via an IDPvalid signal that indicates whether an integer dot product is being performed. When signalled, the bits that are taken as inputs into at least some of the 8×8 multipliers are different to the bits taken in when the IDPvalid signal is not asserted. For instance, when asserted, one 8×8 multiplier circuit 730 might take the first 8 bits of op1 and the first 8 bits of op2 and a second 8×8 multiplier circuit 740 might take the second 8 bits of op1 and the second 8 bits of op2 and the third 8×8 multiplier circuit might take a third 8 bits of opt and a third 8 bits of op2 and so on. In this way, completely independent sets of bits can be provided to each of the 8×8 multipliers when a multiply-accumulate operation is being performed. Similarly for the 11×11 multipliers 710, 780 when half precision multiplication is performed. This same signal can be used to enable additional input flops that enable the op1 and op2 to be extended from 24-bit to 32-bit). A separate addition circuit 810 is used to perform other additions such as the single precision determination. This circuitry is enabled or disabled via an SPvalid signal that indicates whether a single point precision addition is being performed. Where half precision operations are being performed, neither of the addition circuits 810, 820 is required and the high and low bits output by the multiplexers 710, 780(770) can simply be merged. The addition circuits 810, 820 could therefore be bypassed, or they could be added to intermediate products of 0 via the addition circuit 810.
This variant 900 uses less circuit space than the configuration 800 shown in
Concepts described herein may be embodied in computer-readable code for fabrication of an apparatus that embodies the described concepts. For example, the computer-readable code can be used at one or more stages of a semiconductor design and fabrication process, including an electronic design automation (EDA) stage, to fabricate an integrated circuit comprising the apparatus embodying the concepts. The above computer-readable code may additionally or alternatively enable the definition, modelling, simulation, verification and/or testing of an apparatus embodying the concepts described herein.
For example, the computer-readable code for fabrication of an apparatus embodying the concepts described herein can be embodied in code defining a hardware description language (HDL) representation of the concepts. For example, the code may define a register-transfer-level (RTL) abstraction of one or more logic circuits for defining an apparatus embodying the concepts. The code may be define a HDL representation of the one or more logic circuits embodying the apparatus in Verilog, SystemVerilog, Chisel, or VHDL (Very High-Speed Integrated Circuit Hardware Description Language) as well as intermediate representations such as FIRRTL. Computer-readable code may provide definitions embodying the concept using system-level modelling languages such as SystemC and SystemVerilog or other behavioural representations of the concepts that can be interpreted by a computer to enable simulation, functional and/or formal verification, and testing of the concepts.
Additionally or alternatively, the computer-readable code may define a low-level description of integrated circuit components that embody concepts described herein, such as one or more netlists or integrated circuit layout definitions, including representations such as GDSII. The one or more netlists or other computer-readable representation of integrated circuit components may be generated by applying one or more logic synthesis processes to an RTL representation to generate definitions for use in fabrication of an apparatus embodying the invention. Alternatively or additionally, the one or more logic synthesis processes can generate from the computer-readable code a bitstream to be loaded into a field programmable gate array (FPGA) to configure the FPGA to embody the described concepts. The FPGA may be deployed for the purposes of verification and test of the concepts prior to fabrication in an integrated circuit or the FPGA may be deployed in a product directly.
The computer-readable code may comprise a mix of code representations for fabrication of an apparatus, for example including a mix of one or more of an RTL representation, a netlist representation, or another computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus embodying the invention. Alternatively or additionally, the concept may be defined in a combination of a computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus and computer-readable code defining instructions which are to be executed by the defined apparatus once fabricated.
Such computer-readable code can be disposed in any known transitory computer-readable medium (such as wired or wireless transmission of code over a network) or non-transitory computer-readable medium such as semiconductor, magnetic disk, or optical disc. An integrated circuit fabricated using the computer-readable code may comprise components such as one or more of a central processing unit, graphics processing unit, neural processing unit, digital signal processor or other components that individually or collectively embody the concept.
In the present application, the words “configured to . . . ” are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a “configuration” means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. “Configured to” does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.
Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes, additions and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims. For example, various combinations of the features of the dependent claims could be made with the features of the independent claims without departing from the scope of the present invention.
Other aspects and features of the invention are set out in the following numbered clauses: