The present disclosure relates generally to integrated circuit (IC) devices such as processors, application specific integrated circuits (ASICs), and programmable logic devices (PLDs) that include a hardened multiplier circuit with area- and/or power-efficient partial product generation circuitry.
This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present disclosure, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it may be understood that these statements are to be read in this light, and not as admissions of prior art.
Integrated circuits are ubiquitous in modern electronics. Many integrated circuit operations involve multiplying two values to obtain a product using a multiplier circuit. Artificial intelligence (AI) in particular involves so many multiplier instances that there may be millions of multiplier circuits or more per device. Thus, multiplier circuits may consume significant area and energy resources of an integrated circuit.
Various aspects of this disclosure may be better understood upon reading the following detailed description and upon reference to the drawings in which:
One or more specific embodiments will be described below. In an effort to provide a concise description of these embodiments, not all features of an actual implementation are described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
When introducing elements of various embodiments of the present disclosure, the articles “a,” “an,” and “the” are intended to mean that there are one or more of the elements. The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements. Additionally, it should be understood that references to “one embodiment” or “an embodiment” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features.
This disclosure relates to efficient multiplier circuitry that may be used in any suitable integrated circuit that performs an operation that multiplies two values. By way of example, the multiplier circuit may be included in a processor (e.g., a central processing unit (CPU) or a graphics processing unit (GPU)), an application specific integrated circuit (ASIC) (e.g., a specialized artificial intelligence (AI) integrated circuit), or a programmable logic device (PLD) (e.g., in a digital signal processing (DSP) block of a field programmable gate array (FPGA) integrated circuit). The improved multiplier circuitry of this disclosure may increase performance/power/area (PPA) for digital multiplication by up to 20% over previous circuits. Indeed, the efficient multiplier circuit of this disclosure has been validated over thousands of experiments on recent process nodes. Experiments show that the results are particularly applicable to relatively lower precision (8 to 12 bit operands) multipliers used for AI and at the speed ranges (500 MHz to 2 GHz) at which those chips often run.
With the foregoing in mind,
In a configuration mode of the integrated circuit device 12, a designer may use an electronic device 13 (e.g., a computer) to implement high-level designs (e.g., a system user design) using design software 14, such as a version of INTEL® QUARTUS® by INTEL CORPORATION. The electronic device 13 may use the design software 14 and a compiler 16 to convert the high-level program into a lower-level description (e.g., a configuration program, a bitstream). The compiler 16 may provide machine-readable instructions representative of the high-level program to a host 18 and the integrated circuit device 12. The host 18 may receive a host program 22 that may be implemented by the kernel programs 20. To implement the host program 22, the host 18 may communicate instructions from the host program 22 to the integrated circuit device 12 via a communications link 24 that may be, for example, direct memory access (DMA) communications or peripheral component interconnect express (PCIe) communications. In some embodiments, the kernel programs 20 and the host 18 may enable configuration of circuits including programmable logic blocks 110 and digital signal processing (DSP) blocks 120 on the integrated circuit device 12. The programmable logic blocks 110 may include circuitry and/or other logic elements and may be configurable to implement a variety of functions in combination with digital signal processing (DSP) blocks 120.
The DSP blocks 120 may include circuitry to carry out operations that involve multiplication, such as multiply-accumulate operations or matrix-matrix or matrix-vector multiplication. The integrated circuit device 12 may include many (e.g., hundreds or thousands) of the DSP blocks 120. Additionally, the DSP blocks 120 may be communicatively coupled to one another such that data output from one DSP block 120 may be provided to other DSP blocks 120. A DSP block 120 may include hardened arithmetic circuitry that is purpose-built for performing arithmetic operations. The hardened arithmetic circuitry of the DSP blocks 120 may be contrasted with arithmetic circuitry that may be constructed in soft logic in the programmable logic circuitry (e.g., the programmable logic blocks 110). While circuitry for performing the same arithmetic operations may be programmed into the programmable logic circuitry (e.g., the programmable logic blocks 110), doing so may take up significantly more die area, may consume more power, and/or may consume more processing time.
The designer may use the design software 14 to generate and/or to specify a low-level program, such as the low-level hardware description languages described above. Further, in some embodiments, the system 10 may be implemented without a separate host program 22. Thus, embodiments described herein are intended to be illustrative and not limiting.
An illustrative embodiment of a programmable integrated circuit device 12 such as a programmable logic device (PLD) that may be configured to implement a circuit design is shown in
Programmable logic circuitry of the integrated circuit device 12 may include programmable memory elements, which are sometimes referred to as configuration random access memory (CRAM). The memory elements may be loaded with configuration data (also called programming data or configuration bitstream) using input-output elements (IOEs) 102. Once loaded, the memory elements each provide a corresponding static control signal that controls the operation of an associated functional block (e.g., LABs 110, DSP 120, RAM 130, or input-output elements 102).
In one scenario, the outputs of the loaded memory elements are applied to the gates of metal-oxide-semiconductor transistors in a functional block to turn certain transistors on or off and thereby configure the logic in the functional block including the routing paths. Programmable logic circuit elements that may be controlled in this way include parts of multiplexers (e.g., multiplexers used for forming routing paths in interconnect circuits), look-up tables, logic arrays, AND, OR, NAND, and NOR logic gates, pass gates, etc.
The memory elements may use any suitable volatile and/or non-volatile memory structures such as random-access-memory (RAM) cells, fuses, antifuses, programmable read-only-memory memory cells, mask-programmed and laser-programmed structures, combinations of these structures, etc. Because the memory elements are loaded with configuration data during programming, the memory elements are sometimes referred to as configuration memory, configuration random-access memory (CRAM), or programmable memory elements. Programmable logic device (PLD) 100 may be configured to implement a custom circuit design. For example, the configuration RAM may be programmed such that LABs 110, DSP 120, and RAM 130, programmable interconnect circuitry (i.e., vertical channels 140 and horizontal channels 150), and the input-output elements 102 form the circuit design implementation.
In addition, the programmable logic device may have input-output elements (IOEs) 102 for driving signals off of the integrated circuit device 12 and for receiving signals from other devices. Input-output elements 102 may include parallel input-output circuitry, serial data transceiver circuitry, differential receiver and transmitter circuitry, or other circuitry used to connect one integrated circuit to another integrated circuit.
The integrated circuit device 12 may also include programmable interconnect circuitry in the form of vertical routing channels 140 (i.e., interconnects formed along a vertical axis of the integrated circuit 100) and horizontal routing channels 150 (i.e., interconnects formed along a horizontal axis of the integrated circuit 100), each routing channel including at least one track to route at least one wire. If desired, the interconnect circuitry may include pipeline elements, and the contents stored in these pipeline elements may be accessed during operation. For example, a programming circuit may provide read and write access to a pipeline element.
Note that routing topologies other than the topology of the interconnect circuitry depicted in
The integrated circuit device 12 may be programmed to perform a wide variety of operations. Indeed, many system designs that may be programmed into the integrated circuit device 12 may leverage the efficiency of performing arithmetic operations using the DSP blocks 120.
To generate each partial product, the decoding circuitry 184 may generate a code based on the value of a set of bits of the multiplier (B) 182. Direct radix 4 decoding looks at only two bits of the multiplier (B) 182 at a time, with four possible codings: 0 (binary “00”), 1 (binary “01”), 2 (binary “10”), or 3 (binary “11”), all of which may be expressed in positive terms. The decoding circuitry 184 provides the code to partial product multiplexers (MUXes) 186 that may select the partial product as 0 for code 0 (binary “00”), A (the multiplicand (A) 180) for code 1 (binary “01”), 2A (double the multiplicand (A) 180) for code 2 (binary “10”), or 3A (triple the multiplicand (A) 180) for code 3 (binary “11”). Shifter circuitry (A, 2A) 188 may provide the value A by passing the multiplicand (A) 180 or the value 2A by doubling the multiplicand (A) 180 using any suitable circuitry (e.g., by shifting and adding a 0 constant on the least significant bit). Tripler circuitry (3A) 190 may provide the value 3A by tripling the multiplicand (A) 180. The partial products may be added together using any suitable addition circuitry. Here, this is shown to be accomplished by compressor and carry propagate adder circuitry 192, the result of which is a product 194 representing the value A multiplied by the value B.
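As a behavioral illustration only, the direct radix 4 flow described above may be modeled in software. The following Python sketch assumes unsigned operands; the function name `direct_radix4_multiply` and the `width` parameter are hypothetical and do not appear in this disclosure, and the 3A multiple is modeled with ordinary integer arithmetic rather than with the tripler circuitry (3A) 190.

```python
def direct_radix4_multiply(a: int, b: int, width: int = 8) -> int:
    """Multiply unsigned a by unsigned b (b < 2**width), two multiplier bits at a time."""
    if a < 0 or not 0 <= b < (1 << width):
        raise ValueError("operands must be unsigned and b must fit in width bits")

    # Candidate partial products: 0, A, 2A (a shift), and 3A (a tripler in hardware).
    multiples = (0, a, a << 1, 3 * a)

    product = 0
    for j in range((width + 1) // 2):        # one partial product per pair of bits
        code = (b >> (2 * j)) & 0b11         # direct radix 4 code: 0, 1, 2, or 3
        partial = multiples[code]            # models the partial product MUX
        product += partial << (2 * j)        # weight 4**j; models compression and addition
    return product


print(direct_radix4_multiply(13, 11))  # 143
```

In this model every selected multiple is non-negative, so no correction bits are needed when the partial products are summed.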
By using direct radix 4 decoding, the decoding circuitry 184 may be simplified and implemented using comparators rather than the more complex logic that is often used for other decoding schemes. Indeed, consider the difference between direct radix 4 decoding and Booth's radix 4 decoding. With Booth's radix 4 decoding, two bits, and a third which is the most significant bit of the previous dibit, are used to encode the multiplicand into one of 8 cases, which are 0, 1, 2, −2, −1, 0 (the last of which can sometimes be coded as negative 0). Two of the cases code +1, and two code −1. By comparison, direct radix 4 may avoid the use of negative coding. Indeed, one complication of the Booth's radix 4 coding method is that generating the negative numbers requires several potential ‘1’ bits to be added into the result (usually in the compression tree) to create the 2's complements of the multiplicand, and this is true even for unsigned operands.
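The contrast between the two radix 4 schemes may be illustrated with a short Python sketch. It is a software model under stated assumptions (unsigned multiplier, hypothetical function names, digits listed least significant first) and is not a description of the decoding circuitry 184 itself.

```python
# Booth radix 4 table indexed by (b[2j+1], b[2j], b[2j-1]): the 8 cases named above.
BOOTH_RADIX4 = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_radix4_digits(b: int, width: int = 8) -> list[int]:
    """Booth radix 4 digits of an unsigned width-bit multiplier, least significant first."""
    padded = b << 1                    # implicit 0 to the right of bit 0
    n_digits = width // 2 + 1          # one extra digit so unsigned values recode exactly
    return [BOOTH_RADIX4[(padded >> (2 * j)) & 0b111] for j in range(n_digits)]


def direct_radix4_digits(b: int, width: int = 8) -> list[int]:
    """Direct radix 4 digits (0..3) of an unsigned multiplier, least significant first."""
    return [(b >> (2 * j)) & 0b11 for j in range((width + 1) // 2)]


b = 11
print(booth_radix4_digits(b))    # [-1, -1, 1, 0, 0]
print(direct_radix4_digits(b))   # [3, 2, 0, 0]
```

Both digit lists represent the same multiplier when each digit is weighted by 4 to the power of its position; only the Booth list contains negative digits, which is what leads to the 2's complement correction bits discussed above.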
Direct radix 4 decoding is also simpler than Booth's radix 8. Booth's radix 8 takes 4 bits (3 bits plus the most significant bit of the previous 3 bits) to encode 16 combinations, which result in 9 multiplicand multiples (+1, +2, +3, +4, −4, −3, −2, −1, and 0, where 0 is sometimes also coded as −0). The difficult aspect of Booth's radix 8 coding is that a +3 multiplication value may be calculated as a partial product, and the 2's complement value must be completed before compression. This is the equivalent of A+2A, which is a deep logic structure. The advantage of Booth's radix 8, however, is that the number of partial products is decreased relative to Booth's radix 4. This may or may not result in a smaller and/or faster multiplier.
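The count of 16 input combinations and 9 distinct multiples may be checked with a small Python model. The sketch below assumes an unsigned multiplier and uses hypothetical function names; it is offered only to illustrate the coding, not as an implementation of any circuit.

```python
def booth_radix8_digit(window: int) -> int:
    """Decode a 4-bit window (b[3j+2], b[3j+1], b[3j], b[3j-1]) into a digit in -4..4."""
    prev = window & 1              # most significant bit of the previous 3-bit group
    b0 = (window >> 1) & 1
    b1 = (window >> 2) & 1
    b2 = (window >> 3) & 1
    return prev + b0 + 2 * b1 - 4 * b2


def booth_radix8_digits(b: int, width: int = 8) -> list[int]:
    """Booth radix 8 digits of an unsigned width-bit multiplier, least significant first."""
    padded = b << 1                # implicit 0 to the right of bit 0
    n_digits = width // 3 + 1
    return [booth_radix8_digit((padded >> (3 * j)) & 0b1111) for j in range(n_digits)]


distinct = sorted({booth_radix8_digit(w) for w in range(16)})
print(len(distinct), distinct)     # 9 [-4, -3, -2, -1, 0, 1, 2, 3, 4]
print(booth_radix8_digits(11))     # [3, 1, 0]
```

An 8-bit multiplier needs three Booth radix 8 digits in this model, compared with four direct radix 4 digits, but the digit set includes ±3 and negative values.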
One reason that the multiplier circuit architecture of
Moreover, because the multiplier circuit 160 only uses positive multiples of the multiplicand (A) 180 for partial products, there is no overhead with the adding of a vector of ‘1’s to turn the negation of them into 2's complement values. This applies even in the case that the multiplicand (A) 180 is signed, in which case the partial product is just sign extended. There are therefore two especially notable aspects of the multiplier circuit 160 shown in
Indeed, rather than compute a triple of the multiplicand (A) 180 as a single vector, the tripler circuitry (3A) 190 may include circuitry to generate the triple of the multiplicand value that sums at least two different vectors. For example, rather than use a single large carry prefix tree, the tripler circuitry (3A) 190 may use a carry prefix tree that is decomposed into multiple smaller carry prefix trees, as will be discussed further below. This makes the tripler circuitry (3A) 190 a much smaller, faster, and more energy-efficient circuit. This effect may be scaled to any suitable size of input multiplicand value. What is more, the tripler circuitry (3A) 190 may operate using unsigned values, meaning that sign values may be simply passed along. Indeed, for simple (sign, value) representations of signed numbers, the tripler circuitry (3A) 190 may have exactly the same architecture, except that the sign may be passed alongside the value. For instance, if the input is (−, 5), then the output is (−, 15); if the input is (+, 8), then the output is (+, 24). If working in 2's-complement representation, then the value may be entered into the tripler circuitry (3A) 190 “as is,” with no changes.
For ease of explanation, this disclosure will focus on an 8-bit example (
This truth table may be used to build the various components of the tripler circuitry (3A) 190A. A gate-level example of the tripler circuitry (3A) 190A appears in
A gate-level example of the tripler circuitry (3A) 190B appears in
The 8-bit or 9-bit cases are among the likely use cases, as relatively smaller precisions are often used for AI applications, where it is likely that millions of multipliers may be on a single device. More generally, however, the tripler circuitry (3A) 190 may be designed to accommodate any suitable bit depth. The following explanation will provide a step-by-step description of how the tripler circuitry (3A) 190 may be designed. First, consider that one common way to envision multiplication by 3 is 3*A=(2+1)*A=2*A+A. This approach is usually used to build a 3*A circuit: multiplication by 2 is simply a shift, and adder circuitry is readily available. To construct the tripler circuitry (3A) 190, however, another approach is used: it is essentially a schoolbook multiplication scheme for a radix 4 numeric system.
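For reference, the conventional 2*A+A construction may be sketched in Python as a shift followed by a bit-level adder. The ripple-carry adder used here is an assumption chosen for brevity (this disclosure does not specify which adder a conventional tripler would use), and the function name and `width` parameter are hypothetical.

```python
def triple_shift_add(a: int, width: int = 8) -> int:
    """Conventional tripler: 3*A = (A << 1) + A, summed with a ripple-carry adder."""
    if not 0 <= a < (1 << width):
        raise ValueError("a must be an unsigned width-bit value")

    x, y = a << 1, a                   # 2*A is only a shift; A passes through
    carry, result = 0, 0
    for i in range(width + 2):         # 3*A needs at most width + 2 bits
        xi, yi = (x >> i) & 1, (y >> i) & 1
        result |= (xi ^ yi ^ carry) << i                 # sum bit
        carry = (xi & yi) | (carry & (xi ^ yi))          # carry into the next bit
    return result


print(triple_shift_add(85))  # 255
```

In this model the carry may have to travel across the full operand width, which is the general-purpose behavior that the radix 4 construction described next is intended to avoid.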
Start by representing A in the radix 4 numeric system. Without loss of generality, it may be assumed that A contains an even number of binary digits (a leading zero can be added if the number of digits is odd), and this number is 2*N:
$$A = \sum_{i=0}^{2N-1} a_i \cdot 2^i$$
In this formula, $a_i$ are the individual binary digits (bits) of A.
Group the binary digits in pairs: $aa_j = a_{2j+1} \cdot 2 + a_{2j}$. Then A may be written in the radix 4 numeric system in the following way:
$$A = \sum_{j=0}^{N-1} aa_j \cdot 4^j$$
The formula for 3*A is then:
$$3 \cdot A = \sum_{j=0}^{N-1} (3 \cdot aa_j) \cdot 4^j$$
Denote individual digit multiplication values as follows: $yy_j = (3 \cdot aa_j)/4$, $xx_j = (3 \cdot aa_j) \,\%\, 4$. In this formula (and below), a division operation is integer division and the percent operator is integer modulo reduction. The expression 3*A may be rewritten using this notation:
$$3 \cdot A = \sum_{j=0}^{N-1} (4 \cdot yy_j + xx_j) \cdot 4^j$$
Denote:
$$Y = \sum_{j=0}^{N-1} yy_j \cdot 4^j, \qquad X = \sum_{j=0}^{N-1} xx_j \cdot 4^j$$
This produces:
$$3 \cdot A = 4 \cdot Y + X$$
Here is an example of 3·2926=8778 computation for 3·A=4·Y+X formula in radix 4 numeric system (we are using binary representation for radix 4 digits: 3 is 11, 2 is 10, 1 is 01, 0 is 00):
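The 3·2926=8778 example may also be checked numerically. The following Python sketch splits each radix 4 digit of A into its low part (the digits of X) and its high part (the digits of Y) and confirms the 3·A=4·Y+X relationship; the helper names are hypothetical and are introduced only for this illustration.

```python
def radix4_digits(value: int, n_digits: int) -> list[int]:
    """Radix 4 digits of value, least significant digit first."""
    return [(value >> (2 * j)) & 0b11 for j in range(n_digits)]


def split_triple(a: int, n_digits: int) -> tuple[int, int]:
    """Return (X, Y), built from the digits (3*aa_j) % 4 and (3*aa_j) // 4 respectively."""
    x = y = 0
    for j, aa in enumerate(radix4_digits(a, n_digits)):
        x += ((3 * aa) % 4) << (2 * j)
        y += ((3 * aa) // 4) << (2 * j)
    return x, y


def as_dibits(value: int, n_digits: int) -> str:
    """Format value as binary-coded radix 4 digits, most significant digit first."""
    return " ".join(f"{d:02b}" for d in reversed(radix4_digits(value, n_digits)))


A = 2926                               # 12 bits, so N = 6 radix 4 digits
X, Y = split_triple(A, 6)
assert 4 * Y + X == 3 * A == 8778

print("A   =", as_dibits(A, 6))
print("X   =", as_dibits(X, 6))
print("Y   =", as_dibits(Y, 6))
print("3*A =", as_dibits(3 * A, 7))    # the product needs one more radix 4 digit
```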
A notable observation is that X and Y are not arbitrary numbers: they have patterns of digits which “snap” together nicely. This allows construction of addition circuitry for particular numbers which is simpler than a general-purpose adder. The i-th radix 4 digit of the result may be denoted as $zz_i$, in which case:
Denote the sum of 4·Y value and radix 4 adder 4·Y+X carry digit string (“carry” line of Table 2) as 4·C. Then the example can be rewritten as follows:
Note that the formula $3 \cdot A = 4 \cdot C + X$ is not correct (because the carry would be counted twice). It is true, however, that $zz_i = (xx_i + cc_{i-1}) \,\%\, 4$, where $cc_i$ are the radix 4 digits of C. The resulting radix 4 digits may be calculated in a manner similar to schoolbook multiplication algorithms.
For example, a truth table may be written for 3·aa0:
For the first result digit, the following formula is true: $zz_0 = xx_0 = (3 \cdot aa_0) \,\%\, 4$. For the first radix 4 carry digit, the following formula is true: $cc_0 = yy_0 = (3 \cdot aa_0)/4$. Note that the value $cc_0$ cannot be binary “11” (base ten “3”).
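The first-digit relationships may be enumerated directly. The short Python sketch below lists, for each value of the least significant radix 4 digit, the carry digit and the result digit given by the formulas above; the function name is hypothetical.

```python
def first_digit_table() -> list[tuple[int, int, int]]:
    """Rows of (aa0, cc0, zz0) with cc0 = (3*aa0) // 4 and zz0 = (3*aa0) % 4."""
    return [(aa0, (3 * aa0) // 4, (3 * aa0) % 4) for aa0 in range(4)]


print("aa0  cc0  zz0")
for aa0, cc0, zz0 in first_digit_table():
    print(f" {aa0:02b}   {cc0:02b}   {zz0:02b}")
```

The carry digit takes only the values 0, 1, and 2 in this enumeration, consistent with the note that it cannot be binary “11.”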
Next, compute zz1. The following computations produce the second digit:
Thus, $zz_1 = (xx_1 + cc_0) \,\%\, 4$ and $cc_1 = yy_1 + (xx_1 + cc_0)/4$. It may be proven that $cc_1$ cannot be binary “11.” It is known that $cc_0 \le 2$ and $3 \cdot aa_1 \le 9$, so $cc_0 + 3 \cdot aa_1 \le 11$. Because $4 \cdot yy_1 + xx_1 = 3 \cdot aa_1$, it follows that $cc_1 = (cc_0 + 3 \cdot aa_1)/4 < 12/4 = 3$, or, more concisely, $cc_1 < 3$. By induction, this means that for all i, $0 \le i \le N$, $cc_i < 3$. The truth table for the second digit therefore may be written as follows:
Because the value $cc_i$ cannot be binary “11,” the truth table for every digit from the second digit onward is the same.
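The digit-by-digit recurrence may be exercised in software to confirm both the result and the bound on the carry digit. The Python sketch below is a sequential model with a hypothetical function name; it processes one radix 4 digit at a time and is not intended to reflect the parallel structure of the tripler circuitry (3A) 190.

```python
def triple_digit_serial(a: int, n_digits: int = 4) -> int:
    """Triple an unsigned value one radix 4 digit at a time, least significant first."""
    cc = 0                                   # no carry into the first digit
    result = 0
    for i in range(n_digits):
        aa = (a >> (2 * i)) & 0b11           # i-th radix 4 digit of A
        xx, yy = (3 * aa) % 4, (3 * aa) // 4
        zz = (xx + cc) % 4                   # result digit uses the previous carry
        cc = yy + (xx + cc) // 4             # carry digit out of this position
        assert cc < 3                        # the carry digit is never binary 11
        result |= zz << (2 * i)
    return result | (cc << (2 * n_digits))   # the last carry is the top result digit


print(triple_digit_serial(2926, n_digits=6))  # 8778
```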
Notably, the carry digits $cc_i$ can be computed directly, without an intermediate $yy_i$ and $xx_i$ computation step. Both $xx_i$ and $yy_i$ are functions of $aa_i$, so $cc_i$ can be computed as a function of $aa_i$ and $cc_{i-1}$. Here is a truth table for $cc_i$ (it is three columns of the previous truth table, placed in a different order):
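The dependence of the carry digit on only the current digit and the previous carry may be tabulated with a few lines of Python. The sketch below is illustrative; the names `carry_digit` and `carry_table` are hypothetical, and the closed form used follows from the relationship 4·yy + xx = 3·aa stated earlier.

```python
def carry_digit(aa: int, cc_prev: int) -> int:
    """Radix 4 carry out of a digit position, computed directly from aa and cc_prev."""
    return (3 * aa + cc_prev) // 4


# cc_prev = 3 is omitted because the carry digit cannot be binary 11.
carry_table = {(aa, cc_prev): carry_digit(aa, cc_prev)
               for aa in range(4) for cc_prev in range(3)}

for (aa, cc_prev), cc in carry_table.items():
    xx, yy = (3 * aa) % 4, (3 * aa) // 4
    assert cc == yy + (xx + cc_prev) // 4    # agrees with the two-step computation
    assert cc < 3
    print(f"aa={aa:02b}  cc_prev={cc_prev:02b}  ->  cc={cc:02b}")
```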
Denote the first binary digit of $cc_i$ as $c1_i$ and the second binary digit as $c2_i$: $c2_i = cc_i/2$, $c1_i = cc_i \,\%\, 2$. Denote the sum of the binary digits of $cc_i$ as $c3_i$: $c3_i = c2_i + c1_i$. If the values $c3_i$ and $c2_i$ are available, then $c1_i$ can be computed as $c1_i = c3_i - c2_i$, and $cc_i$ can be computed as $cc_i = c1_i + 2 \cdot c2_i$. Denote $a2_i = aa_i/2$, $a1_i = aa_i \,\%\, 2$. A truth table for $c3_i$ and $c2_i$ may be expressed as follows:
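Boolean algebra formulas for $c2_i$ and $c3_i$ based on the truth table may be written:
$$c2_i = (a2_i \,\&\, a1_i) \mid (a2_i \,\&\, c2_{i-1}), \qquad c3_i = a2_i \mid (a1_i \,\&\, c3_{i-1})$$
With these values denoted as follows:
$$g2_i = a2_i \,\&\, a1_i, \quad p2_i = a2_i, \quad g3_i = a2_i, \quad p3_i = a1_i$$
Values for $c2_i$ and $c3_i$ may be rewritten:
$$c2_i = g2_i \mid (p2_i \,\&\, c2_{i-1}), \qquad c3_i = g3_i \mid (p3_i \,\&\, c3_{i-1})$$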
Note that these formulas are identical in form to the formulas that appear in carry-lookahead adder construction. This means that a parallel prefix tree construction can be used for the $c2_i$ and $c3_i$ computation (the only difference is in the initial generate and propagate signal formulas). This gives the $cc_i$ carry digit computations and the resulting digit computation $zz_i = (xx_i + cc_{i-1}) \,\%\, 4$. In this way, the tripler circuitry of this disclosure may be built for any desired bit depth.
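The construction may be modeled end to end in Python as two independent prefix scans followed by a per-digit combination. The sketch below is one possible software model under stated assumptions: the scan is written in a Kogge-Stone style purely for illustration, the generate and propagate signals are taken as g2 = a2 & a1, p2 = a2, g3 = a2, and p3 = a1, and all function names are hypothetical.

```python
def prefix_carries(g: list[int], p: list[int]) -> list[int]:
    """Kogge-Stone style scan solving c[i] = g[i] | (p[i] & c[i-1]) with c[-1] = 0."""
    g, p = g[:], p[:]
    dist = 1
    while dist < len(g):
        new_g, new_p = g[:], p[:]
        for i in range(dist, len(g)):
            new_g[i] = g[i] | (p[i] & g[i - dist])   # merge with the span below
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2
    return g


def triple_prefix(a: int, n_digits: int = 4) -> int:
    """Triple an unsigned value using two parallel prefix scans over radix 4 digits."""
    a1 = [(a >> (2 * i)) & 1 for i in range(n_digits)]        # low bit of each digit
    a2 = [(a >> (2 * i + 1)) & 1 for i in range(n_digits)]    # high bit of each digit

    c2 = prefix_carries([x & y for x, y in zip(a2, a1)], a2)  # g2 = a2 & a1, p2 = a2
    c3 = prefix_carries(a2, a1)                               # g3 = a2,      p3 = a1

    # cc = c1 + 2*c2 with c1 = c3 - c2, which simplifies to cc = c3 + c2.
    cc = [c3[i] + c2[i] for i in range(n_digits)]

    result = 0
    for i in range(n_digits):
        xx = (3 * (a1[i] + 2 * a2[i])) % 4
        cc_prev = cc[i - 1] if i > 0 else 0
        result |= ((xx + cc_prev) % 4) << (2 * i)             # result digit zz_i
    return result | (cc[-1] << (2 * n_digits))                # final carry digit on top


print(triple_prefix(255))  # 765
```

The two scans share the same structure and differ only in their inputs, so in this model they can be evaluated independently of each other before the digits are combined.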
The multiplier circuit of this disclosure thus may use direct radix 4 and/or tripler circuitry to efficiently multiply two values. A flowchart 240 shown in
An integrated circuit including the multiplier circuitry of this disclosure may be a component included in a data processing system, such as a data processing system 500, shown in
The data processing system 500 may be part of a data center that processes a variety of different requests. For instance, the data processing system 500 may receive a data processing request via the network interface 506 to perform encryption, decryption, machine learning, video processing, voice recognition, image recognition, data compression, database search ranking, bioinformatics, network security pattern identification, spatial navigation, digital signal processing, or other specialized tasks.
While the embodiments set forth in the present disclosure may be susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and have been described in detail herein. However, it should be understood that the disclosure is not intended to be limited to the particular forms disclosed. The disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the following appended claims.
The techniques presented and claimed herein are referenced and applied to material objects and concrete examples of a practical nature that demonstrably improve the present technical field and, as such, are not abstract, intangible or purely theoretical. Further, if any claims appended to the end of this specification contain one or more elements designated as “means for [perform]ing [a function] . . . ” or “step for [perform]ing [a function] . . . ”, it is intended that such elements are to be interpreted under 35 U.S.C. 112(f). However, for any claims containing elements designated in any other manner, it is intended that such elements are not to be interpreted under 35 U.S.C. 112(f).
EXAMPLE EMBODIMENT 1. Multiplier circuitry to multiply a multiplicand value with a multiplier value, the multiplier circuitry comprising:
EXAMPLE EMBODIMENT 2. The multiplier circuitry of example embodiment 1, wherein the circuitry to generate the triple of the multiplicand value that sums the two different vectors sums the two different vectors in at least two parallel summations.
EXAMPLE EMBODIMENT 3. The multiplier circuitry of example embodiment 2, wherein the circuitry to generate the triple of the multiplicand value that sums the two different vectors comprises at least two parallel carry prefix trees.
EXAMPLE EMBODIMENT 4. The multiplier circuitry of example embodiment 1, wherein at least one of the different vectors is coded in a way that a carry value cannot propagate beyond a specified number of bits during the summation of the different vectors.
EXAMPLE EMBODIMENT 5. The multiplier circuitry of example embodiment 1, wherein the decoding scheme comprises direct radix 4 coding.
EXAMPLE EMBODIMENT 6. The multiplier circuitry of example embodiment 1, wherein the decoding scheme comprises Booth's radix 8 coding.
EXAMPLE EMBODIMENT 7. The multiplier circuitry of example embodiment 1, wherein the decoding circuitry consists essentially of comparators.
EXAMPLE EMBODIMENT 8. The multiplier circuitry of example embodiment 1, wherein the carry prefix trees have an identical architecture but receive different inputs.
EXAMPLE EMBODIMENT 9. The multiplier circuitry of example embodiment 1, wherein the carry prefix trees have a Sklansky, Brent-Kung, Kogge-Stone, Ladner-Fischer, or Han-Carlson architecture.
EXAMPLE EMBODIMENT 10. One or more tangible, non-transitory, machine-readable media comprising instructions that, when executed by a data processing system, cause the data processing system to generate a circuit design comprising tripler circuitry to triple a multiplicand value, the tripler circuitry comprising:
EXAMPLE EMBODIMENT 11. The one or more tangible, non-transitory, machine-readable media of example embodiment 10, wherein the first prefix tree has an identical structure to the second prefix tree.
EXAMPLE EMBODIMENT 12. The one or more tangible, non-transitory, machine-readable media of example embodiment 10, wherein the first prefix tree or the second prefix tree, or both, have a Sklansky, Brent-Kung, Kogge-Stone, Ladner-Fischer, or Han-Carlson architecture.
EXAMPLE EMBODIMENT 13. The one or more tangible, non-transitory, machine-readable media of example embodiment 10, wherein the combining circuitry is to produce the third set of carry bits based on the first set of carry bits, the second set of carry bits, and a most significant bit of the multiplicand.
EXAMPLE EMBODIMENT 14. An integrated circuit device having multiplier circuitry to multiply a multiplicand value with a multiplier value, the multiplier circuitry comprising:
EXAMPLE EMBODIMENT 15. The integrated circuit device of example embodiment 14, wherein the direct radix 4 decoding circuitry is to produce a binary 00, 01, 10, or 11 code based on the portion of the multiplier value and the partial product multiplexing circuitry is to select as a partial product a value of 0 when the code is 00, the single multiplicand value when the value is 01, the double multiplicand value when the value is 10, and the triple multiplicand value when the value is 11.
EXAMPLE EMBODIMENT 16. The integrated circuit device of example embodiment 14, wherein the direct radix 4 decoding circuitry comprises a plurality of comparators to produce the one of the four codes based on the portion of the multiplier value.
EXAMPLE EMBODIMENT 17. The integrated circuit device of example embodiment 14, comprising tripler circuitry to generate the triple multiplicand value, wherein the tripler circuitry comprises a plurality of parallel carry prefix trees.
EXAMPLE EMBODIMENT 18. The integrated circuit device of example embodiment 17, wherein at least two of the plurality of parallel carry prefix trees are identical.
EXAMPLE EMBODIMENT 19. The integrated circuit device of example embodiment 14, wherein the integrated circuit device comprises a field programmable gate array (FPGA) having hardened digital signal processing (DSP) circuitry that comprises the multiplier circuitry.
EXAMPLE EMBODIMENT 20. The integrated circuit device of example embodiment 14, wherein the integrated circuit device comprises a central processing unit (CPU) or graphics processing unit (GPU).