The present invention relates generally to binary multiplier circuits, and, in particular implementations, mixed sign binary multiplier methods and systems for multiplying signed or unsigned binary numbers.
With the advent of technologies that require complex and fast data processing, digital systems have evolved to perform large numbers of mathematical operations in less time. Since multiplication is a basic form of mathematical operation performed by a digital system, different algorithms, such as Booth's multiplication algorithm, have been developed to reduce the time for a digital system to multiply two numbers.
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
Corresponding numerals and symbols in the different figures generally refer to corresponding parts unless otherwise indicated. The figures are drawn to clearly illustrate the relevant aspects of the implementations and are not necessarily drawn to scale. The edges of features drawn in the figures do not necessarily indicate the termination of the extent of the feature.
The making and using of various implementations are discussed in detail below. It should be appreciated, however, that the various implementations described herein are applicable in a wide variety of specific contexts. The specific implementations discussed are merely illustrative of specific ways to make and use various implementations, and should not be construed in a limited scope. Throughout the discussion herein, unless otherwise specified, the same or similar reference numerals or labels in different figures refer to the same or similar component or signal.
Reference to “an implementation,” “one implementation,” “an embodiment,” or “one embodiment” in the framework of the present description is intended to indicate that a particular configuration, structure, or characteristic described in relation to the implementation/embodiment is included in at least one implementation/embodiment. Hence, phrases such as “in one implementation” or “in one embodiment” that may be present in one or more points of the present description do not necessarily refer to one and the same implementation/embodiment. Moreover, particular conformations, structures, or characteristics may be combined in any adequate way in one or more implementations/embodiments. The references used herein are provided merely for convenience and hence do not define the extent of protection or the scope of the implementations/embodiments.
Digital multipliers are fundamental elements in computer arithmetic, pivotal in a myriad of applications from scientific computing to signal processing. Over time, various techniques have emerged for efficiently executing binary multiplication. Methods like long multiplication, Booth's algorithm, and tree multipliers such as Wallace and Dadda trees have been developed to improve computational efficiency. There are also hardware-specific solutions like lookup tables and shift operations that offer advantages in specific scenarios.
Despite these advancements, existing multipliers have several drawbacks. They often suffer from latency issues that can hinder performance in real-time and high-speed applications. They also can be power-intensive, making them less ideal for battery-operated and portable systems. The hardware implementations for some of these efficient algorithms can consume significant silicon area, increasing manufacturing costs. Complexity and scalability are additional concerns, as many existing algorithms and hardware solutions are not easily adaptable to handle different operand sizes or a large number of bits without affecting performance or power consumption negatively.
As an example, for 4-bit multiplication, the 4-bit number (INT4) can be a signed or unsigned number. Depending on the signedness, the 4 bits represent a certain number. Table 1 below shows the possible number values depending on signedness for an INT4 number.
Thus, depending on its signedness, a 4-bit number can represent values in a range from −8 to 15. Such numbers are frequent in computation requirements for artificial intelligence (AI) engines, central processing units (CPUs), graphics processing units (GPUs), etc. Although the examples primarily discussed in this disclosure are 4-bit integers (INT4), the present disclosure is not limited thereto, and other number types are within the scope of this disclosure. For example, multiplying INT8, INT16, 8-bit floating point (FP8), FP16, or the like are within the scope of this disclosure.
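By way of a non-limiting illustration, the two readings of an INT4 bit pattern may be sketched in Python as follows. The function name int4_value is hypothetical and is not part of the disclosure, and two's-complement encoding of signed values is assumed.

```python
def int4_value(bits, signed):
    """Interpret a 4-bit pattern as unsigned or as two's complement (assumed encoding)."""
    if not 0 <= bits <= 0xF:
        raise ValueError("expected a 4-bit pattern")
    if signed and bits & 0x8:      # MSB set: negative under two's complement
        return bits - 16
    return bits

# Every value reachable by either reading of the same four bits.
unsigned_values = [int4_value(b, signed=False) for b in range(16)]
signed_values = [int4_value(b, signed=True) for b in range(16)]
```

Taken together, the two lists span the range from −8 to 15 noted above.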
In conventional systems, multipliers are optimized for a specific signedness of inputs; e.g., there are multipliers that can multiply two signed numbers, or two unsigned numbers. Such multipliers do not work for other combinations of signedness. A 4-bit unsigned multiplier does not work for 4-bit signed inputs. Similarly, a 4-bit signed multiplier does not work for 4-bit unsigned inputs.
In conventional systems, for multiplication with combinations of signedness (e.g., a signed number and an unsigned number), the INT4 numbers are first converted to 5-bit signed numbers. 5-bit signed numbers have a range from −16 to 15, which covers the required range of −8 to 15. The multiplication is then performed using 5-bit signed multiplication. Thus, the multipliers of the conventional systems have to be made at least one bit larger to account for the mixed sign multiplication.
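The conventional widening approach described above may be sketched as follows. This is an illustrative model only; the helper names to_signed5, signed5_value, and conventional_mixed_multiply are hypothetical, and two's-complement encoding is assumed.

```python
def to_signed5(bits4, signed):
    """Widen a 4-bit pattern to a 5-bit two's-complement pattern."""
    # Signed inputs are sign-extended; unsigned inputs are zero-extended.
    top = (bits4 >> 3) & 1 if signed else 0
    return (top << 4) | (bits4 & 0xF)

def signed5_value(bits5):
    """Numeric value of a 5-bit two's-complement pattern (-16..15)."""
    return bits5 - 32 if bits5 & 0x10 else bits5

def conventional_mixed_multiply(x4, x_signed, y4, y_signed):
    """Multiply two INT4 patterns by way of a 5-bit signed multiplication."""
    return signed5_value(to_signed5(x4, x_signed)) * signed5_value(to_signed5(y4, y_signed))

# Signed -8 (pattern 1000) times unsigned 15 (pattern 1111).
example = conventional_mixed_multiply(0b1000, True, 0b1111, False)
```

The extra bit per operand is what the sketch pays for handling both signedness options with a single signed multiplier.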
As another example, in conventional systems, a single multiplier is designed to be used for multiple different datatypes. For example, the same multiplier may need to function for both FP16 and INT8 datatypes. For FP16, an unsigned 11×11 multiplier is needed. For INT8, a signed 8×8 multiplier is needed. Signed 8×8 multiplication is not a subset of unsigned 11×11 multiplication. Thus, in conventional systems, a signed 12×12 multiplier would be required to handle both FP16 and INT8 datatypes, which adds area and power and diminishes the speed of the circuit.
In contrast to the conventional systems and methods previously discussed, the described methods and systems have the benefit that this multiplier circuit is more efficient in power, performance, and area compared to other known circuits. For example, the disclosed implementations of the mixed sign multiplier, when compared to current multipliers, reduce the required area by 11.6% while using 28% less power. Moreover, the disclosed multiplier is able to handle inputs of any signedness (both signed, both unsigned, or one signed and one unsigned).
Further, using the example above for a single multiplier to handle FP16 and INT8 datatypes, with the disclosed implementations, a single mixed sign 11×11 multiplier can be used—as signed 8×8 multiplication is a subset of this multiplier.
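One possible reading of the preceding statement may be sketched as follows, under the assumption that INT8 operands are sign-extended to 11 bits before entering a square 11×11 mixed sign datapath. The function names are hypothetical, and the sketch covers only the case in which both operands have the same width.

```python
def mixed_sign_multiply(x, y, signedness_x, signedness_y, n):
    """Square n-by-n mixed sign multiply; returns the 2n-bit product pattern."""
    total = 0
    for j in range(n):
        row = x if (y >> j) & 1 else 0           # AND of every x bit with y_j
        row ^= signedness_x << (n - 1)           # XOR the MSB with signedness of X
        if j == n - 1 and signedness_y:
            row ^= (1 << n) - 1                  # XOR the last row with signedness of Y
        total += row << j
    # Correction factor fields for the square case (OR, AND, XOR of signednesses).
    total += (signedness_x | signedness_y) << (2 * n - 1)
    total += (signedness_x & signedness_y) << n
    total += (signedness_x ^ signedness_y) << (n - 1)
    return total & ((1 << (2 * n)) - 1)

def as_signed(bits, width):
    return bits - (1 << width) if bits >> (width - 1) else bits

def int8_product(a, b):
    """Signed 8x8 product on the 11x11 datapath (sign extension is an assumption)."""
    a11 = a & 0x7FF    # Python's & on a negative int yields its two's-complement pattern
    b11 = b & 0x7FF
    return as_signed(mixed_sign_multiply(a11, b11, 1, 1, 11), 22)

def significand_product(a, b):
    """Unsigned 11x11 product, e.g. two FP16 significands including the hidden bit."""
    return mixed_sign_multiply(a, b, 0, 0, 11)
```

In this sketch the same loop serves both datatypes, and only the two signedness inputs differ between the calls.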
In addition, the correction factor of each multiplication need not be added to the partial products. A pre-computed correction factor for several multiplications can be added where the multiplication results are being summed. This further reduces the circuit size, improving power, performance, and area.
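As a hedged illustration of the deferred correction described above, the following sketch accumulates uncorrected products and applies the correction once at the end. It assumes that every multiplication in the sum uses the same pair of signedness values, that both operands are N bits wide, and that the correction factor is read as a signed 2N-bit number; the function names are hypothetical.

```python
def uncorrected_product(x, y, signedness_x, signedness_y, n=4):
    """Sum of the modified partial products, without any correction factor."""
    total = 0
    for j in range(n):
        row = x if (y >> j) & 1 else 0
        row ^= signedness_x << (n - 1)
        if j == n - 1 and signedness_y:
            row ^= (1 << n) - 1
        total += row << j
    return total

def signed_correction(signedness_x, signedness_y, n=4):
    """Correction factor bit mask, interpreted as a signed 2n-bit number."""
    cf = (((signedness_x | signedness_y) << (2 * n - 1))
          | ((signedness_x & signedness_y) << n)
          | ((signedness_x ^ signedness_y) << (n - 1)))
    return cf - (1 << (2 * n)) if cf >> (2 * n - 1) else cf

def dot_product(xs, ys, signedness_x, signedness_y, n=4):
    """Accumulate uncorrected products, then add one pre-computed correction."""
    partial = sum(uncorrected_product(x, y, signedness_x, signedness_y, n)
                  for x, y in zip(xs, ys))
    return partial + len(xs) * signed_correction(signedness_x, signedness_y, n)

# Signed X values (-8, 7, -1) against unsigned Y values (15, 1, 8).
result = dot_product([0b1000, 0b0111, 0b1111], [0b1111, 0b0001, 0b1000], 1, 0)
```

Here the per-multiplication adder never sees the correction bits; a single multiple of the correction is added where the results are summed.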
The disclosed implementations of the mixed sign multiplier may be used in an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a CPU, a GPU, a general-purpose computing on GPU (GPGPU), an AI engine, a digital signal processing (DSP) system, a neural network, or the like.
Implementations provided below describe various methods and systems for a mixed sign multiplier, and in particular implementations, methods and systems for a mixed sign multiplier with low power and area requirements with improved performance. The following description describes the implementations.
A first N-bit partial product 21 is obtained by performing logical AND operations with the zeroth bit or least significant bit 120 of the second operand 12 and each of the N bits of the first operand 11. For example, as shown, the first partial product 21 is an N-bit number (in this example, a 4-bit number) having a least significant bit or zeroth bit 210 obtained by performing a logical AND operation with the zeroth bit 120 of the second operand 12 and the zeroth bit 110 of the first operand 11 as inputs. The first partial product 21 has a first bit 211 obtained by performing a logical AND operation with the zeroth bit 120 of the second operand 12 and the first bit 111 of the first operand 11 as inputs. A second bit 212 of the first partial product 21 is obtained by performing a logical AND operation with the zeroth bit 120 of the second operand 12 and the second bit 112 of the first operand 11 as inputs. An MSB or third bit 213 of the first partial product 21 is obtained by performing a logical AND operation with the zeroth bit 120 of the second operand 12 and the third bit 113 of the first operand 11 as inputs.
Each of the N-bit partial products may be obtained by a similar process as described above with respect to the first partial product 21, with each of the subsequent partial products being left-shifted one bit position with respect to the prior partial product, as will be described in further detail below.
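The bitwise AND construction and the one-position left shift per partial product may be sketched as follows. The function names are illustrative only, and bits are stored least significant bit first.

```python
def partial_products(x, y, n=4):
    """Return n rows; row j holds the n bits of x AND-ed with bit j of y (LSB first)."""
    x_bits = [(x >> i) & 1 for i in range(n)]
    y_bits = [(y >> j) & 1 for j in range(n)]
    return [[x_bit & y_bit for x_bit in x_bits] for y_bit in y_bits]

def row_value(row):
    """Numeric value of one row of bits stored LSB first."""
    return sum(bit << i for i, bit in enumerate(row))

rows = partial_products(0b1011, 0b0101)
# Row j is left-shifted by j positions before the rows are added.
unsigned_product = sum(row_value(row) << j for j, row in enumerate(rows))
```

Without any further modification, the shifted rows sum to the ordinary unsigned product of the two operands.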
A second N-bit partial product 22 may be obtained by performing logical AND operations with the N-3 bit (in this example, the first bit 121) of the second operand 12 and each of the N bits of the first operand 11. In the example shown in
The third N-bit partial product 23 may be obtained by performing logical AND operations with the N-2 bit (in this example, the second bit 122) of the second operand 12 and each of the N bits of the first operand 11. In the example shown in
The fourth N-bit partial product 24 may be obtained by performing logical AND operations with the N-1 bit (in this example, the third bit 123) of the second operand 12 and each of the N bits of the first operand 11. In the example shown in
Further, an exclusive or (XOR) function 51 is applied to each of the third bits of the partial products 21-24. In some implementations, the XOR function 51 applies an XOR logic function to each of the third bits of the partial products 21-24 (213, 223, 233, 243) with the signedness of the first operand (X). In addition, an XOR function 52 is applied to each of the bits of the fourth partial product 24. In some implementations, the XOR function 52 applies an XOR logic function to each of the bits of the fourth partial product 24 (240, 241, 242, 243) with the signedness of the second operand (Y).
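The two XOR modifications may be sketched as follows, assuming a signedness value of 1 for a signed operand and 0 for an unsigned operand, and assuming that the XOR function 52 acts on the most significant bit of the last partial product after the XOR function 51 has been applied to it. The function name modify_rows and the sample rows are illustrative.

```python
def modify_rows(rows, signedness_x, signedness_y):
    """Apply XOR function 51 to every MSB, then XOR function 52 to the last row."""
    n = len(rows[0])
    out = [row[:] for row in rows]                    # rows hold bits LSB first
    for row in out:
        row[n - 1] ^= signedness_x                    # XOR function 51
    out[-1] = [bit ^ signedness_y for bit in out[-1]] # XOR function 52
    return out

# Unmodified partial products for X = 1011 and Y = 1101 (bits listed LSB first).
rows = [[1, 1, 0, 1], [0, 0, 0, 0], [1, 1, 0, 1], [1, 1, 0, 1]]
both_signed = modify_rows(rows, 1, 1)
both_unsigned = modify_rows(rows, 0, 0)
```

When both signedness values are 0, every XOR passes its input through unchanged, so the rows reduce to those of a plain unsigned multiplier.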
Further, as shown in
More generally, for an N×M multiplication, the correction factor is (N+M) bits. It is given by Correction Factor={OR of signednesses, N-2 zeroes, AND of signednesses, XOR of signednesses, M-1 zeroes}.
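The bit mask may be assembled as in the following sketch, which assumes that the fields between the braces are listed from the most significant bit to the least significant bit. The function name correction_factor is hypothetical.

```python
def correction_factor(signedness_x, signedness_y, n=4, m=4):
    """Build the (n + m)-bit correction factor from the two signedness bits."""
    fields = ([signedness_x | signedness_y]            # OR of signednesses
              + [0] * (n - 2)                          # N-2 zeroes
              + [signedness_x & signedness_y,          # AND of signednesses
                 signedness_x ^ signedness_y]          # XOR of signednesses
              + [0] * (m - 1))                         # M-1 zeroes
    assert len(fields) == n + m
    value = 0
    for bit in fields:                                 # fields are MSB first
        value = (value << 1) | bit
    return value

table = {(sx, sy): format(correction_factor(sx, sy), "08b")
         for sx in (0, 1) for sy in (0, 1)}
```

For the 4×4 case this yields 00000000 when both operands are unsigned, 10001000 when exactly one operand is signed, and 10010000 when both operands are signed.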
In the illustrated example of
The product 40 is obtained by performing a binary addition operation of all the partial products 21-24 and the correction factor 30. The partial products 21-24 are unsigned numbers while the correction factor is a signed number. The product 40 is unsigned if both operands 11, 12 are unsigned. If either of the operands 11, 12 is a signed number, the product 40 is also a signed number.
The zeroth bit 400 of the product 40 is obtained by performing a binary addition operation with the zeroth bit 210 of the partial product 21 and the correction factor bit 300 as inputs. Alternatively, the zeroth bit 400 of the product 40 may simply be provided as having the value of the zeroth bit 210 of the partial product 21, as no other partial product bit occupies this bit position and the correction factor bit 300 is zero.
The first bit 401 of the product 40 is obtained by performing a binary addition of the partial product bits 211 and 220 and the correction factor bit 301, and any carry value from the lower bit values, which is zero in this case. The lower bit of the sum is the value of the respective bit of the product 40 and the higher bit of the sum is forwarded as carry for next addition.
The second bit 402 of the product 40 is obtained by performing a binary addition of the partial product bits 212, 221, 230, and the correction factor bit 302, and any carry value from the lower bit values. The lower bit of the sum is the value of the respective bit of the product 40 and the higher bit of the sum is forwarded as carry for next addition.
The third bit 403 of the product 40 is obtained by performing a binary addition of the partial product bits 213, 222, 231, 240, and the correction factor bit 303, and any carry value from the lower bit values. The lower bit of the sum is the value of the respective bit of the product 40 and the higher bit of the sum is forwarded as carry for next addition.
The fourth bit 404 of the product 40 is obtained by performing a binary addition of the partial product bits 223, 232, 241, and the correction factor bit 304, and any carry value from the lower bit values. The lower bit of the sum is the value of the respective bit of the product 40 and the higher bit of the sum is forwarded as carry for next addition.
The fifth bit 405 of the product 40 is obtained by performing a binary addition of the partial product bits 233, 242, and the correction factor bit 305, and any carry value from the lower bit values. The lower bit of the sum is the value of the respective bit of the product 40 and the higher bit of the sum is forwarded as carry for next addition.
The sixth bit 406 of the product 40 is obtained by performing a binary addition of the partial product bit 243, the correction factor bit 306, and any carry value from the lower bit values. The lower bit of the sum is the value of the respective bit of the product 40 and the higher bit of the sum is forwarded as carry for next addition.
The seventh bit 407 of the product 40 is obtained by performing a binary addition of the correction factor bit 307 and any carry value from the lower bit values.
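The column-by-column addition described above may be modeled as in the following sketch for the 4×4 case. The function names are hypothetical, two's-complement encoding is assumed, and the carry is kept as an integer because a column with several inputs can produce a carry larger than one.

```python
def mixed_sign_multiply_4x4(x, y, signedness_x, signedness_y):
    """Return the 8-bit product pattern by summing one bit column at a time."""
    n = 4
    rows = [[((x >> i) & 1) & ((y >> j) & 1) for i in range(n)] for j in range(n)]
    for row in rows:
        row[n - 1] ^= signedness_x                   # modify every MSB
    rows[n - 1] = [bit ^ signedness_y for bit in rows[n - 1]]
    # Correction factor bits 300..307, listed LSB first.
    cf = [0, 0, 0, signedness_x ^ signedness_y,
          signedness_x & signedness_y, 0, 0, signedness_x | signedness_y]
    product, carry = 0, 0
    for k in range(2 * n):
        column = cf[k] + carry
        for j, row in enumerate(rows):               # row j is shifted left by j
            i = k - j
            if 0 <= i < n:
                column += row[i]
        product |= (column & 1) << k                 # low bit becomes product bit k
        carry = column >> 1                          # remainder carries upward
    return product                                   # carry out of bit 7 is dropped

def interpret(bits, width, signed):
    """Read a bit pattern as unsigned or as two's complement."""
    return bits - (1 << width) if signed and bits >> (width - 1) else bits

# Signed -8 times signed -8.
example = interpret(mixed_sign_multiply_4x4(0b1000, 0b1000, 1, 1), 8, True)
```

The resulting pattern is read as a signed number whenever either operand is signed, and as an unsigned number otherwise.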
The product 40 is a 2N bit value, where N is the number of bits of the operands 11, 12. Thus, as shown in
As shown in
The first partial product circuit 111 includes a plurality of logic gates configured to generate the first partial product 121. In particular, a first AND logic gate 151 receives as input the zeroth bit x0 of the first operand X and the zeroth bit y0 of the second operand Y, and generates as output a zeroth bit p0 of the first partial product 121. A second AND logic gate 152 receives as input the first bit x1 of the first operand X and the zeroth bit y0 of the second operand Y, and generates as output a first bit p1 of the first partial product 121. A third AND logic gate 153 receives as input the second bit x2 of the first operand X and the zeroth bit y0 of the second operand Y, and generates as output a second bit p2 of the first partial product 121. A fourth AND logic gate 154 receives as input the third bit x3 of the first operand X and the zeroth bit y0 of the second operand Y. The output of the fourth logic gate 154 is provided as input to an XOR logic gate 155. The XOR logic gate 155 further receives as input the signedness of the first operand X (signedness_x), and generates as output a third bit p3 of the first partial product 121.
The first partial product 121 may be stored in any suitable computer-readable storage medium, including in a wire, register, computer-readable memory, or the like.
The second partial product circuit 112 includes a first AND logic gate 161 that receives as input the zeroth bit x0 of the first operand X and the first bit y1 of the second operand Y, and generates as output a zeroth bit p0 of the second partial product 122. A second AND logic gate 162 receives as input the first bit x1 of the first operand X and the first bit y1 of the second operand Y, and generates as output a first bit p1 of the second partial product 122. A third AND logic gate 163 receives as input the second bit x2 of the first operand X and the first bit y1 of the second operand Y, and generates as output a second bit p2 of the second partial product 122. A fourth AND logic gate 164 receives as input the third bit x3 of the first operand X and the first bit y1 of the second operand Y. The output of the fourth AND logic gate 164 is provided as input to an XOR logic gate 165. The XOR logic gate 165 further receives as input the signedness of the first operand X (signedness_x), and generates as output a third bit p3 of the second partial product 122.
The second partial product 122 may be stored in any suitable computer-readable storage medium, including in a wire, register, computer-readable memory, or the like. Moreover, the second partial product 122 may be left-shifted by one position with respect to the first partial product 121, and in some implementations, the second partial product 122 may be stored in the left-shifted position with respect to the first partial product 121 or may be left-shifted during later operations to add the partial products.
The third partial product circuit 113 includes a first AND logic gate 171 that receives as input the zeroth bit x0 of the first operand X and the second bit y2 of the second operand Y, and generates as output a zeroth bit p0 of the third partial product 123. A second AND logic gate 172 receives as input the first bit x1 of the first operand X and the second bit y2 of the second operand Y, and generates as output a first bit p1 of the third partial product 123. A third AND logic gate 173 receives as input the second bit x2 of the first operand X and the second bit y2 of the second operand Y, and generates as output a second bit p2 of the third partial product 123. A fourth AND logic gate 174 receives as input the third bit x3 of the first operand X and the second bit y2 of the second operand Y. The output of the fourth logic gate 174 is provided as input to an XOR logic gate 175. The XOR logic gate 175 further receives as input the signedness of the first operand X (signedness_x), and generates as output a third bit p3 of the third partial product 123.
The third partial product 123 may be stored in any suitable computer-readable storage medium, including in a wire, register, computer-readable memory, or the like. The third partial product 123 may be left-shifted by one position with respect to the second partial product 122, and in some implementations, the third partial product 123 may be stored in the left-shifted position with respect to the second partial product 122 or may be left-shifted during later operations to add the partial products.
As shown, the first partial product circuit 111, the second partial product circuit 112, and the third partial product circuit 113 may be the same or substantially the same circuit arrangements. However, each of these circuits operates with different bits of operand input. More particularly, the first partial product circuit 111 multiplies the zeroth bit y0 of the second operand Y with each of the bits of the first operand X, the second partial product circuit 112 multiplies the first bit y1 of the second operand Y with each of the bits of the first operand X, and the third partial product circuit 113 multiplies the second bit y2 of the second operand Y with each of the bits of the first operand X.
Similarly, the fourth partial product circuit 114 multiplies the third bit y3 of the second operand Y with each of the bits of the first operand X; however, the circuitry of the fourth partial product circuit 114 is different than that of the first, second, and third partial product circuits 111, 112, 113, as the fourth partial product circuit 114 further performs a logical XOR operation with the signedness of the second operand Y (signedness_y) for each of the output multiplication bits.
As shown, the fourth partial product circuit 114 includes a first AND logic gate 181 that receives as input the zeroth bit x0 of the first operand X and the third bit y3 of the second operand Y. A second AND logic gate 182 receives as input the first bit x1 of the first operand X and the third bit y3 of the second operand Y. A third AND logic gate 183 receives as input the second bit x2 of the first operand X and the third bit y3 of the second operand Y. A fourth AND logic gate 184 receives as input the third bit x3 of the first operand X and the third bit y3 of the second operand Y. The output of the fourth logic gate 184 is provided as input to a first XOR logic gate 185. The first XOR logic gate 185 further receives as input the signedness of the first operand X (signedness_x).
The fourth partial product circuit 114 further includes a plurality of XOR logic gates configured to perform logical XOR operations using the signedness of the second operand Y (signedness_y) and the outputs of each of the first, second, and third AND logic gates and the XOR logic gate 185 as inputs. More particularly, the fourth partial product circuit 114 includes a second XOR logic gate 186 that generates a zeroth bit p0 of the fourth partial product 124 based on the output of the first AND logic gate 181 and the signedness of the second operand Y (signedness_y). A third XOR logic gate 187 generates a first bit p1 of the fourth partial product 124 based on the output of the second AND logic gate 182 and the signedness of the second operand Y (signedness_y). A fourth XOR logic gate 188 generates a second bit p2 of the fourth partial product 124 based on the output of the third AND logic gate 183 and the signedness of the second operand Y (signedness_y). A fifth XOR logic gate 189 generates a third bit p3 of the fourth partial product 124 based on the output of the first XOR logic gate 185 and the signedness of the second operand Y (signedness_y).
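A gate-level software model of the partial product circuits may be sketched as follows. The helper names AND, XOR, partial_product_circuit, and fourth_partial_product_circuit are hypothetical, and the reference numerals appear only in comments to tie the sketch to the description.

```python
def AND(a, b):
    return a & b

def XOR(a, b):
    return a ^ b

def partial_product_circuit(x_bits, y_bit, signedness_x):
    """Model of circuits 111-113: four AND gates plus one XOR gate on the MSB."""
    p0 = AND(x_bits[0], y_bit)
    p1 = AND(x_bits[1], y_bit)
    p2 = AND(x_bits[2], y_bit)
    p3 = XOR(AND(x_bits[3], y_bit), signedness_x)
    return [p0, p1, p2, p3]

def fourth_partial_product_circuit(x_bits, y3, signedness_x, signedness_y):
    """Model of circuit 114: the same front end followed by an XOR on every bit."""
    front = partial_product_circuit(x_bits, y3, signedness_x)   # gates 181-185
    return [XOR(bit, signedness_y) for bit in front]            # gates 186-189

x_bits = [1, 1, 0, 1]                     # X = 1011, listed x0..x3
row_3 = fourth_partial_product_circuit(x_bits, 1, 1, 1)
```

Reusing the shared front end inside the model of circuit 114 mirrors the description, in which the circuit 114 differs from the circuits 111-113 only by its additional XOR stage.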
The fourth partial product 124 may be stored in any suitable computer-readable storage medium, including in a wire, register, computer-readable memory, or the like. The fourth partial product 124 may be left-shifted by one position with respect to the third partial product 123, and in some implementations, the fourth partial product 124 may be stored in the left-shifted position with respect to the third partial product 123 or may be left-shifted during later operations to add the partial products.
The binary multiplication circuit 100 further includes the correction factor circuit 125 that generates the correction factor 130 based on the signedness of the first and second operands. The correction factor circuit 125 includes an XOR logic gate 191 configured to generate a zeroth bit cf0 of the correction factor 130 based on the signedness of the first and second operands (signedness_x, signedness_y). An AND logic gate 192 is configured to generate a first bit cf1 of the correction factor 130 based on the signedness of the first and second operands (signedness_x, signedness_y). An OR logic gate 193 is configured to generate a fourth bit cf4 of the correction factor 130 based on the signedness of the first and second operands (signedness_x, signedness_y).
Although two zeros are shown between the first bit cf1 and the fourth bit cf4 of the correction factor 130, it will be readily appreciated that any number of zeros may be positioned here depending on a number of bits of the first and second operands X, Y. Moreover, zeros may be inserted in other positions of the correction factor 130 in some implementations. More particularly, as previously described with respect to
The final multiplication result is obtained by binary addition of the first, second, third, and fourth partial products 121, 122, 123, 124 and the correction factor 130.
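A word-level model of the complete 4×4 circuit may be sketched as follows. It assumes that the five correction factor bits cf0 through cf4 are aligned with product bit 3, which is consistent with the three trailing zeroes of the 8-bit correction factor; the function name circuit_100_model is hypothetical.

```python
def circuit_100_model(x, y, signedness_x, signedness_y):
    """Word-level model of the 4x4 circuit; returns the 8-bit product pattern."""
    rows = []
    for j in range(4):
        row = x if (y >> j) & 1 else 0     # four AND gates sharing bit y_j
        row ^= signedness_x << 3           # XOR gate on bit p3
        rows.append(row)
    if signedness_y:
        rows[3] ^= 0xF                     # extra XOR stage of the fourth circuit
    cf0 = signedness_x ^ signedness_y      # XOR logic gate 191
    cf1 = signedness_x & signedness_y      # AND logic gate 192
    cf4 = signedness_x | signedness_y      # OR logic gate 193
    correction_130 = (cf4 << 4) | (cf1 << 1) | cf0
    # Assumed alignment: cf0 sits at product bit 3.
    total = sum(row << j for j, row in enumerate(rows)) + (correction_130 << 3)
    return total & 0xFF

# Signed -8 (1000) times unsigned 15 (1111): pattern 10001000, i.e. -120 when read as signed.
pattern = circuit_100_model(0b1000, 0b1111, 1, 0)
```

An exhaustive comparison against ordinary integer multiplication over all 4-bit operand pairs and all four signedness combinations is a convenient sanity check for a model of this kind.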
Each of the inputs received by the various logic gates described herein may be digital signals that are transmitted, for example, via electrical wires or lines as illustrated.
The circuit 100 illustrated in
In various alternatives, the processor 402 includes a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core can be a CPU or a GPU. In various alternatives, the memory 404 is located on the same die as the processor 402, or is located separately from the processor 402. The memory 404 includes a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 406 includes a fixed or removable storage, for example, a hard disk drive, a solid-state drive, an optical disk, or a flash drive. The input devices 408 include, without limitation, a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 410 include, without limitation, a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 412 communicates with the processor 402 and the input devices 408, and permits the processor 402 to receive input from the input devices 408. The output driver 414 communicates with the processor 402 and the output devices 410, and permits the processor 402 to send output to the output devices 410. It is noted that the input driver 412 and the output driver 414 are optional components, and that the device 400 will operate in the same manner if the input driver 412 and the output driver 414 are not present. The output driver 414 includes an accelerated processing device (“APD”) 416 which is coupled to a display device 418. The APD 416 accepts compute commands and graphics rendering commands from the processor 402, processes those compute and graphics rendering commands, and provides pixel output to the display device 418 for display. As described in further detail below, the APD 416 includes one or more parallel processing units to perform computations in accordance with a single-instruction-multiple-data (“SIMD”) paradigm. Thus, although various functionality is described herein as being performed by or in conjunction with the APD 416, in various alternatives, the functionality described as being performed by the APD 416 is additionally or alternatively performed by other computing devices having similar capabilities that are not driven by a host processor (e.g., processor 402) and that provide graphical output to a display device 418. For example, it is contemplated that any processing system that performs processing tasks in accordance with a SIMD paradigm can perform the functionality described herein. Alternatively, it is contemplated that computing systems that do not perform processing tasks in accordance with a SIMD paradigm can also perform the functionality described herein.
Referring to
Various example implementations are provided in the following. Other implementations may be understood from the entirety of the specification as well as the claims filed herein.
Implementations may achieve advantages. The described methods and systems have the benefit that this mixed sign multiplier is more efficient in power, performance, and area compared to known multipliers. For example, the disclosed implementations of the mixed sign multiplier, when compared to current multipliers, reduce the required area by 11.6% while using 28% less power. Moreover, the disclosed multiplier is able to handle inputs of any signedness (both signed, both unsigned, or one signed and one unsigned). For example, with implementations of the disclosed methods and systems, a single mixed sign 11×11 multiplier can handle the multiplication of both FP16 and INT8 datatypes, as signed 8×8 multiplication is a subset of this multiplier.
An implementation may include a method for performing a binary multiplication including receiving a first operand having N bits at an input interface of a digital multiplier circuit in the computing system, receiving a second operand having M bits at the input interface of the digital multiplier circuit, generating, by the digital multiplier circuit, M N-bit partial products by performing a logical AND operation with each of the N bits of the first operand and each of the M bits of the second operand, and generating first modified M partial products by modifying, by the digital multiplier circuit, Nth bits of the M partial products, generating second modified M partial products by modifying, by the digital multiplier circuit, the Mth first modified partial product, generating, by the digital multiplier circuit, a correction factor having (N+M) bits, generating, by the digital multiplier circuit, a product by summing the second modified partial products and the correction factor, and outputting the product from an output interface of the digital multiplier circuit, the product having (N+M) bits.
Implementations may include one or more of the following features. The method where generating first modified M partial products by modifying, by the digital multiplier circuit, the Nth bits of the M partial products includes performing a logical XOR of each of the Nth bits of the M partial products with a signedness of the first operand to generate the first modified M partial products. The method where the signedness of the first operand is a bit value of ‘1’ when the first operand is a signed number, and wherein the signedness of the first operand is a bit value of ‘0’ when the first operand is an unsigned number. The method where generating second modified M partial products by modifying, by the digital multiplier circuit, the Mth first modified partial product includes performing a logical XOR of each of the N bits of the Mth first modified partial product with a signedness of the second operand to generate the second modified M partial products. The method where the signedness of the second operand is a bit value of ‘1’ when the second operand is a signed number, and wherein the signedness of the second operand is a bit value of ‘0’ when the second operand is an unsigned number. The method where receiving the first operand includes receiving a signed number or an unsigned number, and wherein receiving the second operand includes receiving a number having the other signedness option as the first operand. The method further including left-shifting each of the M partial products one bit position with respect to a prior partial product. The method where generating, by the digital multiplier circuit, a correction factor having (N+M) bits includes generating a bit mask of the (N+M) bits for the correction factor, the bit mask being generated according to Correction Factor={OR of signednesses of the first and second operands, N-2 zeroes, AND of signednesses of first and second operands, XOR of signednesses of the first and second operands, M-1 zeroes}. 
The method where the computing system includes an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a CPU, a GPU, a general-purpose computing on GPU (GPGPU), an AI engine, a digital signal processing (DSP) system, or a neural network.
An implementation may include a device for performing a binary multiplication, the device including a digital multiplier circuit configured to: receive a first operand having N bits at an input interface, receive a second operand having N bits at the input interface, generate N partial products by performing a logical AND operation with each of the N bits of the first operand and each of the N bits of the second operand, generate first modified N partial products by modifying Nth bits of the N partial products, generate second modified N partial products by modifying the Nth partial product, generate a correction factor having 2N bits, generate a product by summing the second modified partial products and the correction factor, and output the product from an output interface, the product having 2N bits.
Implementations may include one or more of the following features. The device where the digital multiplier circuit is further configured to perform a logical XOR of each of the Nth bits of the N partial products with a signedness of the first operand to generate the first modified N partial products. The device where the digital multiplier circuit is further configured to perform a logical XOR of each of the N bits of the Nth partial product with a signedness of the second operand to generate the second modified N partial products. The device where the digital multiplier circuit is further configured to receive the first operand being a signed number or an unsigned number, and the second operand being the other signedness option as the first operand. The device where the digital multiplier circuit is further configured to left-shift each of the N partial products by one bit position with respect to a prior partial product. The device where the digital multiplier circuit is further configured to generate a bit mask of the 2N bits for the correction factor, the bit mask being generated according to Correction Factor={OR of signednesses of the first and second operands, N-2 zeroes, AND of signednesses of first and second operands, XOR of signednesses of the first and second operands, N-1 zeroes}. The device where the device includes an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a CPU, a GPU, a general-purpose computing on GPU (GPGPU), an AI engine, a digital signal processing (DSP) system, or a neural network.
An implementation may include a binary multiplication circuit including: a plurality of partial product circuits configured to generate partial products of a first operand and a second operand, each of the partial product circuits including AND operators with each of the AND operators having inputs including a corresponding one of bits of the second operand and each of bits of the first operand; a first partial product modifier circuit including a plurality of XOR operators, each of the XOR operators of the first partial product modifier circuit having two inputs, each of first inputs being a signedness of the first operand, each of second inputs being an output of a most significant bit AND operator of a corresponding partial product circuit; a second partial product modifier circuit including a plurality of XOR operators, each of the XOR operators of the second partial product modifier circuit having two inputs, each of first inputs being a signedness of the second operand, one of the plurality of XOR operators of the second partial product modifier circuit having a second input of an output of an XOR operator of the first partial product modifier circuit associated with one of the plurality of partial product circuits, each of the remaining XOR operators of the second partial product modifier circuit having a second input of an output of a corresponding bit of the one of the plurality of partial product circuits; and a binary adder circuit configured to add outputs of the partial product circuits and the first and second partial product modifier circuits.
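The wiring of the two partial product modifier circuits may be illustrated at the bit level. The following Python sketch is offered as an illustration under stated assumptions and is not a definitive implementation. The function names `modified_partial_products` and `add_rows` are hypothetical, operands are represented as lists of bits with the least significant bit first, and the second partial product modifier circuit is assumed to act on the last partial product, consistent with the Nth partial product of the device implementation described above.

```python
def modified_partial_products(a, b, sa, sb):
    """Model the AND array and both XOR modifier circuits.

    a, b   : lists of bits (0 or 1), least significant bit first
    sa, sb : signedness bits of the first and second operand (assumed 1 = signed)
    Returns one row of bits per bit of the second operand.
    """
    msb = len(a) - 1
    last = len(b) - 1
    rows = []
    for i, b_bit in enumerate(b):
        # Partial product circuit: one AND operator per bit of the first operand.
        row = [a_bit & b_bit for a_bit in a]
        # First modifier: XOR the most significant AND output with sa.
        row[msb] ^= sa
        if i == last:
            # Second modifier: XOR every bit of this row with sb. For the most
            # significant bit, the XOR input is the first modifier's output,
            # so that bit is XORed with both sa and sb.
            row = [bit ^ sb for bit in row]
        rows.append(row)
    return rows


def add_rows(rows):
    """Binary adder model: sum the rows, each shifted one position further left."""
    total = 0
    for i, row in enumerate(rows):
        value = sum(bit << j for j, bit in enumerate(row))
        total += value << i
    return total
```

With both signedness inputs at 0, every XOR operator passes its other input through unchanged, and `add_rows` yields the ordinary unsigned product. With both signedness inputs at 1, the most significant bit of the last row is inverted twice and therefore equals the plain AND output.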
Implementations may include one or more of the following features. The binary multiplication circuit including a correction factor circuit configured to generate a bit mask of (N+M) bits, the bit mask being {OR of the signednesses of the first and second operands, N-2 zeroes, AND of the signednesses of the first and second operands, XOR of the signednesses of the first and second operands, M-1 zeroes}, where the binary adder circuit is configured to add the outputs of the partial product circuits, the first and second partial product modifier circuits, and the correction factor circuit, and where the first operand has N bits and the second operand has M bits. The binary multiplication circuit where the first operand includes a signed number or an unsigned number, and where the second operand includes a number of the opposite signedness to the first operand. The binary multiplication circuit where the binary adder circuit further includes a left-shifting circuit configured to left-shift each of the partial products one bit position with respect to a prior partial product.
While this invention has been described with reference to illustrative implementations, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative implementations, as well as other implementations of the invention, will be apparent to persons skilled in the art upon reference to the description. It is therefore intended that the appended claims encompass any such modifications or implementations.