The present invention relates to computing technology, and particularly to improvement to a fixed-value multiplier used by computing systems, where the improvement is achieved by using a field-programmable gate array (FPGA).
An FPGA is an integrated circuit designed to be configured by a customer or a designer after manufacturing. The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC). Computing a multiplication of two numbers is a common operation performed by various computing systems. Several computing systems require a multiplication in which one of the values is known or fixed, and the second is a dynamic input value.
According to one or more embodiments of the present invention, a method for multiplying two binary numbers includes configuring, in an integrated circuit, a plurality of lookup tables based on a known binary number (w). The lookup tables can be configured in three layers. The method further includes receiving, by the integrated circuit, an input binary number (d). The method further includes determining, by the integrated circuit, a multiplication result (p) of the known binary number w and the input binary number d by determining each bit (pi) from p using the lookup tables based on specific combination of bits from the known binary number w and from the input binary number d, wherein a notation jx represents the xth bit of j from the right, with bit j0 being the rightmost bit of j.
In one or more embodiments of the present invention, the known binary number w has a predetermined number of bits. For example, the known binary number is an 8-bit binary number. Further, in one or more embodiments of the present invention, the input binary number has a predetermined number of bits. For example, the input binary number is a 12-bit binary number.
In one or more embodiments of the present invention, the bits p5, p4, p3, p2, p1, and p0 of p are determined by a first circuit that includes a first layer of the lookup tables from the integrated circuit based on the bits d5, d4, d3, d2, d1, and d0 of the input binary number d. Further, the bits p8, p7, and p6, of p are determined by a second circuit from the integrated circuit based on the first set of auxiliary bits computed by the first circuit. The second circuit includes a second layer of the lookup tables. Further yet, the bits p16, p15, p14, p13, p12, p11, and p10 of p are determined by a third circuit from the integrated circuit based on auxiliary bits computed by the second circuit. The third circuit includes a third layer of the lookup tables. In one or more embodiments of the present invention, determining the bit p19 of p includes determining, using a subset of lookup tables, that t≤d, wherein $t=\lceil 2^{19}/w\rceil$ is precomputed, and in response to t≤d, p19 is set to 1, and otherwise p19 is set to 0.
In one or more embodiments of the present invention, determining the bit p18 of p includes precomputing threshold values t01, t10, and t11:
$t_{01}=\lceil 2^{18}/w\rceil$,
$t_{10}=\lfloor (2^{19}-1)/w\rfloor$, and
$t_{11}=\lceil (2^{19}+2^{18})/w\rceil$.
In response to (t11≤d) or (t01≤d≤t10), p18 is set to 1, and otherwise to 0.
Further, in one or more embodiments of the present invention, determining the bit p17 of p includes precomputing threshold values:
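$t_{001}=\lceil 2^{17}/w\rceil$,
$t_{010}=\lfloor (2^{18}-1)/w\rfloor$,
$t_{011}=\lceil (2^{18}+2^{17})/w\rceil$,
$t_{100}=\lfloor (2^{19}-1)/w\rfloor$,
$t_{101}=\lceil (2^{19}+2^{17})/w\rceil$,
$t_{110}=\lfloor (2^{19}+2^{18}-1)/w\rfloor$, and
$t_{111}=\lceil (2^{19}+2^{18}+2^{17})/w\rceil$.
The bit p17 is set to 1 in response to any one of t111≤d, t101≤d≤t110, t011≤d≤t100, or t001≤d≤t010 being true, and to 0 otherwise.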
The technical solutions described herein can also be achieved by implementing a system that includes a memory device that stores a known binary number (w), and a multiplication circuit that performs the method to determine the multiplication result (p) of the known binary number with an input binary number (d) that is received dynamically.
Alternatively, in one or more embodiments of the present invention, a neural network system includes a multiplication circuit for performing a method to determine a multiplication result of a weight value with an input value (d) that is received dynamically, the method including configuring several lookup tables in an integrated circuit based on the weight value (w) that is a known value. The lookup tables can be configured in three layers. The method further includes determining a multiplication result (p) of the weight value w and the input value d by determining each bit (pi) from p using the lookup tables based on a specific combination of bits from the weight value w and from the input value d, wherein a notation jx represents the xth bit of j from the right, with bit j0 being the rightmost bit of j.
In yet another embodiment of the present invention, an electronic circuit determines a multiplication result (p) of a weight value (w) and an input value (d) that is received dynamically. Determining the multiplication result includes configuring several lookup tables based on the weight value (w), and determining each respective bit (pi) of the multiplication result (p) using the lookup tables based on specific combination of bits from the weight value w and from the input value d. The notation jx represents the xth bit of j from the right, with bit j0 being the rightmost bit of j.
In another embodiment of the present invention, a field programmable gate array includes several lookup tables, wherein the field programmable gate array performs a method for determining a multiplication result (p) of a weight value (w) and an input value (d) that is received dynamically. The lookup tables can be configured in three layers.
Embodiments of the present invention can include various other implementations such as machines, devices, and apparatus.
Additional technical features and benefits are realized through the techniques of the present invention. Embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed subject matter. For a better understanding, refer to the detailed description and to the drawings.
The specifics of the exclusive rights described herein are particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the embodiments of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The diagrams depicted herein are illustrative. There can be many variations to the diagrams or the operations described therein without departing from the spirit of the invention. For instance, the actions can be performed in a differing order, or actions can be added, deleted, or modified. Also, the term “coupled” and variations thereof describe having a communications path between two elements and do not imply a direct connection between the elements with no intervening elements/connections between them. All of these variations are considered a part of the specification.
Exemplary embodiments of the present invention provide improved efficiency for computing a multiplication in computing systems, particularly in the case where one of the values to be multiplied is a fixed (known) value, and the second value to be multiplied is dynamically input. Exemplary embodiments of the present invention provide a multiplication circuit for performing such a computation efficiently. The values are represented in digital format using binary numbers. In one or more embodiments of the present invention, a field-programmable gate array (FPGA) includes a plurality of lookup tables (LUTs), the LUTs being configured in n layers to realize the multiplication circuit.
As a brief introduction, FPGAs typically contain an array of programmable logic blocks and a hierarchy of reconfigurable interconnects that allow the blocks to be “wired together,” like several logic gates that can be coupled together in different configurations. Logic blocks can be configured to perform combinational functions, logic gates like AND and XOR, and other functions. The logic blocks also include memory elements, such as flip-flops or more complete blocks of memory. It should be noted that FPGAs can include components different from those described herein; the above is an exemplary FPGA.
A technical challenge with computing systems is improving the time required for the computing system to perform calculations such as multiplication of numbers represented as binary numbers. Technical solutions provided by embodiments of the present invention address such technical challenges by providing an n layered multiplication circuit that performs a multiplication in deterministic time for two binary numbers—one fixed-value number and one variable number that is input at runtime. One or more embodiments of the present invention use FPGAs to implement the multiplication circuit using LUTs. As used herein, a “k-to-1 Boolean function” is implemented as a k-input LUT that provides a 1-bit output given a k-bit input.
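As an illustration only, a k-input LUT can be modeled in software as a table of 2^k output bits indexed by the input bits. The following Python sketch is a hypothetical model, not part of the described hardware; the names `make_lut` and `lut_eval` and the choice of the first input as the least significant index bit are assumptions made for this example.

```python
def make_lut(func, k=6):
    """Tabulate a k-to-1 Boolean function as a tuple of 2**k output bits."""
    table = []
    for index in range(2 ** k):
        # bits[0] is taken as the least significant input bit (an assumption)
        bits = [(index >> i) & 1 for i in range(k)]
        table.append(func(*bits) & 1)
    return tuple(table)


def lut_eval(table, *bits):
    """Return the stored output bit for the given input bits."""
    index = sum(b << i for i, b in enumerate(bits))
    return table[index]


# Example: a 6-input LUT holding the parity (XOR) of its six inputs.
parity_lut = make_lut(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
```

Once the table is filled, evaluating the function is a single indexed read, which mirrors why a LUT produces its output bit in a fixed time regardless of the function stored.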
Further, the present document denotes $B=\{0,1\}$; $B^n$ is the set of all n-tuples of zeros and ones, and $\mathcal{B}_n$ is the set of all Boolean functions $B^n\to B$. Also, the present document uses the same symbol x interchangeably to denote both a Boolean vector $x=(x_0,\ldots,x_{k-1})\in B^k$ and a natural number $x=\sum_{i=0}^{k-1}x_i2^i$. The operation of the addition of two Boolean vectors, including two Boolean scalars, is denoted without confusion by the same ‘+’ sign. Accordingly, the technical challenge restated using the terminology just established is to compute the product of an input value d and a fixed-value weight w.
In one or more embodiments of the present invention, the computing system 110 can be an artificial neural network system. Alternatively, the computing system 110 is a desktop computer, a server computer, a tablet computer, or any other type of computing device that uses the multiplication circuit 115 to compute a product of two binary numbers, one of which has a known value (w).
In one or more embodiments of the present invention, the input source 120 can be a memory, a storage device, from which the input-value is provided to the multiplication circuit 115. Alternatively, or in addition, the input-value can be input to the multiplication circuit 115 directly upon acquisition, for example, the input source 120 is a sensor, such as a camera, an audio input device, or any other type of sensor that captures data in a form that can be input to the multiplication circuit 115.
The system 200 and/or the components of the system 200 can be employed to use hardware and/or software to solve problems that are highly technical in nature, that are not abstract and that cannot be performed as a set of mental acts by a human. For example, system 200 and/or the components of the system 200 can be employed to use hardware and/or software to perform operations, including facilitating an efficiency within a neural network. Furthermore, some of the processes performed can be performed by specialized computers for carrying out defined tasks related to facilitating efficiency within a neural network. System 200 and/or components of the system 200 can be employed to solve new problems that arise through advancements in technology, computer networks, the Internet, and the like. System 200 can further provide technical improvements to live and Internet-based learning systems by improving processing efficiency among processing components associated with facilitating efficiency within a neural network.
System 200, as depicted in
The neural network of system 200 presents a simplified example so that certain features can be emphasized for clarity. It can be appreciated that the present techniques can be applied to other neural networks, including ones that are significantly more complex than the neural network of system 200.
In the context of artificial neural networks, each of the neurons performs a computation, for example, during various phases, such as forward propagation, backward propagation, and weight update. Such computations can include multiplication. In one or more embodiments of the present invention, the computations include multiplication of a weight-value assigned to the neuron (which is a known and fixed value), and an input value (which can be variable). Here, the “weight-value” represents the weight that is assigned to a neuron in the neural network 200, and the “input-value” is a value received by that neuron to calculate the output. The calculation can be performed during the training of the neural network or during inference using the neural network. In one or more embodiments of the present invention, the calculation can be performed during any phase of the training, forward propagation, backward propagation, weight update, or any other phase. The performance of the neural network 200 can be improved if the efficiency of the multiplication operation can be improved. One or more embodiments of the present invention facilitate a faster way of calculating a multiplication of an input-value (d) with the weight-value (w). Further, one or more embodiments of the present invention facilitate hardware components to support such calculation using LUTs.
It is noted that although
The technical solutions provided by one or more embodiments of the present invention are now described using an exemplary case where w is an 8-bit value, and d is a 12-bit value. It is understood that in other embodiments of the present invention, the values can have a different number of bits. However, for explaining the operation of the technical solutions of the present invention, the above example scenario is chosen. Accordingly, the computational problem is defined by a fixed nonzero weight-value, which is a vector of bit-values $w=(w_0,\ldots,w_7)\in B^8$, so $w=\sum_{i=0}^{7}w_i2^i$. The input-value is a vector of bit-values $d=(d_0,\ldots,d_{11})\in B^{12}$, so $d=\sum_{i=0}^{11}d_i2^i$.
The input-value d can also be represented as $d=g\cdot 2^6+h$, where g and h are integers such that $0\le g,h<2^6$. Accordingly, $g=\lfloor d/2^6\rfloor$ and $h=d-g\cdot 2^6$.
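The decomposition of d into an upper part g and a lower part h can be sketched in Python as follows. The function name `split_input` is hypothetical and is used only for illustration; the shift and mask are a software stand-in for simply routing the upper and lower six wires of d.

```python
def split_input(d):
    """Split a 12-bit input d into its upper six bits g and lower six bits h."""
    if not 0 <= d < 2 ** 12:
        raise ValueError("d must be a 12-bit value")
    g = d >> 6       # g = floor(d / 2**6)
    h = d & 0x3F     # h = d - g * 2**6
    return g, h
```

For example, d=1000 gives g=15 and h=40, because 15·64+40=1000.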
In this case, p is a vector of bit-values such that $p=(p_0,\ldots,p_{19})\in B^{20}$, so $p=\sum_{i=0}^{19}p_i2^i$, and hence, $0\le p\le 2^{20}-1$. In this document, a function is denoted $P_i(d)=P_i(d;w)$, where the function returns $p_i$, for $i=0,\ldots,19$.
Now, if n is any natural number, and if g is an n-to-1 Boolean function, and if ƒ1, . . . , ƒn are n 6-to-1 Boolean functions, the composition h(x)=g(ƒ1(x), . . . , ƒn(x)) is also a 6-to-1 Boolean function. Further, let $x\in\{0,\ldots,2^6-1\}$ and let y=x+1, where x and y are Boolean vectors. For i=0, . . . , 5:
Under these conditions, for i=0, . . . , 5, y, =x, if and only if Ji(x)=0, and y6=1 if and only if J5(x)=1.
It can be proven that for every natural number $n\in\mathbb{N}$, there exists a 2-level circuit of 6-to-1 Boolean functions, where the circuit decides for every $d\in B^{12}$ whether or not d<n. If $n\ge 2^{12}$, then because $d<2^{12}$, the problem is trivial. Hence, consider that $n<2^{12}$. In this case, the number n can be uniquely represented as $n=a\cdot 2^6+b$, where a and b are integers such that $0\le a,b<2^6$. Here, d<n if and only if either (i) g<a, or (ii) g=a and h<b. Define the following functions, where ϕ(·) denotes the truth value (1 for true, 0 for false) of a condition:
$G(d)=\phi(g<a),\quad E(d)=\phi(g=a),\quad H(d)=\phi(h<b).$
Each of these functions is a 6-to-1 Boolean function. It should be noted that although $d\in B^{12}$, each of these functions operates on only six bits of d. Let $J\colon B^3\to B$ be the following:
$J(G,E,H)=G\vee(E\wedge H).$
It follows that
$d<n\Leftrightarrow J[G(d),E(d),H(d)]=1.$
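The two-level comparison can be sketched in Python as follows. This is a minimal model under stated assumptions: the three first-level bits are taken to be the truth values of g<a, g=a, and h<b, the second-level function is taken to be G∨(E∧H), and the function name `less_than_two_level` is hypothetical.

```python
def less_than_two_level(d, n):
    """Decide d < n for a 12-bit d with three first-level bits and one combiner."""
    if n >= 2 ** 12:
        return 1                   # trivial case: every 12-bit d is below n
    a, b = n >> 6, n & 0x3F        # n = a * 2**6 + b
    g, h = d >> 6, d & 0x3F        # d = g * 2**6 + h
    G = int(g < a)                 # first level: depends only on the upper six bits of d
    E = int(g == a)                # first level: depends only on the upper six bits of d
    H = int(h < b)                 # first level: depends only on the lower six bits of d
    return G | (E & H)             # second level: a 3-input function of (G, E, H)
```

Each first-level bit depends on at most six bits of d, so each fits in one 6-input LUT; a and b are constants derived from n and are folded into the table contents rather than supplied as inputs.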
Accordingly, to improve the operation of the computing system 110, the multiplication circuit 115 has to be a circuit of FPGAs of minimum possible depth (i.e., number of layers) so that given a fixed weight-value w, the LUTs can be programmed for calculating the product w×d for any input $d\in B^{12}$:
Here, each of the di, wi, and pi is a bit-value.
The method 400 includes configuring the several LUTs 350 in the multiplier circuit 115 based on the weight-value w, at block 402. Each LUT 350 provides an output bit based on a set of input bits. The input bits to the LUT 350 can include one or more bit-values from d. In addition, the input bits to the LUT 350 can include one or more auxiliary bit-values output by another LUT 350. In one or more cases, the auxiliary bit-values from the first circuit 310 are used as input bits to one or more LUTs 350 in the other circuits (320 and 330). Similarly, auxiliary bits from the second circuit 320 can be used as input bits to LUTs 350 of the third circuit 330.
The method 400 includes determining an output of the first circuit 310, at block 410. Part of the output from the first circuit 310 includes a predetermined number of LSBs 510 of p, at block 412. As shown in
Further, the first circuit 310 determines a second set of auxiliary bit-values (q0 to q13) 530 that is used as input bit-values to other LUTs 350 from the multiplication circuit 115, at block 416. The second set of auxiliary bit-values 530 is determined based on the six most significant bits (MSBs) of d (d6 to d11). The computation of the second set of auxiliary bit-values 530 can be expressed as:
In addition, the first circuit 310 determines a first ancillary bit-value (q9.8) 540, at block 418. The first ancillary bit-value is communicated to the second circuit 320 and represents:
$q_{9.8}=q_9\wedge q_8.$
The method 400 includes determining the output bit-values of the second circuit 320, at block 420. Part of the output of the second circuit 320 includes a predetermined number of bits 610 of p, at block 422. As shown in
Further, the second circuit 320 determines a third set of auxiliary bit-values (y9 to y12) 630, at block 426. The third set of auxiliary bit-values is determined using subsets of the first set of auxiliary bit-values 520 and the second set of auxiliary bit-values 530 (r11, r10, r9, and q5, q4, q3). The computation can be expressed as:
The second circuit 320 further determines a third ancillary bit-value (y10.9) 640 and a fourth ancillary value (y11.9) 650, at block 427. The ancillary bit-values are used by the third circuit 330. In one or more embodiments of the present invention, the third ancillary bit-value (y10.9) 640 and the fourth ancillary value (y11.9) 650 are part of the third set of auxiliary values 630. The ancillary bit-values represent a combination of one or more bit-values from the third set of auxiliary bit-values, and the computation can be expressed as:
$y_{10.9}=y_{10}\wedge y_9$
$y_{11.9}=y_{11}\wedge y_{10}\wedge y_9$ (6)
Further, the second circuit 320, using (r13, r12, and q10, q9, q8, q7, q6), determines a fourth set of auxiliary bit-values (z16 to z12) 660, at block 428. The computation can be expressed as:
Further yet, the second circuit 320 further determines a fifth ancillary bit-value (z14.13) 670 and a sixth ancillary value (z15.13) 680, at block 429. The computation of the ancillary bit-values can be expressed as:
$z_{14.13}=z_{14}\wedge z_{13}$ (8)
$z_{15.13}=z_{15}\wedge z_{14}\wedge z_{13}.$ (9)
Additionally, the bit-value z16 662 represents the combination:
$z_{16}=q_{10}\otimes\big((q_9\wedge q_8)\wedge((r_{13}\wedge q_7)\vee(r_{13}\wedge r_{12}\wedge q_6)\vee(q_7\wedge r_{12}\wedge q_6))\big)$ (16)
Here, $a\otimes b=(a\wedge\neg b)\vee(\neg a\wedge b)$. This implies that q9 and q8 can be replaced by q9.8=q9∧q8 for the computation of z16. Accordingly, z16 662 can be obtained as a function of six input bit-values (r13, r12, and q10, q9.8, q7, q6) to a LUT 350.
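The expression for z16 can be checked against ordinary integer addition with a short Python sketch. Two assumptions are made here and are not stated in this form in the text: the operand bit q6 is taken to align with r12 (as suggested by z12 depending only on r12 and q6), and the middle carry term is taken to be r13∧r12∧q6, consistent with the six-input list (r13, r12, q10, q9.8, q7, q6). The function names are hypothetical.

```python
from itertools import product


def z16_formula(r13, r12, q10, q9, q8, q7, q6):
    """z16 as q10 XOR the carry into bit 16, written as a Boolean expression."""
    # Carry out of bit 13: majority of r13, q7 and the carry r12 & q6 out of bit 12.
    carry14 = (r13 & q7) | (r13 & r12 & q6) | (q7 & r12 & q6)
    # The carry keeps rippling only while q8 and q9 are both 1.
    return q10 ^ (q9 & q8 & carry14)


def z16_reference(r13, r12, q10, q9, q8, q7, q6):
    """Bit 16 of the integer sum of the two aligned operands (assumed alignment)."""
    r = (r13 << 13) | (r12 << 12)
    q = (q10 << 16) | (q9 << 15) | (q8 << 14) | (q7 << 13) | (q6 << 12)
    return ((r + q) >> 16) & 1


# Compare the two definitions over all 2**7 input combinations.
mismatches = [bits for bits in product((0, 1), repeat=7)
              if z16_formula(*bits) != z16_reference(*bits)]
```

Because q9 and q8 enter the expression only through their conjunction, the seven inputs collapse to six once q9.8 is supplied, which is what allows z16 to fit in a single 6-input LUT.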
Referring back to
The bit p9 is determined using bits (y9, x9) based on:
$p_9\equiv p_9(y_9,x_9)=(y_9+x_9) \pmod 2$
The corresponding LUT 350 is depicted in view 910 of
The bit p10 is determined using bits (y10, y9, x9) based on:
The corresponding LUT 350 is depicted in view 920 of
The bit p11 is determined using bits (y11, y10, y9, x9) based on:
The dependence of p11 on y10 and y9 is only to check whether or not (y10, y9)=(1, 1). Hence, y10 and y9 can be replaced by the third ancillary bit 640 (y10.9). The corresponding LUT 350 is depicted in view 930 of
Calculating p12 requires computing the addition:
where c13 is not used, and p9, p10, and p11, are determined as described earlier. Here, the dependence of p12 on y11, y10, and y9 is only to check whether or not (y11, y10, y9)=(1, 1, 1). Hence, y11, y10, and y9 can be replaced by the fourth ancillary bit 650 (y11.9). The corresponding LUT 350 is depicted in view 940 of
Calculating p13 requires computing the addition:
where c14 is not used, and p9, p10, p11, and p12, are determined as described earlier. Here, the dependence of p13 on y11, y10, and y9 is only to check whether or not (y11, y10, y9)=(1, 1, 1). Hence, y11, y10, and y9 can be replaced by a single bit—the fourth ancillary bit 650 (y11.9). Accordingly, p13 can be determined as a function of five variables (z13, z12, y12, y11.9, x9). The corresponding LUT 350 is depicted in view 950 of
Calculating p14 requires computing the addition:
where c15 is not used, and p9, p10, p11, p12, and p13 are determined as described earlier.
Here, the addition is a function of eight variables. As noted earlier, the technical solutions herein overcome the technical challenge of handling such cases with more than six input bit-values. In this particular case, y11, y10, and y9 can be replaced by a single bit—the fourth ancillary bit 650 (y11.9) so that p14 can be determined using the LUT 350 shown in view 1010 of
Calculating p15 requires computing the addition:
where c16 is not used, and p9, p10, p11, p12, p13, and p14 are determined as described earlier.
Here, the addition is a function of nine variables. Again, y11, y10, and y9 can be replaced by a single bit, the fourth ancillary bit 650 (y11.9). Furthermore, the dependence of p15 on z14 and z13 is only to check whether or not (z14, z13)=(1, 1). Hence, z14 and z13 can be replaced by the fifth ancillary bit-value 670 (z14.13), so that p15 can be determined using the LUT 350 shown in view 1110 of
Calculating p16 requires computing the addition:
where (q13, . . . , q6) and (r13, r12) are computed in first circuit 310, and (y12, y11, y10, y9) and x9 are computed in the second circuit 320.
As mentioned earlier, the third circuit 330 uses several ancillary bit-values and auxiliary bit-values that are determined by the first circuit 310 and the second circuit 320. For example, the ancillary bit-value (q10.8) 550 is computed at the first circuit 310 to represent:
$q_{10.8}=q_{10}\wedge q_9\wedge q_8=q_{10}\cdot q_9\cdot q_8.$
Further, in the second circuit 320, the result of the following addition can be determined using the LUTs 350:
The result of the addition can be computed as a Boolean function of at most six inputs that are computed in the first circuit 310 as follows:
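$z_{12}=z_{12}(r_{13}\text{ unused};\ r_{12},q_6)=z_{12}(r_{12},q_6)$
$z_{13}=z_{13}(r_{13},r_{12},q_7,q_6)$
$z_{14}=z_{14}(r_{13},r_{12},q_8,q_7,q_6)$
$z_{15}=z_{15}(r_{13},r_{12},q_9,q_8,q_7,q_6)$
$z_{16}=z_{16}(r_{13},r_{12},q_{10},q_{9.8},q_7,q_6),$
and the bit
$z_{15.13}=z_{15.13}(r_{13},r_{12},q_9,q_8,q_7,q_6)=z_{15}\cdot z_{14}\cdot z_{13}$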
The z15.13 bit-value is the sixth ancillary bit-value 680. The bit p16 can be determined in the third circuit 330 as a Boolean function of six inputs that are determined by the LUTs 350 in the first circuit 310 and/or the second circuit 320:
$p_{16}=p_{16}(z_{16},z_{15.13},z_{12},y_{12},y_{11.9},x_9)$
The view 1210 in
Determining the final three MSBs using LUTs 350 is based on the following description of correctness. If N, d, and w, are integers, then:
The part (i) above holds true because if
In the case (ii) above, if
Now, a description is provided for determining the MSB p19 using two layers of LUTs 350. Consider $t=\lceil 2^{19}/w\rceil$. Then $P_{19}(d)=1\leftrightarrow d\ge t$. This holds true because P19(d)=1 if and only if $w\times d\ge 2^{19}$. It should be noted that here, 19 is used because the product of a 12-bit d and an 8-bit w has at most 20 bits, so p19 is the most significant bit of the result. However, in the cases where w or d have a different number of bits, the exponent in the above condition is different. Based on the description herein, a person skilled in the art can determine that for every $d\in B^{12}$, the function P19(d) can be evaluated using two layers of LUTs 350.
Accordingly, referring back to the flowchart in
Using only 6-to-1 Boolean functions, the first LUT 1310 determines if u<g, the second LUT 1320 determines if g=u and the third LUT 1330 determines if v≤h. The output of the LUTs is a 1, if the respective conditions hold true, and 0 otherwise. Further, a fourth LUT 1340 receives the outputs from the first LUT 1310, the second LUT 1320, and the third LUT 1330. Depending on the received bit-values, the fourth LUT 1340 determines the value of p19.
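The four-LUT arrangement for p19 can be modeled in Python as follows. This is a sketch under assumptions: u and v are taken to be the upper and lower six bits of the precomputed threshold t (so that t=u·2^6+v), the early return for thresholds that no 12-bit d can reach is added for this example, and the function name `p19_two_layer` is hypothetical.

```python
def p19_two_layer(d, w):
    """Model of p19: three first-layer comparisons and one second-layer combiner."""
    t = -(-2 ** 19 // w)           # t = ceil(2**19 / w), precomputed from the known w
    if t >= 2 ** 12:
        return 0                   # assumption: threshold unreachable, p19 is constant 0
    u, v = t >> 6, t & 0x3F        # t = u * 2**6 + v
    g, h = d >> 6, d & 0x3F        # d = g * 2**6 + h
    lut1 = int(u < g)              # first LUT: u < g
    lut2 = int(g == u)             # second LUT: g = u
    lut3 = int(v <= h)             # third LUT: v <= h
    return lut1 | (lut2 & lut3)    # fourth LUT combines the three bits
```

The result agrees with bit 19 of the integer product w×d for the weights exercised in testing; u and v never appear as LUT inputs because they are constants folded into the table contents when the LUTs are configured.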
Now, a description is provided for determining the MSB p18 using two layers of LUTs 350. It should be noted that p18=1, if and only if one of the following conditions holds:
$2^{18}\le w\times d<2^{19}$ (i)
$2^{19}+2^{18}\le w\times d$ (ii)
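The following three thresholds can be precomputed based on the known w:
$t_{01}=\lceil 2^{18}/w\rceil$, $t_{10}=\lfloor (2^{19}-1)/w\rfloor$, and $t_{11}=\lceil (2^{19}+2^{18})/w\rceil$.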
Accordingly, the method 400 includes setting p18 to 1, at block 452, and else to 0, at block 454, based on the third circuit 330 determining, at block 450, whether the following condition holds:
$t_{01}\le d\le t_{10}$ (i)
$t_{11}\le d.$ (ii)
The above is further equivalent to setting p18 to 1 if and only if one of the following conditions holds:
$(t_{01}<d<t_{10})$ or $(t_{11}<d)$ (i)
$d\in\{t_{01},t_{10},t_{11}\}.$ (ii)
Again, consider d represented as $d=g\cdot 2^6+h$. Let us denote B2={01, 10, 11}, and then, for every β∈B2, $t_\beta=u_\beta\cdot 2^6+v_\beta$, where $0\le v_\beta<2^6$. Accordingly, determining the value for p18 can be stated as p18=1 if and only if one of the following conditions holds:
(i) both of the following hold:
1. g>u01 or (g=u01 and h≥v01) (i.e., d≥t01), and
2. g<u10 or (g=u10 and h≤v10) (i.e., d≤t10);
(ii) g>u11 or (g=u11 and h≥v11) (i.e., d≥t11).
This can be simplified as p18=1 if and only if one of the following four conditions holds:
00:(u01<g<u10) or (u11<g)
01:g=u01 and h≥v01
10:g=u10 and h≤v10
11: g=u11 and h≥v11
Consider the notation in which any relation between variables, such as x<y or z≤w, is associated with a truth value ϕ(x<y)∈{0,1}, where 1 is “true” and 0 is “false,” and to which logical connectives can be applied in the form, for example, ϕ(x=y)=ϕ(x≤y)∧ϕ(x≥y).
Accordingly, the above four conditions can be succinctly stated as:
$P_{18}(d)=[\phi(t_{01}\le d)\wedge\phi(d\le t_{10})]\vee\phi(t_{11}\le d).$
The values for t01, t10, and t11 can be represented as:
$t_{01}=u_{01}\cdot 2^6+v_{01}$ where $0\le v_{01}<2^6$,
$t_{10}=u_{10}\cdot 2^6+v_{10}$ where $0\le v_{10}<2^6$,
$t_{11}=u_{11}\cdot 2^6+v_{11}$ where $0\le v_{11}<2^6$.
Now, for every β∈{01, 10, 11}, $u_\beta=\lfloor t_\beta/2^6\rfloor$; hence, u01≤u10≤u11. Each possible number g is related to u01, u10, and u11 in one of seven possible ways, which can be labeled with three bits as follows:
(001):g<u01
(010):g=u01
(011):u01<g<u10
(100):g=u10
(101):u10<g<u11
(110):g=u11
(111):u11<g
It follows that given g, the particular case label (x1, x2, x3) can be returned by three 6-to-1 Boolean functions, x1(g), x2(g), x3(g). Given (x1, x2, x3), the information required for the evaluation of P18(d) can be expressed as:
(u01<g)≡x1∨(¬x1∧x2∧x3)
(u01=g)≡¬x1∧x2∧¬x3
(g<u10)≡¬x1
(g=u10)≡x1∧¬x2∧¬x3
(u11<g)≡x1∧x2∧x3
(u11=g)≡x1∧x2∧¬x3
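The three-bit case label and the comparisons recovered from it can be sketched in Python as follows. The sketch assumes u01<u10<u11 are distinct, uses hypothetical function names, and reads the equality case g=u01 directly from the label table (label 010), that is, as ¬x1∧x2∧¬x3.

```python
def label(g, u01, u10, u11):
    """Return the case label (x1, x2, x3) of g, assuming u01 < u10 < u11."""
    if g < u01:
        code = 0b001
    elif g == u01:
        code = 0b010
    elif g < u10:
        code = 0b011
    elif g == u10:
        code = 0b100
    elif g < u11:
        code = 0b101
    elif g == u11:
        code = 0b110
    else:
        code = 0b111
    return (code >> 2) & 1, (code >> 1) & 1, code & 1


def predicates(x1, x2, x3):
    """Recover the comparisons needed for P18 from the label bits alone."""
    n1, n2, n3 = 1 - x1, 1 - x2, 1 - x3
    return {
        "u01<g": x1 | (n1 & x2 & x3),
        "u01=g": n1 & x2 & n3,
        "g<u10": n1,
        "g=u10": x1 & n2 & n3,
        "u11<g": x1 & x2 & x3,
        "u11=g": x1 & x2 & n3,
    }
```

With example values such as u01=10, u10=20, and u11=40 (chosen only for illustration), every g in 0..63 yields predicates that match the direct integer comparisons.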
The situation with respect to the relation of h to the vβ is simpler. The only information required for the evaluation of P18(d) is captured by the following 6-to-1 Boolean functions: $y_1(h)=\phi(h\ge v_{01})$, $y_2(h)=\phi(h\le v_{10})$, and $y_3(h)=\phi(h\ge v_{11})$.
At this stage, it can be shown that for every d∈B12 the function P18(d) can be evaluated in two layers. The evaluation of P18(d) relies on the inequality relations of d to the tβs. Accordingly:
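$d\ge t_{01}\leftrightarrow(g>u_{01})\vee[(g=u_{01})\wedge(h\ge v_{01})]$
$d\le t_{10}\leftrightarrow(g<u_{10})\vee[(g=u_{10})\wedge(h\le v_{10})]$
$d\ge t_{11}\leftrightarrow(g>u_{11})\vee[(g=u_{11})\wedge(h\ge v_{11})]$
Further:
$\phi(d\ge t_{01})=\phi(g>u_{01})\vee(\phi(g=u_{01})\wedge y_1(h))$
$\phi(d\le t_{10})=\phi(g<u_{10})\vee(\phi(g=u_{10})\wedge y_2(h))$
$\phi(d\ge t_{11})=\phi(g>u_{11})\vee(\phi(g=u_{11})\wedge y_3(h))$
Thus, it follows that the relations of d to the tβs can be evaluated by the following 6-to-1 Boolean function, applied to the 6-tuple (x1(g), x2(g), x3(g), y1(h), y2(h), y3(h)).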
It can be shown that if 1≤w≤256, then u01<u10. That is because even if w=256,
Now, a description is provided for determining the third MSB p17 using three layers of LUTs. p17=1 if and only if one of the following conditions holds:
$2^{17}\le w\times d<2^{18}$ (i)
$2^{18}+2^{17}\le w\times d<2^{19}$ (ii)
$2^{19}+2^{17}\le w\times d<2^{19}+2^{18}$ (iii)
$2^{19}+2^{18}+2^{17}\le w\times d$ (iv)
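The following seven thresholds can be precomputed based on the known w:
$t_{001}=\lceil 2^{17}/w\rceil$, $t_{010}=\lfloor (2^{18}-1)/w\rfloor$, $t_{011}=\lceil (2^{18}+2^{17})/w\rceil$, $t_{100}=\lfloor (2^{19}-1)/w\rfloor$, $t_{101}=\lceil (2^{19}+2^{17})/w\rceil$, $t_{110}=\lfloor (2^{19}+2^{18}-1)/w\rfloor$, and $t_{111}=\lceil (2^{19}+2^{18}+2^{17})/w\rceil$.
The conditions can be restated using these seven thresholds as: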
$t_{001}\le d\le t_{010}$ (i)
$t_{011}\le d\le t_{100}$ (ii)
$t_{101}\le d\le t_{110}$ (iii)
$t_{111}\le d.$ (iv)
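The seven-threshold form for p17 can likewise be checked numerically. In the Python sketch below, the function names are hypothetical and the comparisons are plain integer comparisons rather than the layered LUT evaluation; the sketch only illustrates that the four threshold intervals reproduce bit 17 of the product.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def p17_thresholds(w):
    """Precompute the seven thresholds for p17 from the known weight w (w >= 1)."""
    a, b, c = 2 ** 17, 2 ** 18, 2 ** 19
    return {
        "001": ceil_div(a, w),
        "010": (b - 1) // w,
        "011": ceil_div(b + a, w),
        "100": (c - 1) // w,
        "101": ceil_div(c + a, w),
        "110": (c + b - 1) // w,
        "111": ceil_div(c + b + a, w),
    }


def p17_from_thresholds(d, w):
    """p17 is 1 when d falls in any of the four threshold intervals."""
    t = p17_thresholds(w)
    return int(t["001"] <= d <= t["010"]
               or t["011"] <= d <= t["100"]
               or t["101"] <= d <= t["110"]
               or t["111"] <= d)
```

The odd-labeled thresholds (001, 011, 101, 111) are lower bounds obtained with a ceiling, and the even-labeled ones (010, 100, 110) are upper bounds obtained with a floor.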
Referring to the flowchart in
$(t_{001}<d<t_{010})$ or $(t_{011}<d<t_{100})$ or $(t_{101}<d<t_{110})$ or $(t_{111}<d)$ (i)
$d\in\{t_{001},t_{010},t_{011},t_{100},t_{101},t_{110},t_{111}\}.$ (ii)
Again, consider d represented as $d=g\cdot 2^6+h$. Let us denote B3={001, 010, 011, 100, 101, 110, 111}, and then, for every β∈B3, $t_\beta=u_\beta\cdot 2^6+v_\beta$, where $0\le v_\beta<2^6$. Accordingly, determining the value for p17 can be stated as p17=1 if and only if one of the following conditions holds:
(i) both of the following hold:
1. g>u001 or (g=u001 and h≥v001) (i.e., d≥t001), and
2. g<u010 or (g=u010 and h≤v010) (i.e., d≤t010);
(ii) both of the following hold:
1. g>u011 or (g=u011 and h≥v011) (i.e., d≥t011), and
2. g<u100 or (g=u100 and h≤v100) (i.e., d≤t100);
(iii) both of the following hold:
1. g>u101 or (g=u101 and h≥v101) (i.e., d≥t101), and
2. g<u110 or (g=u110 and h≤v110) (i.e., d≤t110);
(iv) g>u111 or (g=u111 and h≥v111) (i.e., d≥t111).
This can be simplified as p17=1 if and only if one of the following eight conditions holds:
000:(u001<g<u010) or (u011<g<u100) or (u101<g<u110) or (u111<g)
001:g=u001 and h≥v001
010:g=u010 and h≤v010
011:g=u011 and h≥v011
100:g=u100 and h≤v100
101:g=u101 and h≥v101
110:g=u110 and h≤v110
111:g=u111 and h≥v111
Any Boolean function of (x1, . . . , x4) is also a Boolean function of g, so it can be evaluated in the first circuit 310. Accordingly, to compress the representation to three bits based on the following:
If g satisfies one of the following, then P17(d)=1:
$u_{001}<g<u_{010}$ (i)
$u_{011}<g<u_{100}$ (ii)
$u_{101}<g<u_{110}$ (iii)
$u_{111}<g$ (iv)
If g satisfies one of the following, then P17(d)=0:
$g<u_{001}$ (i)
$u_{010}<g<u_{011}$ (ii)
$u_{100}<g<u_{101}$ (iii)
$u_{110}<g<u_{111}$ (iv)
However, this still needs encoding of seven possible equalities, namely g=uβ, β∈B3, and so, three bits cannot be used to encode nine cases. However, for every β∈B3, the case g=uβ can be encoded by 0β; for example, g=u011 is encoded by 0011. Accordingly, the case for P17(d)=1 can be encoded by:
(u001<g<u010) or (u011<g<u100) or (u101<g<u110) or (u111<g)
Additionally, the case for P17(d)=0 can be encoded by:
(u010<g<u011) or(u100<g<u101) or(u110<g<u111)
This four-bit encoding is denoted as (z1, z2, z3, z4). The encoding is depicted by view 1510 in
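$z_1=\phi(u_{001}<g<u_{010})\vee\phi(u_{011}<g<u_{100})\vee\phi(u_{101}<g<u_{110})\vee\phi(u_{111}<g)$
$z_2=\phi(g=u_{100})\vee\phi(g=u_{101})\vee\phi(g=u_{110})\vee\phi(g=u_{111})$
$z_3=\phi(g=u_{010})\vee\phi(g=u_{011})\vee\phi(g=u_{110})\vee\phi(g=u_{111})$
$z_4=\phi(g=u_{001})\vee\phi(g=u_{011})\vee\phi(g=u_{101})\vee\phi(g=u_{111})$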
The situation with respect to the relation of h to the vβs is simpler than that of g, but more complicated than in the case of p18. Here, the following truth-values are required:
ϕ(h≥v001),ϕ(h≤v010),
ϕ(h≥v011),ϕ(h≤v100),
ϕ(h≥v101),ϕ(h≤v110),
ϕ(h≥v111).
Let v1<v2< . . . <vl (l≤7) be the distinct elements of {vβ:β∈B3} and let ψ(β)∈{1, . . . , l} be the index such that vψ(β)=vβ. Thus, the above seven values can be expressed as:
ϕ(h≥vψ(001)),ϕ(h≤vψ(010)),
ϕ(h≥vψ(011)),ϕ(h≤vψ(100)),
ϕ(h≥vψ(101)),ϕ(h≤vψ(110)),
ϕ(h≥vψ(111)),
The relations of h to all of these seven values can be captured by three bits (y1, y2, y3) as follows. First note that each occurrence of vβ is involved in precisely one inequality; namely, depending on β, either ϕ(h≥vβ) has to be known or ϕ(h≤vβ) has to be known. The former occurs when β∈{001, 011, 101, 111}, and the latter when β∈{010, 100, 110}. Therefore, there are precisely seven cases that are needed to characterize the location of h with respect to v1<v2< . . . <vl so that the information required is retrieved. The cases can be viewed as a partition of the set {0, 1, . . . , 63} into at most eight intervals, some of which may consist of a single point. These seven cases are defined by inserting the seven inequality signs that occur in the above seven conditions in the appropriate places. The inequality signs also take into account v1, . . . , vl as follows. Let 1≤i≤l, and consider a certain value vi. If i=ψ(β), then (i) if β∈{001, 011, 101, 111}, then include an inequality ≤vi, and (ii) if β∈{010, 100, 110}, then include an inequality vi≤. If there exist β1∈{001, 011, 101, 111} and β2∈{010, 100, 110} such that i=ψ(β1)=ψ(β2), then two inequalities are included: ≤vi and vi≤. If it has to be known whether or not h≤vβ, then in the partition vβ must be the right endpoint of one of the intervals, and if it has to be known whether or not h≥vβ, then in the partition vβ must be the left endpoint of one of the intervals. This way, if it is known which of the intervals contains h, then the information that is required about its relation to any vβ is known. For example, suppose:
v001=3 v010=4
v011=7 v100=8
v101=11 v110=12
v111=15
Here, it is to be determined whether or not h≥3, whether or not h≤4, whether or not h≥7, whether or not h≤8, whether or not h≥11, whether or not h≤12, and whether or not h≥15. Therefore, the partition into eight intervals is the following:
000:0≤h≤2
001:3≤h≤4
010:5≤h≤6
011:7≤h≤8
100:9≤h≤10
101:11≤h≤12
110:13≤h≤14
111:15≤h≤63
The labels on the left provide the encoding with three bits, so the corresponding three Boolean functions (represented in view 1520) are:
b1(h)=ϕ(9≤h≤63)
b2(h)=ϕ(5≤h≤8)∨ϕ(13≤h≤63)
b3(h)=ϕ(3≤h≤4)∨ϕ(7≤h≤8)∨ϕ(11≤h≤12)∨ϕ(15≤h≤63).
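The construction of the partition and of the three encoding bits in the example above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the circuit itself: the names `V`, `GE_BETAS`, `partition`, and `b_bits` are hypothetical, and the interval label is assumed to be the interval's position in increasing order, with b1 as the most significant bit.

```python
# Example thresholds v_beta from the text, keyed by beta as a 3-bit string.
V = {"001": 3, "010": 4, "011": 7, "100": 8,
     "101": 11, "110": 12, "111": 15}
# For these beta the test phi(h >= v_beta) is needed; for the others phi(h <= v_beta).
GE_BETAS = {"001", "011", "101", "111"}


def partition(v, h_max=63):
    """Split {0..h_max} so every required test is constant on each interval."""
    starts = {0}
    for beta, val in v.items():
        # ">= val" makes val a left endpoint; "<= val" makes val a right
        # endpoint, so the next interval starts at val + 1.
        starts.add(val if beta in GE_BETAS else val + 1)
    starts = sorted(s for s in starts if s <= h_max)
    ends = [s - 1 for s in starts[1:]] + [h_max]
    return list(zip(starts, ends))


def b_bits(h, intervals):
    """Return (b1, b2, b3): the 3-bit label of the interval containing h."""
    for label, (lo, hi) in enumerate(intervals):
        if lo <= h <= hi:
            return (label >> 2) & 1, (label >> 1) & 1, label & 1
    raise ValueError("h outside the partitioned range")


INTERVALS = partition(V)
```

With the example thresholds, `partition(V)` reproduces the eight intervals listed above, and `b_bits` agrees with the three Boolean functions b1, b2, and b3 for every h in {0, . . . , 63}.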
Thus, it follows that the relations of d to the tβs can be evaluated by a single 7-to-1 Boolean function ƒ(z1, . . . , z4, b1, b2, b3). This evaluation can be carried out as follows. First, a Boolean function ƒ(z2, z3, z4, b1, b2, b3) is defined and implemented in the second circuit 320 by setting its value to 1 when (z2, z3, z4)≠(0, 0, 0) and every value of h in the interval indicated by (b1, b2, b3) satisfies the bound on h that is required given that g has the value indicated by (z2, z3, z4); otherwise, ƒ(z2, z3, z4, b1, b2, b3)=0. In the third circuit 330, the value of p17 is set to 1 if z1=1 or ƒ(z2, z3, z4, b1, b2, b3)=1; otherwise, p17 is set to 0.
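The two-stage evaluation described above can be sketched in Python as a 64-entry truth table, which is the shape a 6-input LUT would hold. This is a minimal sketch under loudly stated assumptions: it reuses the example thresholds and intervals given earlier, it assumes that (z2, z3, z4) directly encodes β and that (b1, b2, b3) is the interval label with b1 most significant, and the names `V`, `INTERVALS`, `F_LUT`, and `p17` are hypothetical. How z1, . . . , z4 are actually produced is not modeled here.

```python
# Example thresholds v_beta (beta as an integer 1..7) and the example partition.
V = {1: 3, 2: 4, 3: 7, 4: 8, 5: 11, 6: 12, 7: 15}
GE_BETAS = {1, 3, 5, 7}  # beta requiring h >= v_beta; the rest require h <= v_beta
INTERVALS = [(0, 2), (3, 4), (5, 6), (7, 8),
             (9, 10), (11, 12), (13, 14), (15, 63)]


def build_f_lut():
    """64-entry table for f(z2, z3, z4, b1, b2, b3); index = beta * 8 + label."""
    lut = [0] * 64  # entries with (z2, z3, z4) == (0, 0, 0) stay 0
    for beta in range(1, 8):
        for label, (lo, hi) in enumerate(INTERVALS):
            if beta in GE_BETAS:
                ok = lo >= V[beta]  # every h in [lo, hi] satisfies h >= v_beta
            else:
                ok = hi <= V[beta]  # every h in [lo, hi] satisfies h <= v_beta
            lut[(beta << 3) | label] = int(ok)
    return lut


F_LUT = build_f_lut()


def p17(z1, z2, z3, z4, b1, b2, b3):
    """Third-stage combination: p17 = 1 if z1 = 1 or f(...) = 1, else 0."""
    idx = (z2 << 5) | (z3 << 4) | (z4 << 3) | (b1 << 2) | (b2 << 1) | b3
    return 1 if z1 == 1 or F_LUT[idx] == 1 else 0
```

Because each threshold is an endpoint of some interval, testing a single endpoint of the interval is enough to decide the bound for every h inside it.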
Accordingly, embodiments of the present invention provide a circuit of 6-input Boolean gates for multiplying a given (known) Boolean vector w=(w7, w6, w5, w4, w3, w2, w1, w0) by any Boolean input vector d=(d11, d10, d9, d8, d7, d6, d5, d4, d3, d2, d1, d0). w has a predetermined length, for example, eight bits. d has a predetermined length, for example, twelve bits. By implementing the circuit using LUTs, for example, in an FPGA, an ASIC, or any other electronic circuit or device, embodiments of the present invention improve efficiency by determining the product in less time than performing the computation would require. The circuit can be used in a variety of computing environments, such as a neural network system, a computing device, a quantum computer, a mainframe computer, a memory controller, or any other type of apparatus that requires computing multiplications, and particularly where one of the numbers in the multiplication is a known value.
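As a software illustration of the fixed-value idea, the following Python sketch precomputes, for a known nonzero 8-bit w, six 64-entry truth tables that yield p5, . . . , p0 from d5, . . . , d0, and evaluates p19 by the threshold comparison t≤d with t=⌈2^19/w⌉ described earlier in this document. It is a reference model only, not the three-layer circuit; the function names are hypothetical, and each list merely stands in for the contents of a 6-input LUT.

```python
def build_low_luts(w):
    """Six 64-entry tables; table i maps (d5..d0) to product bit p_i."""
    # The low six bits of w*d depend only on the low six bits of d.
    return [[((w * x) >> i) & 1 for x in range(64)] for i in range(6)]


def low_bits_via_luts(luts, d):
    """Assemble p5..p0 from the tables, using only d5..d0 as the index."""
    x = d & 0x3F
    return sum(luts[i][x] << i for i in range(6))


def p19_via_threshold(w, d):
    """p19 = 1 exactly when t <= d, with t = ceil(2**19 / w) precomputed."""
    t = -(-(1 << 19) // w)  # integer ceiling division; assumes w != 0
    return 1 if t <= d else 0


# Usage with an arbitrary example value of w (an assumption for illustration).
w = 0b10110101
luts = build_low_luts(w)
d = 0b101101110011
low = low_bits_via_luts(luts, d)
top = p19_via_threshold(w, d)
```

The threshold test is valid because an 8-bit w times a 12-bit d is always below 2^20, so bit 19 is set exactly when the product is at least 2^19.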
Further, embodiments of the present invention use FPGAs that limit each of the LUTs to use at most six inputs. Accordingly, embodiments of the present invention facilitate a practical application of determining a multiplication result and improving the efficiency of such computations performed by present solutions. Embodiments of the present invention, accordingly, provide an improvement to a particular technology, in this case, computing technology. Further yet, embodiments of the present invention facilitate improvements to present solutions such as neural networks and other types of computing systems by improving their efficiency at computing such multiplications, the results of which are used in various applications.
The neural network system 200 can be implemented using a computer system or any other apparatus. Turning now to
As shown in
The computer system 1700 comprises an input/output (I/O) adapter 1706 and a communications adapter 1707 coupled to the system bus 1702. The I/O adapter 1706 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 1708 and/or any other similar component. The I/O adapter 1706 and the hard disk 1708 are collectively referred to herein as a mass storage 1710.
Software 1711 for execution on the computer system 1700 may be stored in the mass storage 1710. The mass storage 1710 is an example of a tangible storage medium readable by the processors 1701, where the software 1711 is stored as instructions for execution by the processors 1701 to cause the computer system 1700 to operate, such as is described hereinbelow with respect to the various Figures. Examples of computer program products and the execution of such instructions are discussed herein in more detail. The communications adapter 1707 interconnects the system bus 1702 with a network 1712, which may be an outside network, enabling the computer system 1700 to communicate with other such systems. In one embodiment, a portion of the system memory 1703 and the mass storage 1710 collectively store an operating system, which may be any appropriate operating system, such as the z/OS or AIX operating system from IBM Corporation, to coordinate the functions of the various components shown in
Additional input/output devices are shown as connected to the system bus 1702 via a display adapter 1715 and an interface adapter 1716. In one embodiment, the adapters 1706, 1707, 1715, and 1716 may be connected to one or more I/O buses that are connected to the system bus 1702 via an intermediate bus bridge (not shown). A display 1719 (e.g., a screen or a display monitor) is connected to the system bus 1702 by the display adapter 1715, which may include a graphics controller to improve the performance of graphics intensive applications and a video controller. A keyboard 1721, a mouse 1722, a speaker 1723, etc. can be interconnected to the system bus 1702 via the interface adapter 1716, which may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit. Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the Peripheral Component Interconnect (PCI). Thus, as configured in
In some embodiments, the communications adapter 1707 can transmit data using any suitable interface or protocol, such as the internet small computer system interface, among others. The network 1712 may be a cellular network, a radio network, a wide area network (WAN), a local area network (LAN), or the Internet, among others. An external computing device may connect to the computer system 1700 through the network 1712. In some examples, an external computing device may be an external web server or a cloud computing node.
It is to be understood that the block diagram of
The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
Computer-readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source-code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments described herein.
Various embodiments of the invention are described herein with reference to the related drawings. Alternative embodiments of the invention can be devised without departing from the scope of this invention. Various connections and positional relationships (e.g., over, below, adjacent, etc.) are set forth between elements in the following description and in the drawings. These connections and/or positional relationships, unless specified otherwise, can be direct or indirect, and the present invention is not intended to be limiting in this respect. Accordingly, a coupling of entities can refer to either a direct or an indirect coupling, and a positional relationship between entities can be a direct or indirect positional relationship. Moreover, the various tasks and process steps described herein can be incorporated into a more comprehensive procedure or process having additional steps or functionality not described in detail herein.
The following definitions and abbreviations are to be used for the interpretation of the claims and the specification. As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” “contains” or “containing,” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a composition, a mixture, process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but can include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.
Additionally, the term “exemplary” is used herein to mean “serving as an example, instance or illustration.” Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs. The terms “at least one” and “one or more” may be understood to include any integer number greater than or equal to one, i.e., one, two, three, four, etc. The term “a plurality” may be understood to include any integer number greater than or equal to two, i.e., two, three, four, five, etc. The term “connection” may include both an indirect “connection” and a direct “connection.”
The terms “about,” “substantially,” “approximately,” and variations thereof, are intended to include the degree of error associated with measurement of the particular quantity based upon the equipment available at the time of filing the application. For example, “about” can include a range of ±8% or 5% or 2% of a given value.
For the sake of brevity, conventional techniques related to making and using aspects of the invention may or may not be described in detail herein. In particular, various aspects of computing systems and specific computer programs to implement the various technical features described herein are well known. Accordingly, in the interest of brevity, many conventional implementation details are only mentioned briefly herein or are omitted entirely without providing the well-known system and/or process details.