The present invention is related to a k-cluster residue number system, and more particularly, to a memory-based k-cluster residue number system capable of performing multiplicative scaling, overflow detection, and mixed sign iterative division.
Edge artificial intelligence (AI) computing is an area of rapid growth that integrates neural networks with the Internet of Things (IoT) for computer vision, natural language processing, and self-driving car applications. It quantizes floating-point numbers to fixed-point integers for inference operations. In-memory architecture is one of the important Edge AI computing platforms; it stacks the memory on top of the logic circuits for Memory Centric Neural Computing (MCNC). The data is loaded directly from the stacked memory to the Processing Elements (PEs) for computation, which avoids loading data from external memory and minimizes data transfer. This significantly reduces the latency and speeds up the operations. The performance is further enhanced using a Residue Number System (RNS), which fully utilizes the internal memory to store the data for integer operations.
A Residue Number System (RNS) is a number system that first defines a moduli set and transforms numbers to their integer remainders (also called residues) through modulo division, then performs arithmetic operations (addition and multiplication) on the remainders only. For example, take the moduli set (7, 8, 9) and the numbers 13 and 17. The dynamic range is defined by the product of the moduli, which is 504. The numbers are first transformed to their residues through the modulo operations 13→(6, 5, 4) and 17→(3, 1, 8); addition and multiplication are then performed on the residues only: (6, 5, 4)+(3, 1, 8)=(9, 6, 12)→(2, 6, 3), which is equal to 30, and (6, 5, 4)*(3, 1, 8)=(18, 5, 32)→(4, 5, 5), which is equal to 221. Since the remainder magnitudes are much smaller, only simple logic is required for parallel computation. The drawbacks of RNS are the lack of direct support for sign detection, magnitude comparison, and division: the residues must be converted back to the binary number domain for those operations.
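The residue arithmetic in the example above can be sketched as follows. This is a minimal illustration only; the function names are hypothetical, and the brute-force decoder is an assumption chosen for brevity rather than a conversion method described in this disclosure.

```python
from math import prod

MODULI = (7, 8, 9)             # pairwise coprime moduli from the example
DYNAMIC_RANGE = prod(MODULI)   # 7 * 8 * 9 = 504


def to_rns(x, moduli=MODULI):
    """Transform an integer into its residues, one per modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Add two residue tuples channel by channel."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Multiply two residue tuples channel by channel."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Decode by exhaustive search (illustrative only, small ranges)."""
    for x in range(prod(moduli)):
        if to_rns(x, moduli) == tuple(residues):
            return x
    raise ValueError("residues do not match any integer in the dynamic range")


a, b = to_rns(13), to_rns(17)
print(a, b)                                       # residues of 13 and 17
print(rns_add(a, b), from_rns(rns_add(a, b)))     # sum in both domains
print(rns_mul(a, b), from_rns(rns_mul(a, b)))     # product in both domains
```

Each channel works on operands smaller than its modulus, which is why the channels can be computed independently and in parallel.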
To improve Edge AI computing performance, floating-point to integer quantization is performed first, which converts the trained neural network model to an integer one. This simplifies the design and operations and provides an energy-efficient solution. The k-Cluster Residue Number System (k-RNS) is proposed to enhance neural network inference through parallel distributive computation. It breaks down the integers into their remainders (residues) with different moduli sets, then performs addition, subtraction, and multiplication on the remainders only. The k-RNS resolves the conventional RNS issues of sign detection, magnitude comparison, and division. It also scales the convolution product so that no additional moduli sets are required to increase the dynamic range. It can also detect integer overflow and adjust the summation of convolution products. Finally, an optimal division is proposed to further enhance the k-RNS operations. Therefore, the k-Cluster Residue Number System (k-RNS) is useful for Edge AI computing.
In an embodiment, a k-cluster residue number system comprises a processor and memory coupled to the processor. The processor is configured to generate a modular set composed of P coprime integers, generate a dynamic range by taking a product of the P coprime integers, generate quotient indices for all integers in the dynamic range, generate row indices for all integers in the dynamic range, generate column indices for all integers in the dynamic range, and generate a look-up table according to the quotient indices, row indices, the column indices, and all integers in the dynamic range. P is an integer greater than 2, and the P coprime integers include 2. The memory is configured to store the look-up table.
In another embodiment, a method for generating a k-cluster residue number system comprises generating a modular set composed of P coprime integers, generating a dynamic range by taking a product of the P coprime integers, generating quotient indices for all integers in the dynamic range, generating row indices for all integers in the dynamic range, generating column indices for all integers in the dynamic range, generating a look-up table according to the quotient indices, row indices, the column indices, and all integers in the dynamic range, and storing the look-up table in a memory of the k-cluster residue number system. P is an integer greater than 2, and the P coprime integers include 2. The memory is configured to store the look-up table.
These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.
To represent an n-bit integer and its negative using a k-cluster residue number system (k-RNS), a modular set of p coprime integers is first defined as (m1, . . . , 2, . . . , mp), and a dynamic range is generated according to the product of the modular set (m1, . . . , 2, . . . , mp). When a modular set of 3 coprime integers is chosen to be ($2^{n/2}-1$, 2, $2^{n/2}+1$), the dynamic range is set to [$-(2^{n}-1)$, $2^{n}-2$]. The modular set is not limited to 3 coprime integers; the number of coprime integers in the modular set can be increased to enlarge the dynamic range while keeping the moduli small. In this case, the k-RNS converts each integer in the dynamic range to its row indices and column index, which are formed by remainders through modulo division according to Equations (1) and (2).
ri=I mod mi, when I is a non-negative integer (1)
ri=(M−|I|) mod mi, when I is a negative integer (2)
where
I is an integer in the dynamic range;
M is the number of integers in the dynamic range; and
mi is a coprime integer of the modular set.
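As a hedged sketch, the conversion of Equations (1) and (2) may be written as below for the modular set (3, 2, 5) of the 4-bit example. The function name `krns_encode` is hypothetical. It is assumed here that Equation (2) operates on the magnitude of a negative integer, since that reading reproduces the residue triples given in this description (for instance −14→(1,0,1)).

```python
from math import prod

MODULI = (3, 2, 5)      # (m_{i-1}, m_i, m_{i+1}) for the 4-bit example
M = prod(MODULI)        # 30 integers in the dynamic range [-15, 14]


def krns_encode(value, moduli=MODULI):
    """Return the residue triple of a signed integer per Equations (1) and (2)."""
    total = prod(moduli)
    half = total // 2
    if not -half <= value < half:
        raise ValueError(
            f"{value} is outside the dynamic range [{-half}, {half - 1}]"
        )
    # Assumption: Equation (2) is applied to the magnitude of a negative integer.
    base = value if value >= 0 else total - abs(value)
    return tuple(base % m for m in moduli)


for v in (0, -15, 1, -14, 2, -13):
    print(v, krns_encode(v))
```

The outer two residues act as the row indices and the middle residue (modulo 2) acts as the column index.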
The look-up table 8 may include 9 columns: cluster index, quotient index qi−1 of modulus mi−1 (i.e., a quotient index of modulus 3), row index ri−1 of the modulus mi−1, quotient index qi+1 of modulus mi+1 (i.e., a quotient index of modulus 5), row index ri+1 of the modulus mi+1, positive integer column, column index ri of the positive integer, negative integer column, and column index ri of the negative integer. In this example, since the modular set has 3 coprime integers, each integer has 2 quotient indices, 2 row indices, and a column index. The positive integer column may list positive integers from 0 to 14 in ascending order. The negative integer column may list negative integers from −15 to −1 in ascending order. The integers are grouped according to the modulo behavior of the first row index. The integers 0 to 2, and −15 to −13 may be grouped to cluster 1. The integers 3 to 5, and −12 to −10 may be grouped to cluster 2. The integers 6 to 8, and −9 to −7 may be grouped to cluster 3. The integers 9 to 11, and −6 to −4 may be grouped to cluster 4. The integers 12 to 14, and −3 to −1 may be grouped to cluster 5. This grouping approach is for illustrative purposes only and does not limit the scope of the embodiment.
The processor 20 converts each integer to its remainders over (3,2,5), the coprime integers of the modular set. It converts 0 to (0,0,0) and −15 to (0,1,0); it converts 1 to (1,1,1) and −14 to (1,0,1); and it converts 2 to (2,0,2) and −13 to (2,1,2). The same approach can be applied to the other numbers and is thus not elaborated herein.
Because 0 and −15 have the same row indices (0,0), 0 and −15 are listed in the same row. Their difference is that 0 has a column index of 0, and −15 has a column index of 1. Because 1 and −14 have the same row indices (1,1), 1 and −14 are listed in the same row. Their difference is that 1 has a column index of 1, and −14 has a column index of 0. Because 2 and −13 have the same row indices (2,2), 2 and −13 are listed in the same row. Their difference is that 2 has a column index of 0, and −13 has a column index of 1.
The quotient index qi−1 is the quotient obtained when the integer I is divided by the modulus mi−1, and the quotient index qi+1 is the quotient obtained when the integer is divided by the modulus mi+1. In the embodiment, since the modular set (m1, m2, m3) is chosen as ($2^{n/2}-1$, 2, $2^{n/2}+1$) = ($2^{4/2}-1$, 2, $2^{4/2}+1$) = (3,2,5), the quotient index qi−1 is the quotient when the integer is divided by 3, and the quotient index qi+1 is the quotient when the integer is divided by 5.
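The nine-column look-up table for the modular set (3,2,5) may be generated along the following lines. This is a sketch under stated assumptions: the dictionary keys and the function name are hypothetical, and the quotient indices of each row are assumed to be computed from the positive integer of that row, which this description does not state explicitly.

```python
M_PREV, M_COL, M_NEXT = 3, 2, 5     # (m_{i-1}, m_i, m_{i+1})
M = M_PREV * M_COL * M_NEXT         # 30 integers in the dynamic range
HALF = M_PREV * M_NEXT              # 15 rows, one positive and one negative each


def build_lookup_table():
    """Build one row per positive integer 0..14 and its partner -15..-1."""
    rows = []
    for pos in range(HALF):
        neg = pos - HALF            # shares both row indices with pos
        rows.append({
            "cluster": pos // M_PREV + 1,       # clusters 1..5
            "q_prev": pos // M_PREV,            # quotient index of modulus 3
            "r_prev": pos % M_PREV,             # row index of modulus 3
            "q_next": pos // M_NEXT,            # quotient index of modulus 5
            "r_next": pos % M_NEXT,             # row index of modulus 5
            "positive": pos,
            "col_positive": pos % M_COL,        # column index of the positive
            "negative": neg,
            "col_negative": (M + neg) % M_COL,  # column index of the negative
        })
    return rows


table = build_lookup_table()
for row in table[:3]:
    print(row)
```

In every row the two column indices differ, so the column index alone distinguishes the positive integer from its negative partner.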
For Edge AI computing, the processor 20 converts the floating-point number to a fixed-point integer through quantization. Assuming the quantization is symmetrical, the floating-point number is defined in the range [−α, α] and its fixed-point integer xq is quantized in the range [−αq, αq].
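One common way to realize such a symmetric quantization is sketched below. The scale factor αq/α, the rounding mode, and the clipping step are assumptions made for illustration; this description does not specify them, and the function name is hypothetical.

```python
def quantize_symmetric(x, alpha, alpha_q):
    """Map a float in [-alpha, alpha] to an integer in [-alpha_q, alpha_q].

    Assumed scheme: scale by alpha_q / alpha, round to the nearest integer
    (Python's round, which rounds exact halves to the even neighbor), then
    clip values that fall outside the integer range.
    """
    if alpha <= 0 or alpha_q <= 0:
        raise ValueError("alpha and alpha_q must be positive")
    scale = alpha_q / alpha
    x_q = round(x * scale)
    return max(-alpha_q, min(alpha_q, x_q))


# Example with a 4-bit style integer range [-14, 14].
print(quantize_symmetric(0.5, alpha=1.0, alpha_q=14))
print(quantize_symmetric(-1.0, alpha=1.0, alpha_q=14))
```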
To avoid the integer overflow, multiplicative scaling is used to scale down the convolution product. It first represents two integers w and x in terms of the moduli set shown in
The multiplication scaling circuit 22 of processor 20 is illustrated in
according to the quotient index qi−1 and the row index ri+1. The second calculating unit 108 is configured to output a value of
according to the quotient index qi+1 and the row index ri−1. The multiplier 110 has a first input coupled to an output of the first quotient unit for receiving the quotient index qi−1, a second input coupled to an output of the second quotient unit for receiving the quotient index qi+1, and an output for outputting a product of the quotient index qi−1 and the quotient index qi+1. The rounding unit 111 has a first input coupled to an output of the first calculating unit 106 for receiving the value of
a second input coupled to an output of the second calculating unit 108 for receiving the value of
and an output for outputting the value of
The adder 112 has a first input coupled to the output of the rounding unit 111 for receiving the value of
a second input coupled to an output of the multiplier 110 for receiving the product of the quotient index qi−1 and the quotient index qi+1, and an output for outputting a sum of the value of
and the product of the quotient index qi−1 and the quotient index qi+1. This approach is not only applied for the scaling, the factor
is used to record the multiplication overflow. The multiplication scaling circuit 22 may perform multiplication overflow correction according to the value of the factor
If the factor
is odd, the residue ri should be interchanged 0<->1; otherwise, the residue ri is unchanged if the factor
is even.
To illustrate the multiplication scaling, two integers 13 and 11 are multiplied by each other and divided by the scaling factor 15 to generate a result as
With the multiplication scaling, 13 and 11 are represented as 13=(4×3+1) and 11=(2×5+1), then the processor 20 divides the product with the scaling factor,
The rounding operations can be realized using the following k-RNS multiplicative scaling rounding look-up Table 1 and Table 2. Similarly, negative multiplication scaling first converts the integer to a positive one and performs the multiplication scaling; the result is then adjusted through a sign change.
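The decomposition used in the 13 × 11 example can be checked with exact fractions. The sketch below only expands the product of 13=(4×3+1) and 11=(2×5+1) divided by the scaling factor 15 into its algebraic terms; it does not reproduce the rounding look-up tables, and the function and key names are hypothetical.

```python
from fractions import Fraction

M_PREV, M_NEXT = 3, 5    # moduli m_{i-1} and m_{i+1}; scaling factor is 15


def scaled_product_terms(w, x):
    """Expand (w * x) / (M_PREV * M_NEXT) for non-negative w and x.

    w is written as q_prev * M_PREV + r_prev and x as q_next * M_NEXT + r_next,
    so the scaled product splits into four exact terms.
    """
    q_prev, r_prev = divmod(w, M_PREV)
    q_next, r_next = divmod(x, M_NEXT)
    return {
        "q_prev*q_next": Fraction(q_prev * q_next),
        "q_prev*r_next/m_next": Fraction(q_prev * r_next, M_NEXT),
        "q_next*r_prev/m_prev": Fraction(q_next * r_prev, M_PREV),
        "r_prev*r_next/(m_prev*m_next)": Fraction(r_prev * r_next, M_PREV * M_NEXT),
    }


terms = scaled_product_terms(13, 11)
total = sum(terms.values())
for name, value in terms.items():
    print(name, value)
print("total", total, float(total))
```

The integer part comes from the product of the two quotient indices, while the fractional terms are the quantities that a rounding stage would have to resolve.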
The k-RNS 10 can also detect the integer overflow caused by the summation of the convolution products. It utilizes the k-RNS periodic behavior to detect the overflow, which only occurs when both integers have the same sign (the augend and addend are either both positive or both negative). With the dynamic range [$-(2^{n}-1)$, $2^{n}-2$], the integer overflow can be corrected by switching the residue ri from 0 to 1 or from 1 to 0. Assume two positive integers 11→(2,1,1) and 14→(2,0,4) are added together; the result becomes (1,1,0)→−5. The sign of the sum differs from the sign of the augend and addend, which shows that the integer overflows. The result is corrected to (1,0,0)→10, which is consistent with the calculation 11+14=25=10+15 with a range [0,14]. Similarly, two negative integers −11→(1,1,4) and −14→(1,0,1) generate a sum (2,1,0)→5 with a positive sign, and the sum is adjusted to (2,0,0)→−10, which is consistent with the calculation −11−14=−25=−15−10 with a range [−15,−1].
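The overflow detection and correction described above may be sketched as follows for the modular set (3,2,5). The dictionary `DECODE` is a hypothetical stand-in for the look-up table stored in memory, and the function names are illustrative only.

```python
MODULI = (3, 2, 5)
M = 30                   # size of the dynamic range [-15, 14]


def encode(value):
    """Residue triple of a signed integer in [-15, 14]."""
    return tuple((value % M) % m for m in MODULI)


# Stand-in for the stored look-up table: residue triple -> signed integer.
DECODE = {encode(v): v for v in range(-M // 2, M // 2)}


def add_with_overflow_correction(a, b):
    """Add in the residue domain, then flip the column index on overflow."""
    total = tuple((x + y) % m for x, y, m in zip(encode(a), encode(b), MODULI))
    same_sign = (a < 0) == (b < 0)
    # Overflow can only occur for same-sign operands whose sum changes sign.
    overflow = same_sign and ((DECODE[total] < 0) != (a < 0))
    if overflow:
        total = (total[0], total[1] ^ 1, total[2])   # switch r_i between 0 and 1
    return total, DECODE[total], overflow


print(add_with_overflow_correction(11, 14))
print(add_with_overflow_correction(-11, -14))
```

Flipping the column index moves the result by 15, which is why the corrected values match 25=10+15 and −25=−15−10.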
The overflow detection circuit 24 of processor 20 is illustrated in
For the k-RNS division, processor 20 first constructs the following quotient factor lookup table 3, which is defined by the minimum value in the dividend cluster and the maximum value in the divisor cluster.
With X0=X and Q0=0 assigned, the division circuit 26 of the processor 20 performs the iterative subtraction:
Division Q=X/Y (17)
Initialize dividend X0=X (18)
Initialize quotient Q0=0 (19)
Iterative subtraction Xi+1=Xi−qiY (20)
where
X is the dividend;
Y is the divisor;
X0 is the initialized dividend;
Q0 is the initialized quotient;
Xi is a temporary dividend during the iterative division;
qi is the quotient factor obtained from Table 3 for the current iteration; and
Xi+1 is an updated dividend.
To support signed division, the processor 20 first determines the signs of the dividend X and the divisor Y, then converts the mixed sign division into a positive one and performs the iterative division. Finally, it converts the quotient and the remainder according to the following k-RNS Quotient/Remainder Conversion Table 4, using the signs of the dividend X and the divisor Y, to simplify the design.
The division circuit 26 of the processor 20 is illustrated in
To illustrate the iterative division using iterative subtraction, assume the dividend X is 14→(2,0,4) and the divisor Y is 2→(2,0,2). X0 is set to (2,0,4) (equation 18) and Q0 is initialized to zero (0,0,0) (equation 19). Based on the dividend cluster index #5 and the divisor cluster index #1, the quotient factor q0 is set to 6→(0,0,1) using Table 3. X′=(2,0,4)−(0,0,1)×(2,0,2)=(2,0,2) (equation 20). Since the result (2,0,2) is positive, both X1 and Q1 are updated, where X1=X′=(2,0,2) and Q1=(0,0,0)+(0,0,1)=(0,0,1). The iteration continues: the cluster index of X1 is updated to #1 and q1 is set to 1→(1,1,1), then X′=(2,0,2)−(1,1,1)×(2,0,2)=(0,0,0). The result is zero and the iteration is terminated. The final quotient is updated to Q2=(0,0,1)+(1,1,1)=(1,1,2)→7 and the remainder is X2=X′=(0,0,0)→0. The result is consistent with the calculation 14/2=7 with zero remainder.
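The iterative subtraction can be sketched in plain integers as below; the residue triples are omitted for brevity. Because Table 3 is not reproduced here, the quotient factor is assumed to be the floor of the dividend cluster minimum divided by the divisor cluster maximum, with a lower bound of 1. That assumption reproduces q0=6 and q1=1 of the example but is not confirmed by this description. All names are hypothetical, and only a non-negative dividend with a positive divisor is handled.

```python
CLUSTER_SIZE = 3     # clusters follow modulus 3: {0..2}, {3..5}, ..., {12..14}
MAX_VALUE = 14       # largest non-negative integer in the dynamic range


def cluster_min(v):
    """Smallest value in the cluster that contains v."""
    return (v // CLUSTER_SIZE) * CLUSTER_SIZE


def cluster_max(v):
    """Largest value in the cluster that contains v."""
    return cluster_min(v) + CLUSTER_SIZE - 1


def quotient_factor(x, y):
    """Assumed Table 3 entry: dividend cluster minimum over divisor cluster maximum."""
    return max(1, cluster_min(x) // cluster_max(y))


def iterative_divide(x, y):
    """Return (quotient, remainder, iterations) for 0 <= x <= 14, 1 <= y <= 14."""
    if not (0 <= x <= MAX_VALUE and 1 <= y <= MAX_VALUE):
        raise ValueError("operands outside the supported range")
    quotient, iterations = 0, 0
    while x != 0:
        iterations += 1
        q = quotient_factor(x, y)
        trial = x - q * y            # X' = X_i - q_i * Y
        if trial < 0:                # negative result: keep X_i as the remainder
            break
        x = trial
        quotient += q                # Q_{i+1} = Q_i + q_i
    return quotient, x, iterations


print(iterative_divide(14, 2))
print(iterative_divide(13, 4))
```

Under this assumed table, dividing 14 by 2 finishes in two iterations, compared with seven single subtractions of the divisor.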
For negative division, the dividend X is set to −14→(1,0,1) and the divisor Y is kept at 2→(2,0,2). The processor 20 converts the dividend X into a positive integer and performs the iterative division, which gives the quotient Q=(1,1,2)→7 and the remainder R=(0,0,0)→0. Based on Table 4, the quotient is changed to −7 and the remainder is set to zero, which matches the calculation −14/2=−7. Compared with conventional RNS division, the k-RNS division of the present invention offers a better solution: it not only supports mixed sign integer division with the same logic implementation but also reduces the number of iterations in this example from 7 to 2. It simplifies the overall logic design and significantly speeds up the operations.
The k-RNS 10 of the present invention may perform multiplicative scaling to eliminate additional moduli set for overflow protection and simplify the scaling using the lookup table approach. The k-RNS 10 may also detect integer overflow to correct the results after overflow and record the overflow cycles for computation (i.e., scaling, normalization, etc.). The k-RNS 10 may perform mixed sign iterative division to reuse the positive iterative division to simplify mixed sign division and correct the signs of quotient and remainder after division.
Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.