This application claims priority of China application No. 202210608229.8, filed on May 31, 2022, which is incorporated by reference in its entirety.
The present application relates to a calculator and particularly to a calculator capable of accelerating the number-theoretic transformation.
Since artificial intelligence (AI) models, such as the neural network models, can analyze huge amounts of data and extract meaningful information from it, they can be useful for many kinds of industries. However, AI models often require large amounts of expensive computing hardware resources that not every company or research institute can afford; therefore, in order to allow more industries to benefit from the data analysis capabilities of AI, some server providers have started to provide remote computing services. In other words, users can upload the data they want to calculate or analyze to the cloud, and the server providers can provide the service of computing data remotely, and then eventually transmit the calculation results back to the users.
However, the data provided by the user may be confidential and therefore such a service may have security issues. Homomorphic encryption has been introduced to improve the security of data during such services. The homomorphic encryption allows the provider of computing services to perform a specific form of algebraic operation on the encrypted ciphertext, and the encrypted data obtained from the algebraic operation, when decrypted, may be the same as the result of the same algebraic operation on the plaintext data. In other words, the computing service provider can directly use the ciphertext to perform a specific form of computation, such as linear computation, without knowing the contents of the plaintext data, thus improving the security of the service. However, the format of the ciphertext generated by homomorphic encryption is polynomial, so the computation of the ciphertext often involves polynomial multiplication with high complexity, which requires more time or hardware resources for the computing service provider to complete the computation. Therefore, how to improve the computational performance of homomorphic encryption has become an urgent issue in the related field.
One purpose of the present disclosure is to disclose a calculator and an associated calculation method to address the foregoing issues.
One embodiment of the present disclosure discloses a calculator, configured to perform number-theoretic transformation on a 2^N-dimensional polynomial, wherein N is an integer greater than 1. The calculator includes a first coefficient memory, a second coefficient memory, a twiddle factor memory, 2^M processing units and a data flow controller. The first coefficient memory is configured to store 2^N coefficients of the 2^N-dimensional polynomial in an initial period. The twiddle factor memory is configured to store (2^N−1) twiddle factors. The 2^M processing units are configured to perform N coefficient computation operations in parallel, wherein M is an integer greater than 1 and smaller than N. The data flow controller is configured to control the addresses at which the 2^M processing units access the first coefficient memory, the second coefficient memory and the twiddle factor memory. In each odd-number round of coefficient computation operation, the 2^M processing units perform 2^(N−M−1) rounds of first calculation procedures to read 2^N first coefficients from the first coefficient memory, read at least one first twiddle factor from the twiddle factor memory and perform modulo calculation, and the 2^M processing units perform 2^(N−M−1) rounds of first writing procedures to write 2^N first output coefficients generated during computation to the second coefficient memory. In each even-number round of coefficient computation operation, the 2^M processing units perform 2^(N−M−1) rounds of second calculation procedures to read 2^N second coefficients from the second coefficient memory, read at least one second twiddle factor from the twiddle factor memory and perform modulo calculation, and the 2^M processing units perform 2^(N−M−1) rounds of second writing procedures to write 2^N second output coefficients generated during computation to the first coefficient memory.
Another embodiment of the present disclosure discloses a calculation method. The method includes, in an initial period, storing 2^N coefficients of a 2^N-dimensional polynomial to a first coefficient memory, and storing (2^N−1) twiddle factors corresponding to the 2^N-dimensional polynomial to a twiddle factor memory; and in a computation period, using 2^M processing units to perform N coefficient computation operations in parallel, including: in each odd-number round of coefficient computation operation, allowing the 2^M processing units to perform 2^(N−M−1) rounds of first calculation procedures to read 2^N first coefficients from the first coefficient memory, read at least one first twiddle factor from the twiddle factor memory and perform modulo calculation, and allowing the 2^M processing units to perform 2^(N−M−1) rounds of first writing procedures to write 2^N first output coefficients generated during computation to the second coefficient memory; and in each even-number round of coefficient computation operation, allowing the 2^M processing units to perform 2^(N−M−1) rounds of second calculation procedures to read 2^N second coefficients from the second coefficient memory, read at least one second twiddle factor from the twiddle factor memory and perform modulo calculation, and allowing the 2^M processing units to perform 2^(N−M−1) rounds of second writing procedures to write 2^N second output coefficients generated during computation to the first coefficient memory. In such case, N is an integer greater than 1, and M is an integer greater than 1 and smaller than N.
In view of the foregoing, the calculator and calculation method of the present disclosure can perform modulo calculations of number-theoretic transformation using multiple processing units in parallel, and can access the data in two coefficient memories according to a specific order, thereby simplifying the wirings between the processing units and coefficient memories and improving the overall computation performance thereof.
The following disclosure provides various different embodiments or examples for implementing different features of the present disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. For example, the formation of a first feature over or on a second feature in the description that follows may include embodiments in which the first and second features are formed in direct contact and may also include embodiments in which additional features may be formed between the first and second features, such that the first and second features may not be in direct contact. In addition, the present disclosure may repeat reference numerals and/or letters in the various embodiments. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as possible. Any numerical value, however, inherently contains certain errors necessarily resulting from the standard deviation found in the respective testing measurements. Also, as used herein, the term “about” generally means within 10%, 5%, 1%, or 0.5% of a given value or range. Alternatively, the term “generally” means within an acceptable standard error of the mean when considered by one of ordinary skill in the art. As could be appreciated, other than in the operating/working examples, or unless otherwise expressly specified, all of the numerical ranges, amounts, values, and percentages (such as those for quantities of materials, duration of times, temperatures, operating conditions, portions of amounts, and the likes) disclosed herein should be understood as modified in all instances by the term “generally.” Accordingly, unless indicated to the contrary, the numerical parameters set forth in the present disclosure and attached claims are approximations that can vary as desired. At the very least, each numerical parameter should at least be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Here, ranges can be expressed herein as from one endpoint to another endpoint or between two endpoints. All ranges disclosed herein are inclusive of the endpoints, unless specified otherwise.
The first coefficient memory 110 can store 2^N coefficients P[0] to P[2^N−1] of the 2^N-dimensional polynomial P1 in an initial period, and can store the data outputted by the processing units 140 during computation. The second coefficient memory 120 can store the data outputted by the processing units 140 during computation, and the twiddle factor memory 130 can store (2^N−1) twiddle factors ω[1] to ω[2^N−1] required for performing the number-theoretic transformation on the polynomial P1. Generally, the twiddle factors ω[1] to ω[2^N−1] can be calculated in advance according to the algorithm of the number-theoretic transformation.
Further, the 2^M processing units 140 can perform the modulo calculation required by the number-theoretic transformation according to the coefficients stored in the first coefficient memory 110 or the second coefficient memory 120 and the twiddle factors stored in the twiddle factor memory 130, and the data flow controller 150 can control the addresses at which the 2^M processing units 140 access the first coefficient memory 110, the second coefficient memory 120 and the twiddle factor memory 130, so as to ensure that the 2^M processing units 140 can obtain the correct coefficients for performing the computation.
In the present embodiment, the calculator 100 can perform the computation of the number-theoretic transformation using an iterative approach, such as using the algorithm proposed by Cooley and Tukey.
For example, in the first coefficient computation operation, the twiddle factor ω[1] may be adopted to perform 2^(N−1) rounds of modulo calculations; in the second coefficient computation operation, the twiddle factor ω[2] may be adopted to perform 2^(N−2) rounds of modulo calculations and the twiddle factor ω[3] may be adopted to perform 2^(N−2) rounds of modulo calculations, and so forth. In such case, in each round of coefficient computation operation, the calculator 100 may perform modulo calculations on 2^N input coefficients according to the corresponding twiddle factors and generate 2^N output coefficients.
In the present embodiment, the 2^M processing units 140 can perform the modulo calculations of the N rounds of coefficient computation operations in parallel; such as the content calculated in the third layer of the for-loop of
Since the total number of coefficients read and generated in each coefficient computation operation is fixed (i.e. the total number is 2^N), in the present embodiment, the first coefficient memory 110 and the second coefficient memory 120 can each have sufficient space for storing 2^N coefficients, and the data flow controller 150 can alternately allow the processing units 140 to read the coefficients from one of the first coefficient memory 110 and the second coefficient memory 120, and write the calculation results to the other of the first coefficient memory 110 and the second coefficient memory 120. For example, in the first round of coefficient computation operation, the data flow controller 150 can control the processing units 140 to read the coefficients P[0] to P[2^N−1] of the 2^N-dimensional polynomial P1 from the first coefficient memory 110, and after performing the computation, control the processing units 140 to store the computation results in the second coefficient memory 120. Next, in the second round of coefficient computation operation, the data flow controller 150 can control the processing units 140 to read the coefficients obtained by the previous calculation from the second coefficient memory 120, and after performing the computation, control the processing units 140 to store the computation results in the first coefficient memory 110 for use in the next round of coefficient computation operation.
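As an illustrative software sketch (not the hardware of the present disclosure), the iterative scheme described above can be written as a three-layer loop. The function names `bit_reverse`, `bit_reversed_twiddles` and `ntt_iterative` are hypothetical, and the table convention used here, in which ω[i] is a power of a primitive 2n-th root of unity `psi` stored in bit-reversed order, is an assumption borrowed from common software implementations; the disclosure itself only states that the twiddle factors are calculated in advance.

```python
def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def bit_reversed_twiddles(n, psi, q):
    """Assumed table layout: entry i holds psi^(bit_reverse(i)) mod q.

    Entry 0 is unused by the transform, which leaves n - 1 twiddle factors
    with indices 1 to n - 1.
    """
    bits = n.bit_length() - 1
    return [pow(psi, bit_reverse(i, bits), q) for i in range(n)]


def ntt_iterative(coeffs, twiddles, q):
    """Iterative Cooley-Tukey style transform on n = 2^N coefficients."""
    a = list(coeffs)
    n = len(a)
    t = n          # distance between the two coefficients of a butterfly, doubled
    m = 1          # number of twiddle factors used in the current round
    while m < n:                       # first layer: one pass per round (N rounds)
        t //= 2
        for i in range(m):             # second layer: one twiddle factor per group
            w = twiddles[m + i]
            start = 2 * i * t
            for j in range(start, start + t):   # third layer: the butterflies
                u = a[j]
                v = (a[j + t] * w) % q
                a[j] = (u + v) % q
                a[j + t] = (u - v) % q
        m *= 2
    return a
```

With n = 8, the first round of this sketch uses only ω[1] for four butterflies, and the second round uses ω[2] and ω[3] for two butterflies each, which mirrors the 2^(N−1) and 2^(N−2) counts given above.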
In other words, in odd-number rounds of coefficient computation operations, the data flow controller 150 can control the processing unit 140 to read the coefficients from the first coefficient memory 110 to perform computation, and write the computation results to the second coefficient memory 120; whereas in even-number rounds of coefficient computation operations, the data flow controller 150 can control the processing unit 140 to read the coefficients from the second coefficient memory 120 to perform computation, and write the computation results to the first coefficient memory 110.
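The alternation between the two coefficient memories can be sketched in software as follows. This is only a minimal model under stated assumptions: `run_rounds` and the `stage_fn` callback are hypothetical names, and the per-round computation is left abstract so that only the read/write direction is shown.

```python
def run_rounds(coeffs, stage_fn, num_rounds):
    """Model two coefficient memories used in a ping-pong fashion.

    stage_fn(round_no, src, dst) is a hypothetical callback that reads every
    coefficient from `src` and writes every output coefficient into `dst`.
    """
    first_memory = list(coeffs)            # holds the input in the initial period
    second_memory = [0] * len(coeffs)
    for round_no in range(1, num_rounds + 1):
        if round_no % 2 == 1:
            # Odd-number round: read the first memory, write the second.
            stage_fn(round_no, first_memory, second_memory)
        else:
            # Even-number round: read the second memory, write the first.
            stage_fn(round_no, second_memory, first_memory)
    # The final result sits in whichever memory was written last.
    return second_memory if num_rounds % 2 == 1 else first_memory
```

Because each round only reads one memory and only writes the other, no coefficient is overwritten before it has been consumed within the same round.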
Further, since the algorithm of number-theoretic transformation is fixed, when performing the number-theoretic transformation on different polynomials, the order in which the coefficients may be accessed in each round should also be fixed and known. In such case, by properly arranging the order of reading and writing of coefficients, it is possible to read coefficients from the first coefficient memory 110 and write the output coefficients to the second coefficient memory 120 according to the same addresses for each odd-number round of coefficient computation operation. Similarly, it is possible to read coefficients from the second coefficient memory 120 and write the output coefficients to the first coefficient memory 110 according to the same addresses for each even-number round of coefficient computation operation. In this way, the access operations of the 2^M processing units 140 on the first coefficient memory 110 and the second coefficient memory 120 can be simplified, thereby simplifying the operation of the calculator 100 and improving the performance of the 2^M processing units 140 when performing parallel computations.
In the present embodiment, Step S210 and Step S220 can be performed in an initial period before the computation operation is executed. In Step S210, the 2^N coefficients P[0] to P[2^N−1] of the 2^N-dimensional polynomial P1 can be stored in the first coefficient memory 110, and in Step S220, the (2^N−1) twiddle factors ω[1] to ω[2^N−1] corresponding to the 2^N-dimensional polynomial P1 can be stored in the twiddle factor memory 130.
During the computation, the calculator 100 may use the 2^M processing units 140 to perform Steps S240 to S280 to complete the N rounds of coefficient computation operations, and then proceed to Step S290 to complete the computation after said N rounds of coefficient computation operations.
In Step S240, the calculator 100 can first determine whether the coefficient computation operation currently being performed is an odd-number round (such as the first round, the third round or the fifth round) or an even-number round (such as the second round, the fourth round or the sixth round). In Step S240, when the calculator 100 determines that the coefficient computation operation currently being performed is an odd-number round, it can then perform Step S250 and Step S260, whereas when the calculator 100 determines that the coefficient computation operation currently being performed is an even-number round, it can then perform Step S270 and Step S280.
As shown in
According to the algorithm of number-theoretic transformation, the first calculation procedures in the odd-number rounds of coefficient computation operations and the second calculation procedures in the even-number rounds of coefficient computation operations may include substantially the same operations, differing only in the coefficients and twiddle factors that they read. Further, in each round, when performing the first calculation procedure of the odd-number round of coefficient computation operation or the second calculation procedure of the even-number round of coefficient computation operation, each processing unit 140 performs the modulo calculations according to two coefficients and one twiddle factor to generate two output coefficients. In such case, to allow the 2^M processing units to effectively perform computations in parallel, the first coefficient memory 110 can include 2^(M+1) first coefficient storage blocks, and each first coefficient storage block can store 2^(N−M−1) coefficients. Similarly, the second coefficient memory 120 can also include 2^(M+1) second coefficient storage blocks, and each second coefficient storage block can store 2^(N−M−1) coefficients. In this way, when performing each round of first calculation procedure or second calculation procedure, each processing unit 140 can read the required coefficients from two corresponding first coefficient storage blocks or two corresponding second coefficient storage blocks.
In such case, in each round of first calculation procedure of Step S250, each of the processing units 1401 to 1404 may respectively read one first coefficient from each of two first coefficient storage blocks of the first coefficient storage blocks 1121 to 1128, and in each round of second calculation procedure of Step S270, each of the processing units 1401 to 1404 may respectively read one second coefficient from each of two second coefficient storage blocks of the second coefficient storage blocks 1221 to 1228.
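The sizing in the preceding paragraph follows from simple arithmetic: each processing unit consumes two coefficients per calculation procedure, so 2^M units need 2^(M+1) independently readable blocks, and 2^N coefficients spread over those blocks leave 2^(N−M−1) coefficients per block. The sketch below checks this; the function name `coefficient_memory_layout` and the dictionary keys are hypothetical.

```python
def coefficient_memory_layout(n_log2, m_log2):
    """Derive the coefficient memory organisation for 2^N coefficients and 2^M units."""
    if not (1 < m_log2 < n_log2):
        raise ValueError("M must be greater than 1 and smaller than N")
    num_coefficients = 1 << n_log2
    num_units = 1 << m_log2
    # Two coefficients per unit per procedure, hence two blocks per unit.
    num_blocks = 2 * num_units
    words_per_block = num_coefficients // num_blocks
    # Each procedure handles 2 * num_units coefficients in parallel.
    procedures_per_round = num_coefficients // (2 * num_units)
    return {
        "units": num_units,
        "blocks": num_blocks,
        "words_per_block": words_per_block,
        "procedures_per_round": procedures_per_round,
    }
```

For N = 5 and M = 2 this gives 4 processing units, 8 storage blocks of 4 coefficients each, and 4 calculation procedures per round, matching the blocks 1121 to 1128 and processing units 1401 to 1404 referred to in this description.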
Further, in each first writing procedure of Step S260, each of the processing units 1401 to 1404 may also write two first output coefficients generated during the computation to two second coefficient storage blocks of second coefficient storage blocks 1221 to 1228, and in each second writing procedure of Step S280, each of the processing units 1401 to 1404 may then write two second output coefficients generated during the computation to two first coefficient storage blocks of the first coefficient storage blocks 1121 to 1128.
In some embodiments, in order to simplify the access operations of the processing units 1401 to 1404 on the first coefficient memory 110 and the second coefficient memory 120, the calculator 100 can store 32 coefficients P[0] to P[31] of the polynomial P1 and output coefficients generated by each round of coefficient computation operation according to a specific order. That is, in each round of first calculation procedure of Step S250, each of the processing units 1401 to 1404 may read two first coefficients from two corresponding first coefficient storage blocks according to the same address, and in each first writing procedure of Step S260, each of the processing units 1401 to 1404 can write two first output coefficients to two second coefficient storage blocks according to the same address. Similarly, in each round of second calculation procedure of Step S270, each of the processing units 1401 to 1404 may read two second coefficients from two corresponding second coefficient storage blocks according to the same address, and in each second writing procedure of Step S280, each of the processing units 1401 to 1404 can also write two second output coefficients to two first coefficient storage blocks according to the same address.
As shown in
As shown in
As shown in
As shown in
In some embodiments, in the first to fourth rounds of second calculation procedures in Step S270, the processing units 1401 to 1404 can also read corresponding coefficients from the second coefficient storage blocks 1221 to 1228 according to the addresses and orders shown in
As shown in
As shown in
As shown in
As shown in
In some embodiments, in the first to fourth rounds of second writing procedures in Step S280, the processing units 1401 to 1404 can also write two second output coefficients in the first coefficient storage blocks 1121 to 1128 according to the addresses and orders shown in
Further, as shown in
Further, as shown in
In the present embodiment, in addition to accessing the first coefficient memory 110 and the second coefficient memory 120 according to a specific order to simplify the wiring connections between the 2^M processing units 140 and the first coefficient memory 110 and the second coefficient memory 120 and to reduce the operational complexity, the calculator 100 can also store the twiddle factors ω[1] to ω[2^N−1] required for the number-theoretic transformation algorithm according to a specific order.
According to the number-theoretic transformation algorithm of
Further, when the number of twiddle factors required for a specific round of coefficient computation operation is not greater than the total number of rounds (that is, 2^(N−M−1) rounds) of calculation procedures to be performed in each coefficient computation operation, the 2^M processing units 140 may use one single twiddle factor in each round of first calculation procedure or second calculation procedure, whereas different twiddle factors can be used in different rounds of first calculation procedures or second calculation procedures. In such case, the 2^M processing units 140 can still read the corresponding twiddle factors from the same twiddle factor storage block.
However, when the number of twiddle factors required for one coefficient computation operation is greater than the total number of rounds (that is, 2^(N−M−1) rounds) of calculation procedures to be performed in each coefficient computation operation, different processing units 140 may simultaneously use different twiddle factors in each calculation procedure to perform modulo calculations so as to maintain the performance of parallel computation; in such case, because each twiddle factor storage block has only one read/write terminal, the 2^M processing units 140 must read the required twiddle factors from different twiddle factor storage blocks.
In such case, to maintain the performance of the parallel computation of the processing units 140 and the hardware usage rate of the twiddle factor memory 130, the twiddle factor memory 130 can have 2^M twiddle factor storage blocks, which include a first twiddle factor storage block for storing (2^(N−M)−1+2^(N−M−1)×M) twiddle factors, and, for each i, 2^(M−i) i-th twiddle factor storage blocks each for storing (2^(N−M−1)×i) twiddle factors, wherein i is an integer between 1 and M. In this way, in the first (N−M) rounds of coefficient computation operations of the N rounds of coefficient computation operations, the 2^M processing units 140 can read at least one twiddle factor from the first twiddle factor storage block, whereas in the k-th from last round of the N rounds of coefficient computation operations, the 2^M processing units 140 can read 2^(N−k) twiddle factors from 2^(M−k+1) twiddle factor storage blocks of the 2^M twiddle factor storage blocks, wherein k is an integer between 1 and M.
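The block capacities stated above can be checked numerically. The following sketch lists the capacity of each of the 2^M twiddle factor storage blocks and confirms that they add up to the 2^N−1 twiddle factors to be stored. The function name `twiddle_block_capacities` is hypothetical, and the ordering of the returned list (largest block first) is an assumption chosen for readability.

```python
def twiddle_block_capacities(n_log2, m_log2):
    """Capacity, in twiddle factors, of each of the 2^M storage blocks."""
    per_round = 1 << (n_log2 - m_log2 - 1)   # calculation procedures per round
    # First block: every twiddle factor of the first N - M rounds,
    # plus one group of `per_round` factors for each of the last M rounds.
    capacities = [(1 << (n_log2 - m_log2)) - 1 + per_round * m_log2]
    # For each i from M down to 1: 2^(M - i) blocks of per_round * i factors.
    for i in range(m_log2, 0, -1):
        capacities.extend([per_round * i] * (1 << (m_log2 - i)))
    return capacities
```

For N = 5 and M = 2 the capacities are 15, 8, 4 and 4, which sum to 31. No block is larger than the set of twiddle factors it actually serves, so no capacity is left unused.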
In such case, as shown in
In the second coefficient computation operation SB, since only two twiddle factors are required, in the first calculation procedures PD1B and PD2B of the second coefficient computation operation SB, the processing units 1401 to 1404 can read the twiddle factor ω[2] from the twiddle factor storage block 1321, and in the first calculation procedures PD3B and PD4B, the processing units 1401 to 1404 can read the twiddle factor ω[3] from the twiddle factor storage block 1321. Similarly, in the four rounds of first calculation procedures PD1C, PD2C, PD3C and PD4C of the third coefficient computation operation SC, the processing units 1401 to 1404 can sequentially read the twiddle factors ω[4], ω[5], ω[6] and ω[7] from the twiddle factor storage block 1321.
Next, in the fourth coefficient computation operation SD, the number of required twiddle factors exceeds the total number 2^(N−M−1) of calculation procedures to be performed in each coefficient computation operation; that is, the number of required twiddle factors is greater than 4. Therefore, in the four rounds of first calculation procedures PD1D, PD2D, PD3D and PD4D of the fourth coefficient computation operation SD, the processing units 1401 and 1403 can sequentially read the twiddle factors ω[8], ω[9], ω[10] and ω[11] from the twiddle factor storage block 1321, whereas the processing units 1402 and 1404 can sequentially read the twiddle factors ω[12], ω[13], ω[14] and ω[15] from the twiddle factor storage block 1322.
Lastly, in the four rounds of first calculation procedures PD1E, PD2E, PD3E and PD4E of the fifth coefficient computation operation SE, the processing unit 1401 can sequentially read twiddle factors ω[16], ω[17], ω[18] and ω[19] from the twiddle factor storage block 1321, the processing unit 1402 can sequentially read twiddle factors ω[20], ω[21], ω[22] and ω[23] from the twiddle factor storage block 1322, the processing unit 1403 can sequentially read twiddle factors ω[24], ω[25], ω[26] and ω[27] from the twiddle factor storage block 1323, whereas the processing unit 1404 can sequentially read twiddle factors ω[28], ω[29], ω[30] and ω[31] from the twiddle factor storage block 1324.
As a result, the parallel computation performance of the processing units 140 can be maintained without unnecessarily increasing the capacity of the twiddle factor memory 130.
Further, in each round of calculation procedure in Steps S250 and S270, each of the processing units 1401 to 1404 may perform the calculation in the third layer of the for-loop as shown in
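The read pattern of the worked example (N = 5, M = 2) can be reproduced by a small indexing rule. The rule below is a reconstruction inferred from that example only and is not stated in this description; `twiddle_source` is a hypothetical name, block 0 stands for the twiddle factor storage block 1321, and unit 0 stands for the processing unit 1401.

```python
def twiddle_source(round_no, procedure, unit, n_log2, m_log2):
    """Return (block, twiddle index) read by one unit in one calculation procedure.

    round_no is 1-based; procedure and unit are 0-based.
    """
    procedures = 1 << (n_log2 - m_log2 - 1)  # calculation procedures per round
    needed = 1 << (round_no - 1)             # twiddle factors used in this round
    base = needed                            # lowest twiddle index of this round
    if needed <= procedures:
        # Every unit shares one factor per procedure, all from the first block.
        return 0, base + (procedure * needed) // procedures
    # Otherwise the factors are split into groups, one group per block.
    groups = needed // procedures
    block = unit % groups
    return block, base + block * procedures + procedure
```

In the fourth round of the example this rule sends units 1401 and 1403 to block 1321 for ω[8] to ω[11] and units 1402 and 1404 to block 1322 for ω[12] to ω[15], so no block is read by two units that need different factors in the same procedure.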
As shown in
For example, in the first calculation procedure of the first round of coefficient computation operation, the processing unit 1401 can read two first coefficients P[0] and P[16] from the first coefficient storage blocks 1121 and 1125 and can read the corresponding first twiddle factor ω[1] from the twiddle factor memory 130. The modular multiplication unit 142 can perform a modular multiplication calculation on the first coefficient P[16] and the first twiddle factor ω[1] according to a predetermined modulus q to generate a first value V. Next, the modular addition unit 144 can perform a modular addition calculation on the first coefficient P[0] and the first value V according to the predetermined modulus q to generate a first to-be-arranged coefficient P′[16], whereas the modular subtraction unit 146 can perform a modular subtraction calculation on the first coefficient P[0] and the first value V according to the predetermined modulus q to generate a second to-be-arranged coefficient P′[0].
In the present embodiment, the processing unit 1401 may not directly perform the first writing procedure after generating the to-be-arranged coefficients P′[0] and P′[16] and directly write the to-be-arranged coefficients P′[0] and P′[16] in the second coefficient memory 120, in order to maintain the order of storing each coefficient in the second coefficient memory 120 so that in the calculation procedures of the second round of coefficient computation operation, the processing units 1401 to 1404 can still read coefficients from the second coefficient memory 120 according to the addresses and order shown in
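The three modular operations performed by one processing unit can be modelled in a few lines. This is a behavioural sketch only: the function name `butterfly` is hypothetical, the values in the usage note are arbitrary, and the sketch returns the sum and the difference as a pair without deciding where each result is finally stored.

```python
def butterfly(coeff_a, coeff_b, twiddle, q):
    """Model one calculation procedure of a single processing unit.

    coeff_b is multiplied by the twiddle factor modulo q (modular
    multiplication unit), then the product is added to and subtracted
    from coeff_a modulo q (modular addition and subtraction units).
    """
    value = (coeff_b * twiddle) % q
    added = (coeff_a + value) % q
    # Python's % always returns a non-negative result for a positive modulus.
    subtracted = (coeff_a - value) % q
    return added, subtracted
```

For instance, with q = 17, `butterfly(3, 5, 2, 17)` first forms the value 10 and then returns the pair (13, 10).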
In the present embodiment, the multiplexer MUX1 alternately outputs data received by the first input terminal of multiplexer MUX1 and data received by the second input terminal of multiplexer MUX1. Further, when the multiplexer MUX1 outputs the data received by the first input terminal of the multiplexer MUX1, the multiplexer MUX2 may output the data received by the second input terminal of the multiplexer MUX2; when the multiplexer MUX1 outputs the data received by the second input terminal of the multiplexer MUX1, the multiplexer MUX2 may output the data received by the first input terminal of the multiplexer MUX2.
In such case, the coefficient exchange unit 148 may alternately output the to-be-arranged coefficients obtained by calculation in the calculation procedure by the processing unit 1401. For example, if the processing unit 1401 follows the order shown in
As a result, in the second round of coefficient computation operation, the processing unit 1401 can follow the same addresses and order according to
In other words, the processing units 1401 to 1404 can rearrange the order of output coefficients with the coefficient exchange unit 148, such that in each calculation procedure, the processing units 1401 to 1404 can obtain corresponding coefficients from the first coefficient memory 110 or the second coefficient memory 120 according to the addresses and order of
In view of the foregoing, the calculator and calculation method of the present disclosure can perform modulo calculations of number-theoretic transformation using multiple processing units in parallel, and can access the data in two coefficient memories according to a specific order, thereby simplifying the wirings between the processing units and coefficient memories and improving the overall computation performance thereof.
The foregoing description briefly sets forth the features of some embodiments of the present application so that persons having ordinary skill in the art more fully understand the various aspects of the disclosure of the present application. It may be apparent to those having ordinary skill in the art that they can easily use the disclosure of the present application as a basis for designing or modifying other processes and structures to achieve the same purposes and/or benefits as the embodiments herein. It should be understood by those having ordinary skill in the art that these equivalent implementations still fall within the spirit and scope of the disclosure of the present application and that they may be subject to various variations, substitutions, and alterations without departing from the spirit and scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202210608229.8 | May 2022 | CN | national |