The present invention generally relates to a device for processing homomorphically encrypted data and a system comprising the device, and more particularly, to a hardware implementation of discrete Galois transform (DGT) and/or inverse discrete Galois transform (iDGT) operations for processing homomorphically encrypted data.
Fully homomorphic encryption (FHE) allows arbitrary operations to be performed on encrypted data without the need for the decryption key at any stage of computation. However, the long latency of conventional implementations of FHE multiplication prevents FHE from being widely used to solve a wide range of privacy-preserving computing problems in the cloud and on untrusted servers. The situation is even worse in artificial intelligence (AI), which demands more computationally intensive data processing.
For example, a residue number system (RNS) implementation of homomorphic multiplication may frequently call discrete Galois transform (DGT) and inverse discrete Galois transform (iDGT) operations, which are among the most computationally intensive operations involved. In particular, these transforms require O(n log n) complexity, on top of other operations that run in O(n).
DGT and iDGT operations with a conventional CPU (central processing unit) implementation are slower than with a conventional GPU (graphics processing unit) implementation because the GPU has more Floating Point (FP) cores for parallel computation. However, the latency of the conventional GPU implementation, although better than that of the conventional CPU implementation, is still too long for practical use in homomorphic operations (e.g., homomorphic multiplication), especially for AI applications, which demand huge computations and heavy data movement between FP cores and memory units.
A need therefore exists to provide a device for processing homomorphically encrypted data that seeks to overcome, or at least ameliorate, one or more deficiencies of conventional devices for processing homomorphically encrypted data, such as, but not limited to, improving performance and/or throughput in processing homomorphically encrypted data, and more particularly, in relation to a hardware implementation of DGT and/or iDGT operations for processing homomorphically encrypted data. It is against this background that the present invention has been developed.
According to a first aspect of the present invention, there is provided a device for processing homomorphically encrypted data, the device comprising:
According to a second aspect of the present invention, there is provided a system comprising:
According to a third aspect of the present invention, there is provided a method of forming a device for processing homomorphically encrypted data, the method comprising:
Embodiments of the present invention will be better understood and readily apparent to one of ordinary skill in the art from the following written description, by way of example only, and in conjunction with the drawings, in which:
Various embodiments of the present invention provide a device for processing homomorphically encrypted data and a system comprising the device, and more particularly, relating to a hardware implementation of discrete Galois transform (DGT) and/or inverse discrete Galois transform (iDGT) operations for processing homomorphically encrypted data in relation to a homomorphic operation (e.g., a homomorphic multiplication operation).
For example, as explained in the background, the long latency of conventional implementations of fully homomorphic encryption (FHE) multiplication prevents FHE from being widely used to solve a wide range of privacy-preserving computing problems in the cloud and on untrusted servers, where FHE would allow arbitrary operations to be performed on encrypted data without the need for the decryption key at any stage of computation. For example, a residue number system (RNS) implementation of homomorphic multiplication may frequently call DGT and iDGT operations, which are among the most computationally intensive operations involved. There exist conventional CPU (central processing unit) and GPU (graphics processing unit) implementations of DGT and iDGT operations, with the conventional CPU implementation being slower than the conventional GPU implementation because the GPU has more Floating Point (FP) cores for parallel computation. However, the latency of the conventional GPU implementation, although better than that of the conventional CPU implementation, is still too long for practical use in homomorphic operations (e.g., homomorphic multiplication), especially for AI applications, which demand huge computations and heavy data movement between FP cores and memory units.
Accordingly, various embodiments of the present invention provide a device for processing homomorphically encrypted data that seeks to overcome, or at least ameliorate, one or more deficiencies of conventional devices for processing homomorphically encrypted data, such as, but not limited to, improving performance and/or throughput in processing homomorphically encrypted data, and more particularly, in relation to a hardware implementation of DGT and/or iDGT operations for processing homomorphically encrypted data in relation to a homomorphic operation (e.g., homomorphic multiplication).
Accordingly, the device 100 according to various embodiments is advantageously configured with a pipeline architecture having a single cycle initiation interval, thereby resulting in improved performance and/or throughput in processing homomorphically encrypted data, and more particularly, in relation to a hardware implementation of DGT and/or iDGT operations for processing homomorphically encrypted data. These advantages or technical effects, or other advantages or technical effects, will become more apparent to a person skilled in the art as the device 100 is described in more detail according to various embodiments and example embodiments of the present invention.
In various embodiments, the device 100 further comprises a data point arranging block configured to receive the plurality of parallel input data points derived from the homomorphically encrypted data and arrange the plurality of parallel input data points received into the plurality of columns of input data points to form the matrix of input data points.
In various embodiments, each inter-line butterfly array block of the plurality of inter-line butterfly array blocks 108-1, . . . , 108-n comprises a plurality of first-in-first-out (FIFO) input data buffers. In this regard, for the above-mentioned each inter-line butterfly array block, each FIFO input data buffer of the plurality of FIFO input data buffers is communicatively coupled to the plurality of inter-line modulus butterfly units of the inter-line butterfly array block and is configured to receive a plurality of columns of data points and output each of the plurality of columns of data points to the plurality of inter-line modulus butterfly units column-by-column in FIFO order for each of the plurality of inter-line modulus butterfly units to perform the above-mentioned modulus butterfly operation based on the computation pair of data points received.
In various embodiments, for the above-mentioned each inter-line butterfly array block: the above-mentioned each of the plurality of columns of data points has a number of data points being half of the number of input data points in a column of the plurality of columns of input data points, and the plurality of inter-line modulus butterfly units of the inter-line butterfly array block has a number of inter-line modulus butterfly units being half of the number of input data points in the column of the plurality of columns of input data points.
In various embodiments, for the above-mentioned each inter-line butterfly array block: the inter-line butterfly array block comprises a first set of multiplexer units, each FIFO input data buffer of the plurality of FIFO input data buffers of the inter-line butterfly array block being communicatively coupled to the plurality of inter-line modulus butterfly units of the inter-line butterfly array block via a multiplexer unit of the first set of multiplexer units, and the clock counter 120 is communicatively coupled to each multiplexer unit of the first set of multiplexer units for controlling the inter-line butterfly array block to operate with single cycle initiation interval.
In various embodiments, a first FIFO input data buffer and a third FIFO input data buffer of the plurality of FIFO input data buffers of the inter-line butterfly array block are each communicatively coupled to a first multiplexer unit of the first set of multiplexer units. In this regard, the first multiplexer unit is configured to output a column of data points of the plurality of columns of data points from a selected FIFO input data buffer amongst the first and third FIFO input data buffers to the plurality of inter-line modulus butterfly units of the inter-line butterfly array block. In this regard, the selected FIFO input data buffer is selected based on the counter signal received by the first multiplexer unit of the first set of multiplexer units.
In various embodiments, a second FIFO input data buffer and a fourth FIFO input data buffer of the plurality of FIFO input data buffers of the inter-line butterfly array block are each communicatively coupled to a second multiplexer unit of the first set of multiplexer units. In this regard, the second multiplexer unit is configured to output a column of data points of the plurality of columns of data points from a selected FIFO input data buffer amongst the second and fourth FIFO input data buffers to the plurality of inter-line modulus butterfly units of the inter-line butterfly array block. In this regard, the selected FIFO input data buffer is selected based on the counter signal received by the second multiplexer unit of the first set of multiplexer units.
In various embodiments, for each inter-line butterfly array block from a first inter-line butterfly array block to a penultimate inter-line butterfly array block of the plurality of inter-line butterfly array blocks 108-1, . . . , 108-n: the inter-line butterfly array block further comprises a second set of multiplexer units, each FIFO input data buffer of the plurality of FIFO input data buffers of an immediately subsequent inter-line butterfly array block of the plurality of inter-line butterfly array blocks 108-1, . . . , 108-n with respect to the inter-line butterfly array block is communicatively coupled to the plurality of inter-line modulus butterfly units of the inter-line butterfly array block via a multiplexer unit of the second set of multiplexer units. In various embodiments, the clock counter 120 is communicatively coupled to each multiplexer unit of the second set of multiplexer units for controlling the inter-line butterfly array block to operate with single cycle initiation interval.
In various embodiments, the first FIFO input data buffer and the second FIFO input data buffer of the plurality of FIFO input data buffers of the immediately subsequent inter-line butterfly array block are each communicatively coupled to a first multiplexer unit of the second set of multiplexer units. In this regard, the first multiplexer unit is configured to output a first portion of a column of data points from the plurality of inter-line modulus butterfly units of the inter-line butterfly array block to a selected FIFO input data buffer amongst the first and second FIFO input data buffers. In this regard, the selected FIFO input data buffer is selected based on the counter signal received by the first multiplexer unit of the second set of multiplexer units.
In various embodiments, the third FIFO input data buffer and the fourth FIFO input data buffer of the plurality of FIFO input data buffers of the immediately subsequent inter-line butterfly array block are each communicatively coupled to a second multiplexer unit of the second set of multiplexer units. In this regard, the second multiplexer unit is configured to output a second portion of a column of data points from the plurality of inter-line modulus butterfly units of the inter-line butterfly array block to a selected FIFO input data buffer amongst the third and fourth FIFO input data buffers. In this regard, the selected FIFO input data buffer is selected based on the counter signal received by the second multiplexer unit of the second set of multiplexer units.
In various embodiments, the first inter-line butterfly array block further comprises a third set of multiplexer units. In various embodiments, the first and second FIFO input data buffers of the plurality of FIFO input data buffers of the first inter-line butterfly array block are each communicatively coupled to a first multiplexer unit of the third set of multiplexer units. In this regard, the first multiplexer unit is configured to output a first portion of a column of data points from an input register to a selected FIFO input data buffer amongst the first and second FIFO input data buffers. In this regard, the selected FIFO input data buffer is selected based on the counter signal received by the first multiplexer unit of the third set of multiplexer units.
In various embodiments, the third and fourth FIFO input data buffers of the plurality of FIFO input data buffers of the first inter-line butterfly array block are each communicatively coupled to a second multiplexer unit of the third set of multiplexer units. In this regard, the second multiplexer unit is configured to output a second portion of the column of data points from the input register to a selected FIFO input data buffer amongst the third and fourth FIFO input data buffers. In this regard, the selected FIFO input data buffer is selected based on the counter signal received by the second multiplexer unit of the third set of multiplexer units. In various embodiments, the clock counter 120 is communicatively coupled to each multiplexer unit of the third set of multiplexer units for controlling the first inter-line butterfly array block to operate with single cycle initiation interval.
In various embodiments, a last inter-line butterfly array block of the plurality of inter-line butterfly array blocks 108-1, . . . , 108-n further comprises: a second set of multiplexer units; a third multiplexer unit; and a plurality of FIFO output data buffers. In various embodiments, each FIFO output data buffer of the plurality of FIFO output data buffers of the last inter-line butterfly array block is communicatively coupled to the plurality of inter-line modulus butterfly units of the last inter-line butterfly array block via a multiplexer unit of the second set of multiplexer units. Furthermore, the third multiplexer unit is configured to output a column of data points from a selected FIFO output data buffer amongst the plurality of FIFO output data buffers. In this regard, the selected FIFO output data buffer is selected based on the counter signal received by the third multiplexer unit. In various embodiments, the clock counter 120 is communicatively coupled to each multiplexer unit of the second set of multiplexer units and the third multiplexer unit for controlling the last inter-line butterfly array block to operate with single cycle initiation interval.
In various embodiments, each intra-line butterfly array block of the plurality of intra-line butterfly array blocks 112-1, . . . , 112-n comprises an input register. In this regard, for the above-mentioned each intra-line butterfly array block, the input register is communicatively coupled to the plurality of intra-line modulus butterfly units of the intra-line butterfly array block and is configured to receive a column of data points and output the column of data points to the plurality of intra-line modulus butterfly units for each of the plurality of intra-line modulus butterfly units to perform said modulus butterfly operation based on the computation pair of data points received.
In various embodiments, the device 100 further comprises a weight modulus multiplication block comprising a plurality of modulus multiplication units, each modulus multiplication unit being configured to perform a modulus multiplication operation based on a data point received.
In various embodiments (e.g., in the case of the pipeline being configured to perform a DGT operation), the plurality of intra-line butterfly array blocks 112-1, . . . , 112-n are arranged after (i.e., subsequent to) the plurality of inter-line butterfly array blocks 108-1, . . . , 108-n in the pipeline. In various embodiments, the pipeline is configured to perform a DGT of the plurality of parallel input data points.
In various embodiments (e.g., in the case of the pipeline being configured to perform an iDGT operation), the plurality of inter-line butterfly array blocks 108-1, . . . , 108-n are arranged after (i.e., subsequent to) the plurality of intra-line butterfly array blocks 112-1, . . . , 112-n in the pipeline. In various embodiments, the pipeline is configured to perform an iDGT of the plurality of parallel input data points.
In various embodiments, the plurality of parallel input data points has 2^n number of parallel input data points. In this regard, the matrix has 2^r number of rows of input data points and 2^(n−r) number of columns of input data points, wherein n≥4, r≥2 and r&lt;n. Furthermore, the plurality of inter-line butterfly array blocks 108-1, . . . , 108-n has q number of inter-line butterfly array blocks, wherein q=n−r, and the plurality of intra-line butterfly array blocks 112-1, . . . , 112-n has r number of intra-line butterfly array blocks.
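To make the arrangement concrete, the following is a minimal C++ sketch (an illustration only, not the claimed device; it assumes the natural ordering in which each column of the matrix is a consecutive block of L = 2^r input points, which is consistent with the same-row/same-column pairing analysis described later, and the function name and container choice are hypothetical):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical sketch: split 2^n input points into 2^(n-r) columns of
    // L = 2^r consecutive points each; columns are then streamed into the
    // pipeline one column per clock cycle.
    std::vector<std::vector<uint64_t>> arrange_columns(
        const std::vector<uint64_t>& input, unsigned n, unsigned r) {
      const std::size_t L    = std::size_t{1} << r;        // points per column
      const std::size_t cols = std::size_t{1} << (n - r);  // number of columns
      std::vector<std::vector<uint64_t>> matrix(cols);
      for (std::size_t c = 0; c < cols; ++c)
        matrix[c].assign(input.begin() + c * L, input.begin() + (c + 1) * L);
      return matrix;
    }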
In various embodiments, the device 100 is a field-programmable gate array (FPGA) device or an application specific integrated circuit (ASIC) device.
In various embodiments, the method 400 is for forming the device 100 as described hereinbefore with reference to
In various embodiments, the device 100 is formed as an FPGA device (integrated circuit) by configuring the FPGA device as described herein with respect to the device 100 according to various example embodiments. In various embodiments, the device 100 is formed as an ASIC device (integrated circuit) by configuring the ASIC device as described herein with respect to the device 100 according to various example embodiments. In various embodiments, the system 200 may also be embodied as a device or an apparatus.
A computing system, a controller, a microcontroller or any other system providing a processing capability may be presented according to various embodiments in the present disclosure. Such a system may be taken to include one or more processors and one or more computer-readable storage mediums. For example, the system 200 described hereinbefore may include a processor (or controller) 208 and a computer-readable storage medium (or memory) 204 which are for example used in various processing carried out therein as described herein. A memory or computer-readable storage medium used in various embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
In various embodiments, a “circuit” may be understood as any kind of a logic implementing entity, which may be special purpose circuitry. Thus, in various embodiments, a “circuit” may be a hard-wired logic circuit or a programmable logic circuit such as a programmable processor, e.g., a microprocessor (e.g., a Complex Instruction Set Computer (CISC) processor or a Reduced Instruction Set Computer (RISC) processor), an ASIC or an FPGA.
Some portions of the present disclosure are explicitly or implicitly presented in terms of algorithms and functional or symbolic representations of operations on data within a computer memory. These algorithmic descriptions and functional or symbolic representations are the means used by those skilled in the data processing arts to convey most effectively the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities, such as electrical, magnetic or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated.
Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions utilizing terms such as “sending”, “receiving”, “controlling”, “executing” or the like, refer to the actions and processes of a computer system, or similar electronic device, that manipulates and transforms data represented as physical quantities within the computer system into other data similarly represented as physical quantities within the computer system or other information storage or transmission devices.
The present specification also discloses a system (e.g., which may also be embodied as a device or an apparatus), such as the system 200, for performing the operations/functions of various methods described herein. Such a system or apparatus may be specially constructed for the required purposes, or may comprise a general purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer. The algorithms presented herein are not inherently related to any particular computer or other apparatus. Various general purpose machines may be used with computer programs in accordance with the teachings herein. Alternatively, the construction of more specialized apparatus to perform the required method steps may be appropriate.
In addition, the present specification also at least implicitly discloses a computer program or software/functional module, in that it would be apparent to the person skilled in the art that individual steps of various methods (e.g., the method 300 of operating the system 200) described herein may be put into effect by computer code. The computer program is not intended to be limited to any particular programming language and implementation thereof. It will be appreciated that a variety of programming languages and coding thereof may be used to implement the methods/techniques of the disclosure contained herein. Moreover, the computer program is not intended to be limited to any particular control flow. There are many other variants of the computer program, which can use different control flows without departing from the scope of the present invention. It will be appreciated to a person skilled in the art that various modules may be software module(s) realized by computer program(s) or set(s) of instructions executable by a computer processor to perform the required functions, or may be hardware module(s) being functional hardware unit(s) designed to perform the required functions. It will also be appreciated that a combination of hardware and software modules may be implemented.
Furthermore, one or more of the steps of the computer program/module or method may be performed in parallel rather than sequentially. Such a computer program may be stored on any computer readable medium. The computer readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer. The computer program when loaded and executed on such a general-purpose computer effectively results in an apparatus that implements steps of various methods described herein.
In various embodiments, there is provided a computer program product, embodied in one or more computer-readable storage mediums (non-transitory computer-readable storage medium), comprising instructions executable by one or more computer processors (e.g., the processor 208) to perform a method 300 of operating the system 200 as described hereinbefore with reference to
Various software or functional modules described herein may also be implemented as hardware modules. More particularly, in the hardware sense, a module is a functional hardware unit designed for use with other components or modules. For example, a module may be implemented using discrete electronic components, or it can form a portion of an entire electronic circuit such as an Application Specific Integrated Circuit (ASIC). Numerous other possibilities exist.
In various embodiments, the system 200 may be realized by or embodied as any computer system (e.g., desktop or portable computer system) including at least one processor and a memory, such as a computer system 500 as schematically shown in
It will be appreciated by a person skilled in the art that the terminology used herein is for the purpose of describing various embodiments only and is not intended to be limiting of the present invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Any reference to an element or a feature herein using a designation such as “first”, “second” and so forth does not limit the quantity or order of such elements or features, unless stated or the context requires otherwise. For example, such designations may be used herein as a convenient way of distinguishing between two or more elements or instances of an element. Thus, unless stated or the context requires otherwise, a reference to first and second elements does not necessarily mean that only two elements can be employed, or that the first element must precede the second element. In addition, a phrase referring to “at least one of” a list of items refers to any single item therein or any combination of two or more items therein.
In order that the present invention may be readily understood and put into practical effect, various example embodiments of the present invention will be described hereinafter by way of examples only and not limitations. It will be appreciated by a person skilled in the art that the present invention may, however, be embodied in various different forms or configurations and should not be construed as limited to the example embodiments set forth hereinafter. Rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the present invention to those skilled in the art.
Various example embodiments provide a single cycle initiation interval (II) pipeline architecture of iDGT and DGT, which is implemented in hardware, such as FPGA. It will be understood by a person skilled in the art that the hardware implementation of the single cycle initiation interval pipeline architecture of iDGT and DGT disclosed according to various example embodiments is not limited to FPGA, and may be implemented as other types of integrated circuits in the field of very large-scale integration (VLSI) as desired or as appropriate, such as but not limited to, ASIC. Various example embodiments find that such a customized hardware design (single cycle initiation interval pipeline architecture) can achieve a lower latency of homomorphic operation (e.g., homomorphic multiplication) on homomorphically encrypted data according to simulations on, for example, FPGA, even at low frequencies such as 200 MHz. As known by a person skilled in the art, the initiation interval is the number of cycle(s) that must elapse between issuing two operations of a given type, or in other words, the number of cycle(s) between new data inputs to an operation or a function. Therefore, a single cycle initiation interval means that a function or an operation can accept new data in the next cycle without further delay. As an example, Table 1 below shows a performance comparison amongst an example FPGA implementation according to various example embodiments of the present invention, a conventional CPU implementation and a conventional GPU implementation of DGT and iDGT for n=2^16 (where n denotes the number of parallel input data points) with respect to the respective latency achieved in performing homomorphic multiplication:
For example, it may be necessary for DGT and iDGT to handle both big data movement and extensive modular arithmetic operations. In this regard, the conventional GPU implementation has improved performance over the conventional CPU implementation by using more parallel FP cores on die. Moreover, the conventional GPU implementation is embedded with high-speed low latency local memory in order to reduce the latency of data transfer. However, the conventional GPU implementation is known for its limited local memory size, which is not expandable.
In contrast, various example embodiments provide a single cycle initiation interval pipeline hardware architecture of DGT and iDGT capable of handling FHE data transfer without unnecessary bubbles in the pipeline to achieve lower latency. For example, the pipeline can be tailed with a variable number of parallel modulus multiplication units to match the data throughput and latency. Furthermore, multiple pipelines of DGT and iDGT are able to operate in parallel to increase performance of a homomorphic operation (e.g., homomorphic multiplication). Various example embodiments find that the single cycle initiation interval pipeline hardware architecture is advantageously cycle accurate and can be scaled without the latency penalty which the conventional GPU implementation suffers from.
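As a purely illustrative sketch (the loop below is a hypothetical toy, not the claimed design), a single cycle initiation interval is typically requested in Xilinx HLS C++, the tool flow used for the simulations reported later, with a PIPELINE pragma:

    #include <cstdint>

    // Toy HLS-style kernel: one butterfly per loop iteration. The PIPELINE
    // pragma asks the scheduler for II=1, i.e., a new iteration (new input
    // data) is accepted on every clock cycle, assuming no resource or
    // dependency conflicts.
    void butterfly_stream(const uint64_t in[1024], uint64_t out[1024],
                          uint64_t p) {
      for (int i = 0; i < 512; ++i) {
    #pragma HLS PIPELINE II=1
        uint64_t a = in[i];
        uint64_t b = in[i + 512];
        out[i] = (a + b) % p;            // sum output of the butterfly
        out[i + 512] = (a + p - b) % p;  // difference output, kept non-negative
      }
    }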
As an example overview,
As shown in
At stage 2, computations are performed according to Equation (2) below:
At the stage 3, computations are performed according to Equation (3) below:
At the stage 4, computations are performed according to Equation (4) below:
In relation to the DGT data flow, various example embodiments made the following observations. As a first observation (observation 1), each computation pair (or calculation pair) of data points at stage 1 is between a data point in the first half of the parallel data points and a data point in the second half of the parallel data points at a distance of 2^3 from the data point in the first half. In this example, 2^3 is the half data length.
As a second observation (observation 2), each computation pair of data points at stage 2 is always either within the first half or within the second half of the parallel data points, whereby each computation pair of data points within the first half or the second half is computed in the manner as described in the above-mentioned observation 1 but with half the distance (i.e., 2^2) between computation pairs compared to the immediately previous stage (i.e., stage 1).
As a third observation (observation 3), each computation pair at the next stage (e.g., stage 3) is always either within the first half or within the second half of parallel data points, whereby each computation pair of data points within the first half or the second half is computed in the manner as described in the above-mentioned observation 1 but with half of the distance between computation pairs compared to the immediately previous stage (e.g., stage 2).
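These observations follow the standard decimation-type indexing in which the pair distance halves at every stage; a minimal sketch (a hypothetical helper with 0-based stage numbering, unlike the 1-based stages in the example above) that enumerates the computation pairs for 2^n data points:

    #include <cstdio>

    // Enumerate butterfly computation pairs for N = 2^n points. At 0-based
    // stage s the partners are dist = 2^(n-1-s) apart; (i & dist) == 0
    // selects the first element of each pair exactly once.
    void print_pairs(unsigned n) {
      const unsigned N = 1u << n;
      for (unsigned s = 0; s < n; ++s) {
        const unsigned dist = N >> (s + 1);  // distance halves every stage
        std::printf("stage %u:", s);
        for (unsigned i = 0; i < N; ++i)
          if ((i & dist) == 0)
            std::printf(" (%u,%u)", i, i + dist);
        std::printf("\n");
      }
    }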
For performing DGT on a plurality of parallel data points with data size of 2^n, n stages of hardware may be designed in a pipeline to process the plurality of parallel data points. However, various example embodiments note that it may not be feasible to parallelize 2^n data points to perform various homomorphic operations, and in particular, multiplication modulo, when n is too large. For example, hardware architecture is restricted by resources and throughput. To address this problem, various example embodiments parallelize 2^r (r&lt;n) data points, and more specifically, a plurality (e.g., 2^n) of parallel data points are arranged into a plurality (e.g., 2^(n−r)) of columns of data points (parallel data points) to form a matrix of input data points to be processed as illustrated in
Various example embodiments found that at the last r stages of DGT, each computation pair of data points is within the same column of data points. In this regard, various example embodiments provide r stages of intra-line (which may also be referred to as intra-column) butterfly array blocks (which may also be referred to as intra-line multiplication modulo array blocks or modules) configured or designed to perform the corresponding computations (or calculations), whereby each stage (or each intra-line butterfly array block) has 2^(r−1) parallel intra-line modulus butterfly units (which may also be referred to as multiplication modulo units). In contrast, from stages 0 to n−r−1 of DGT, various example embodiments found that the gap of each computation pair of data points across two columns of data points is 2^(n−r−1−s) columns at stage s, s∈[0, n−r−1]. In this regard, various example embodiments further found that each computation pair of data points across two columns must be at the same row, since the gap of each computation pair of data points must be 2^(n−1−s) data points at stage s, and a pair not at the same row would not have a gap of 2^(n−1−s) at stage s. Accordingly, various example embodiments found that each computation pair is either within a same column of data points or within a same row of data points. Accordingly, these findings advantageously establish that a plurality of columns of input data points can be input sequentially one column by one column (i.e., column by column) and processed in such a way that all computation pairs at the same row are computed or calculated in the same butterfly unit, which facilitates or enables the development of the single cycle initiation interval pipeline hardware architecture of DGT (and similarly for the single cycle initiation interval pipeline hardware architecture of iDGT) as described herein according to various example embodiments of the present invention, and which will be described in further detail below.
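The same-row/same-column property can be checked mechanically; the sketch below (hypothetical, and assuming as in the earlier arrangement sketch that each column is a consecutive block of L = 2^r points, so that a point's row is its index mod L and its column is its index div L) asserts that every pair at stages 0 to n−r−1 shares a row and every pair at the last r stages shares a column:

    #include <cassert>

    // Verify the row/column claim for 2^n points arranged into columns of
    // 2^r consecutive points: a pair (i, i + 2^(n-1-s)) shares a row for
    // s < n-r (inter-line stages) and a column for s >= n-r (intra-line).
    void check_pair_layout(unsigned n, unsigned r) {
      const unsigned N = 1u << n, L = 1u << r;
      for (unsigned s = 0; s < n; ++s) {
        const unsigned dist = N >> (s + 1);
        for (unsigned i = 0; i < N; ++i) {
          if (i & dist) continue;                 // visit each pair once
          const unsigned j = i + dist;
          if (s < n - r) assert(i % L == j % L);  // same row, different column
          else           assert(i / L == j / L);  // same column
        }
      }
    }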
According to various example embodiments, as input data points are fed in one column by one column 816, each of the plurality of inter-line butterfly array blocks 808 comprises four FIFO input buffers for achieving single cycle initiation interval pipeline with a consistent throughput. As the plurality of inter-line butterfly array blocks 808 are arranged and communicatively coupled in series, two adjacent inter-line butterfly array blocks may share one or more common FIFO buffers, that is, one or more FIFO input buffers of an immediately subsequent inter-line butterfly block with respect to an inter-line butterfly block may also function or serve as one or more FIFO output buffers of the inter-line butterfly block. For example, the total data throughput of the single cycle initiation interval pipeline may be expressed as:
Total data throughput (MB/s) = Clock speed (MHz) × Parallel data points per cycle × Bytes per data point    (Equation 5)
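As a hypothetical worked example of Equation 5, using the clock frequency and parallelism figures that appear in the simulation results later and assuming 8-byte data points: at a 200 MHz clock with L=64 parallel data points per cycle, the total data throughput would be 200 × 64 × 8 = 102,400 MB/s.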
As an example overview,
As shown in
Similar to the DGT as described hereinbefore, for the iDGT according to various example embodiments, a plurality (e.g., 2^n) of parallel data points are arranged into a plurality (e.g., 2^(n−r)) of columns of data points (parallel data points) (forming or constituting a matrix of data points) to be processed. Furthermore, corresponding to the DGT as described hereinbefore, for the first r stages of iDGT, each computation pair of data points is within the same column of data points. In this regard, various example embodiments provide a plurality (e.g., r stages) of intra-line butterfly array blocks (which may also be referred to as intra-line multiplication modulo array modules or circuits) configured or designed to perform the corresponding computations (or calculations), whereby each stage has 2^(r−1) parallel intra-line modulus butterfly units (which may also be referred to as multiplication modulo units). In contrast, for the subsequent q stages, various example embodiments found that the gap of computation pairs of data points across two columns of data points is 2^(r+s) at stage s, s∈[0, n−r−1]. In this regard, various example embodiments further found that the computation pairs of data points across two columns must be at the same row, since the gap of computation pairs of data points must be 2^(r+s) at stage s, and a pair not at the same row would not have a gap of 2^(r+s) at stage s. Accordingly, various example embodiments found that the computation pairs are either within the same column of data points or within the same row of data points. Accordingly, similar to the DGT as described hereinbefore, these findings advantageously establish that a plurality of parallel input data points can be input sequentially one column by one column (i.e., column by column) and processed in such a way that all computation pairs at the same row are computed or calculated in the same butterfly unit, which facilitates or enables the development of the single cycle initiation interval pipeline hardware architecture of iDGT as described herein according to various example embodiments of the present invention, and which will be described in further detail below.
According to various example embodiments, similar to the DGT circuit 800 described hereinbefore, as input data points are fed in one column by one column 1016, each of the plurality of inter-line butterfly array blocks 1008 comprises four FIFO input buffers for achieving single cycle initiation interval pipeline with a consistent throughput. As the plurality of inter-line butterfly array blocks 1008 are arranged and communicatively coupled in series, two adjacent inter-line butterfly array blocks may share one or more common FIFO buffers, that is, one or more FIFO input buffers of an immediately subsequent inter-line butterfly block with respect to an inter-line butterfly block may also function or serve as one or more FIFO output buffers of the inter-line butterfly block.
In various example embodiments, L points of data are fed in parallel into the pipeline with single cycle of initiation interval through q stages of inter-line butterfly array blocks and r stages of intra-line butterfly array blocks. This achieves a throughput of L×clock. In various example embodiments, to prevent pipeline bubble, all the stages of the DGT circuit 800 are configured to have the same throughput, and similarly, all the stages of the iDGT circuit 1000 are configured to have the same throughput. In this regard, various example embodiments note that, at the q stages of inter-line butterfly array blocks, if a two-buffer pipeline 1110 as shown in
For example, the FPGA device 1250 may be an FPGA card (or FPGA board) comprising (or having disposed thereon) at least one global memory chip 1270 and at least one FPGA chip 1254 which implements the processing units of DGT and/or iDGT (the DGT circuit 800 and/or the iDGT circuit 1000). The FPGA device 1250 may further comprise a PCIe bus 1258 with PCIe controller logic 1262 and a memory controller 1266 for accessing the FPGA global memory 1270. The FPGA device 1250 may be plugged into a host computer 1202. The input data 1212 in the host computer memory 1204 may be transferred to the FPGA global memory 1270 in a FIFO order. The input data may then be transferred from the FPGA global memory 1270 to buffers inside the FPGA device 1250 in a consistent flow based on (or via) the memory controller 1266. After being processed by the processing units of the DGT/iDGT accelerator 1254, the output data 1216 may then flow into the FPGA global memory 1270 based on (or via) the memory controller 1266. Thereafter, the output data is transferred to the host memory 1204 of the host computer 1202 by the PCIe controller 1262 via the PCIe bus 1258.
In various example embodiments, the memory controller 1266 (e.g., corresponding to the data point arranging block as described hereinbefore according to various embodiments) may be configured to arrange input data 1212-1 (e.g., a plurality of parallel input data points received from the host computer 1202 via the FPGA global memory 1270) into a plurality of columns of data points 1212-2 to form a matrix of input data points to be processed by the DGT/iDGT accelerator 1254, one column by one column, as described hereinbefore with reference to
For better understanding, example implementations of various stages of the DGT circuit 800 shown in
As shown in
At the inter-line butterfly stage 0 (i.e., inter-line butterfly array block 808-0), L parallel data points may be latched into an input register (R) 1504 (which corresponds to, or is common with, the output register 1316 of the weight modulus multiplication block 804). The L parallel data points are then pushed into four input data buffers, namely In_FIFO A, In_FIFO B, In_FIFO C and In_FIFO D (e.g., corresponding to the first, second, third and fourth FIFO input data buffers, respectively, as described hereinbefore according to various embodiments), via a set of multiplexer units 1506 (corresponding to the third set of multiplexer units of the first inter-line butterfly array block as described hereinbefore according to various embodiments), according to the rule below:
if ((C >> (q−1)) & 1 == 0)
    R[0, L/2−1] → In_FIFO A
    R[L/2, L−1] → In_FIFO C
else
    R[0, L/2−1] → In_FIFO B
    R[L/2, L−1] → In_FIFO D    (Equation 6)
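A software model of this routing rule might look as follows (a minimal sketch, assuming one std::deque per FIFO and a free-running column counter C; the names are illustrative only):

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <utility>
    #include <vector>

    using Column = std::vector<uint64_t>;  // L parallel data points

    // Hypothetical model of the Equation 6 routing at inter-line stage 0:
    // bit (q-1) of the column counter C steers each half-column into the
    // A/C or B/D FIFO pair, so that reads and writes never collide and the
    // stage can accept a new column every cycle.
    void route_stage0_input(const Column& R, uint64_t C, unsigned q,
                            std::deque<Column>& fifoA, std::deque<Column>& fifoB,
                            std::deque<Column>& fifoC, std::deque<Column>& fifoD) {
      const std::size_t half = R.size() / 2;
      Column upper(R.begin(), R.begin() + half);  // R[0 .. L/2-1]
      Column lower(R.begin() + half, R.end());    // R[L/2 .. L-1]
      if (((C >> (q - 1)) & 1) == 0) {
        fifoA.push_back(std::move(upper));
        fifoC.push_back(std::move(lower));
      } else {
        fifoB.push_back(std::move(upper));
        fifoD.push_back(std::move(lower));
      }
    }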
Data points are pulled out from In_FIFO A and In_FIFO B in FIFO order, via a set of multiplexer units 1510 (e.g., corresponding to the first set of multiplexer units of the inter-line butterfly array block as described hereinbefore according to various embodiments), and are pushed into the L/2 modulus butterfly units 1508 when C_0 &amp; 1 == 0, where C_0 = C − Q/2 − 2. Data points are pulled out from In_FIFO C and In_FIFO D in FIFO order, via the set of multiplexer units 1510, and are pushed into the L/2 modulus butterfly units 1508 when C_0 &amp; 1 == 1. After being processed by the L/2 modulus butterfly units 1508, data points at the upper part of the output of the modulus butterfly units 1508 (corresponding to a column of data points denoted by A′ produced by the L/2 modulus butterfly units 1508) are pushed into output data buffer Out_0 FIFO A via a first multiplexer unit of a set of multiplexer units 1514 (e.g., corresponding to the second set of multiplexer units of the inter-line butterfly array block as described hereinbefore according to various embodiments), while data points at the lower part of the output (corresponding to a column of data points denoted by B′) are pushed into output data buffer Out_0 FIFO C via a second multiplexer unit of the set of multiplexer units 1514, when (C_0 >> (q−1)) &amp; 1 == 0. Data points at the upper part of the output (column A′) are pushed into output data buffer Out_0 FIFO B via the first multiplexer unit of the set of multiplexer units 1514, while data points at the lower part of the output (column B′) are pushed into output data buffer Out_0 FIFO D via the second multiplexer unit of the set of multiplexer units 1514, when (C_0 >> (q−1)) &amp; 1 == 1. The four input data buffers with respect to the inter-line butterfly array block 808-0 (In_FIFO A, In_FIFO B, In_FIFO C and In_FIFO D) each has L/2 number of parallel data points. Similarly, the four output data buffers with respect to the inter-line butterfly array block 808-0 (Out_0 FIFO A, Out_0 FIFO B, Out_0 FIFO C and Out_0 FIFO D) each has L/2 number of parallel data points. With the above-described configuration or setup of the inter-line butterfly stage 0 808-0, the initiation interval at this stage achieves a single cycle.
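Since the data points handled by the modulus butterfly units are complex (Gaussian) integers reduced modulo a prime p (as noted below), the arithmetic of a single modulus butterfly unit can be modeled in software as follows. This is a hedged sketch only: the butterfly form (a sum output and a twiddle-scaled difference output) mirrors Equations 7 and 8 later in this description, while the 64-bit widths, the assumption p &lt; 2^63, and the use of the GCC/Clang __uint128_t extension for overflow-free products are choices of the sketch, not of the patent:

    #include <cstdint>

    // Gaussian integer a + b*i with both parts reduced mod p (p < 2^63).
    struct GaussInt { uint64_t re, im; };

    // Scalar modular helpers; __uint128_t avoids overflow in the products.
    uint64_t addm(uint64_t x, uint64_t y, uint64_t p) { return (x + y) % p; }
    uint64_t subm(uint64_t x, uint64_t y, uint64_t p) { return (x + p - y) % p; }
    uint64_t mulm(uint64_t x, uint64_t y, uint64_t p) {
      return static_cast<uint64_t>(static_cast<__uint128_t>(x) * y % p);
    }

    // (x.re + x.im*i) * (y.re + y.im*i) mod p, using i*i = -1.
    GaussInt gauss_mul(GaussInt x, GaussInt y, uint64_t p) {
      return { subm(mulm(x.re, y.re, p), mulm(x.im, y.im, p), p),
               addm(mulm(x.re, y.im, p), mulm(x.im, y.re, p), p) };
    }

    // One modulus butterfly: a' = (a + b) mod p, b' = ((a - b) * m) mod p,
    // where m is the twiddle factor for this computation pair.
    void butterfly(GaussInt& a, GaussInt& b, GaussInt m, uint64_t p) {
      const GaussInt sum  = { addm(a.re, b.re, p), addm(a.im, b.im, p) };
      const GaussInt diff = { subm(a.re, b.re, p), subm(a.im, b.im, p) };
      a = sum;
      b = gauss_mul(diff, m, p);
    }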
Furthermore, a_i, b_i, a′_i, b′_i and m_i are complex integers, whereby
p denotes a prime integer and clock denotes a single bit clock. As shown in
At stage i of the inter-line butterfly (i.e., inter-line butterfly array block 808-i), data points are pulled out from Out_i−1 FIFO A and Out_i−1 FIFO B in FIFO order (i.e., the output data buffers of the immediately preceding inter-line butterfly array block, which may also be referred to as input data buffers of the current inter-line butterfly array block 808-i), via a set of multiplexer units 1910 (e.g., corresponding to the second set of multiplexer units of the inter-line butterfly array block as described hereinbefore according to various embodiments), and are pushed into the L/2 modulus butterfly units 1908 when C_i &amp; 1 == 0, where C_i = C − C_(i−1) − 2 − 2^(q−i). Data points are pulled out from Out_i−1 FIFO C and Out_i−1 FIFO D in FIFO order, via the set of multiplexer units 1910, and are pushed into the L/2 modulus butterfly units 1908 when C_i &amp; 1 == 1. Data points at the upper part of the output of the modulus butterfly units 1908 (corresponding to a column of data points denoted by A′ produced by the L/2 modulus butterfly units 1908) are pushed into Out_i FIFO A via a first multiplexer unit of the set of multiplexer units 1914, while data points at the lower part of the output (corresponding to a column of data points denoted by B′) are pushed into Out_i FIFO C via a second multiplexer unit of the set of multiplexer units 1914, when (C_i >> (q−1−i)) &amp; 1 == 0. Data points at the upper part of the output (column A′) are pushed into Out_i FIFO B while those at the lower part (column B′) are pushed into Out_i FIFO D, via the set of multiplexer units 1914, when (C_i >> (q−1−i)) &amp; 1 == 1. The four output data buffers with respect to the inter-line butterfly array block 808-i, namely Out_i FIFO A, Out_i FIFO B, Out_i FIFO C and Out_i FIFO D, each has L/2 number of parallel data points, where i∈[1, q−2]. With the above-described configuration or setup of the intermediate inter-line butterfly stage i 808-i, the initiation interval at such a stage achieves a single cycle. The configuration and operation of the L/2 modulus butterfly units 1908 are the same as or similar to the L/2 modulus butterfly units 1508 as described hereinbefore and thus need not be repeated for conciseness.
At stage q−1 of the inter-line butterfly (i.e., inter-line butterfly array block 808-n), data points are pulled out from Out_q−2 FIFO A and Out_q−2 FIFO B in FIFO order (i.e., the output data buffers of the immediately preceding inter-line butterfly array block, which may also be referred to as input data buffers of the last inter-line butterfly array block 808-n), via a set of multiplexer units 2010 (e.g., corresponding to the first set of multiplexer units of the inter-line butterfly array block as described hereinbefore according to various embodiments), and are pushed into the L/2 modulus butterfly units 2008 when C_(q−1) &amp; 1 == 0, where C_(q−1) = C − C_(q−2) − 2 − 2. Data points are pulled out from Out_q−2 FIFO C and Out_q−2 FIFO D in FIFO order, via the set of multiplexer units 2010, and are pushed into the L/2 modulus butterfly units 2008 when C_(q−1) &amp; 1 == 1. Data points at the upper part of the output of the modulus butterfly units 2008 (corresponding to a column of data points denoted by A′ produced by the L/2 modulus butterfly units 2008) are pushed into the upper part of L/2 data points (0 to L/2−1) in FIFO_q−1 A, while those at the lower part (corresponding to a column of data points denoted by B′) are pushed into the upper part of L/2 data points (0 to L/2−1) in FIFO_q−1 B, via a set of multiplexer units 2014 (e.g., corresponding to the second set of multiplexer units of the last inter-line butterfly array block as described hereinbefore according to various embodiments), when C_(q−1) &amp; 1 == 0. Data points at the upper part of the output (column A′) are pushed into the lower part of L/2 data points (L/2 to L−1) in FIFO_q−1 A, while those at the lower part (column B′) are pushed into the lower part of L/2 data points (L/2 to L−1) in FIFO_q−1 B, via the set of multiplexer units 2014, when C_(q−1) &amp; 1 == 1. The input data buffers with respect to the last inter-line butterfly array block 808-n, namely Out_q−2 FIFO A, Out_q−2 FIFO B, Out_q−2 FIFO C and Out_q−2 FIFO D, each has L/2 number of parallel data points, while the output data buffers with respect to the last inter-line butterfly array block 808-n, namely FIFO_q−1 A and FIFO_q−1 B, each has L number of parallel data points. Furthermore, data points (0 to L−1) are pulled out from FIFO_q−1 A in FIFO order, via a multiplexer unit 2018 (e.g., corresponding to the third multiplexer unit of the last inter-line butterfly array block as described hereinbefore according to various embodiments), when C_q &amp; 1 == 0, where C_q = C − C_(q−1) − 2, and data points (0 to L−1) are pulled out from FIFO_q−1 B in FIFO order, via the multiplexer unit 2018, when C_q &amp; 1 == 1. With the above-described configuration or setup of the last inter-line butterfly stage q−1 808-n, the initiation interval at such a stage achieves a single cycle. Similarly, the configuration and operation of the L/2 modulus butterfly units 2008 are the same as or similar to the L/2 modulus butterfly units 1508 as described hereinbefore and thus need not be repeated for conciseness.
Accordingly, the DGT circuit 800 comprises a plurality of inter-line butterfly array blocks 808, each inter-line butterfly array block comprising a plurality of inter-line modulus butterfly units 1508/1908/2008, each inter-line modulus butterfly unit being configured to perform a modulus butterfly operation based on a computation pair of data points received corresponding to a pair of input data points at a same row of the matrix of input data points (e.g., as illustrated in the data flow of the DGT circuit 800 shown in
At stage 0 of the intra-line butterfly (i.e., intra-line butterfly array block 812-0), each pair of data points Rin[s] and Rin[s+L/2], along with TF_TAB_q[s], is input to a corresponding modulus butterfly unit 2110 for computation to obtain the results Rout[s] and Rout[s+L/2], where s∈[0, L/2−1]. As an example, computations by a modulus butterfly unit may be performed according to the following equations:
Rout[s] = (Rin[s] + Rin[s+L/2]) mod p    (Equation 7)

Rout[s+L/2] = ((Rin[s] − Rin[s+L/2]) · TF_TAB_q[s]) mod p    (Equation 8)
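Reusing the hypothetical GaussInt type and butterfly helper sketched earlier, intra-line stage 0 can be modeled as L/2 independent butterflies over one column:

    // Software model of intra-line stage 0 (Equations 7 and 8): unit s pairs
    // Rin[s] with Rin[s + L/2] and applies twiddle factor TF_TAB_q[s].
    void intra_line_stage0(const GaussInt* Rin, GaussInt* Rout,
                           const GaussInt* TF_TAB_q, unsigned L, uint64_t p) {
      for (unsigned s = 0; s < L / 2; ++s) {
        GaussInt a = Rin[s];
        GaussInt b = Rin[s + L / 2];
        butterfly(a, b, TF_TAB_q[s], p);  // a = sum, b = twiddled difference
        Rout[s] = a;
        Rout[s + L / 2] = b;
      }
    }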
In particular,
Rout[t·2k+s] = (Rin[t·2k+s] + Rin[t·2k+s+k]) mod p    (Equation 9)

Rout[t·2k+s+k] = ((Rin[t·2k+s] − Rin[t·2k+s+k]) · TF_TAB_q[s]) mod p    (Equation 10)
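In the same hypothetical model, a later intra-line stage with pair distance k (with k halving at each successive stage) tiles the column into blocks of 2k points, per Equations 9 and 10:

    // Software model of an intra-line stage with pair distance k: block t
    // spans indices [t*2k, t*2k + 2k), and offset s pairs with offset s + k.
    void intra_line_stage(const GaussInt* Rin, GaussInt* Rout,
                          const GaussInt* TF_TAB, unsigned L, unsigned k,
                          uint64_t p) {
      for (unsigned t = 0; t < L / (2 * k); ++t) {
        for (unsigned s = 0; s < k; ++s) {
          GaussInt a = Rin[t * 2 * k + s];
          GaussInt b = Rin[t * 2 * k + s + k];
          butterfly(a, b, TF_TAB[s], p);
          Rout[t * 2 * k + s] = a;
          Rout[t * 2 * k + s + k] = b;
        }
      }
    }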
Accordingly, the DGT circuit 800 comprises a plurality of intra-line butterfly array blocks 812, each intra-line butterfly array block comprising a plurality of intra-line modulus butterfly units 2110/2210, each intra-line modulus butterfly unit being configured to perform a modulus butterfly operation based on a computation pair of data points received corresponding to a pair of input data points at a same column of the matrix of input data points (e.g., as described hereinbefore with reference to
Furthermore, as described hereinbefore according to various example embodiments, a clock counter 1502 is communicatively coupled to each inter-line butterfly array block of the plurality of inter-line butterfly array blocks 808 and each intra-line butterfly array block of the plurality of intra-line butterfly array blocks 812, and configured to output a counter signal for controlling each inter-line butterfly array block and said each intra-line butterfly array block to operate with single cycle initiation interval. Accordingly, the plurality of inter-line butterfly array blocks 808 and the plurality of intra-line butterfly array blocks 812 are arranged in series to form a pipeline for processing the matrix of input data points.
As explained hereinbefore and as will be understood by a person skilled in the art, the operation or process (or data flow) of iDGT is generally the reverse of that of DGT as described hereinbefore with reference to
W_k[n] = (W′_k[n] · m_inv) mod p, where n∈[0, L−1]    (Equation 11)
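A minimal sketch of this final scaling, again reusing the hypothetical GaussInt model; note that the scaling factor written m_inv here is a reconstruction of a garbled term in the source and is assumed to be a single modular constant (e.g., the modular inverse of the transform length) applied to every output point:

    // Software model of the final iDGT scaling (Equation 11):
    // W_k[n] = (W'_k[n] * m_inv) mod p for every point n in the column.
    void scale_output(const GaussInt* Wprime, GaussInt* W, GaussInt m_inv,
                      unsigned L, uint64_t p) {
      for (unsigned n = 0; n < L; ++n)
        W[n] = gauss_mul(Wprime[n], m_inv, p);
    }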
For completeness,
Furthermore, a_i, b_i, a′_i, b′_i and m_i are complex integers, whereby
p denotes a prime integer and clock denotes a single bit clock. As shown in
To demonstrate the effectiveness (improved performance) of the DGT circuit 800 and the iDGT circuit 1000 according to various example embodiments of the present invention, simulation results thereof will now be described. For the example simulation, the DGT described according to various example embodiments was designed in C++ code, passed the C simulation, and was synthesized using Xilinx HLS (High Level Synthesis) tools. The DGT was configured based on m=2^15, L=64, r=6 and q=9. At a clock frequency of 200 MHz, the latency of the DGT achieved was 1439 cycles, which is equal to 7.195 µs, while the initiation interval is 1 cycle, as can be seen from the simulation results shown in
For better understanding and illustrative purposes, a detailed data flow of the DGT circuit 800 for 2^4 data points (i.e., 16 parallel data points arranged into 4 columns of 4 parallel data points) according to various example embodiments is shown in
While embodiments of the invention have been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.
This application is a 371 National Stage of International Application No. PCT/SG2021/050723, filed on 24 Nov. 2021, which claims priority to Singapore Patent Application No. 10202011663Q, filed on 24 Nov. 2020, the content of which being hereby incorporated by reference in its entirety for all purposes.