This application claims the benefit of Korean Patent Application No. 10-2021-0183937, filed Dec. 21, 2021, which is hereby incorporated by reference in its entirety into this application.
The present invention relates generally to a floating-point computation apparatus and method, and more particularly to a floating-point computation apparatus and method using Computing-in-Memory, which compute data represented in a floating-point format (hereinafter referred to as “floating-point data”) using Computing-in-Memory, thus improving the energy efficiency of a floating-point computation processor required for deep neural network training.
Since deep neural networks show the best performance in various signal-processing fields, such as image classification, image recognition, and speech recognition, their use is essentially required.
Because a process for training such a deep neural network must represent all values ranging from errors and gradients, each having a very small magnitude, to weights and neuron values, each having a relatively large magnitude, the use of floating-point computation, capable of representing a wide range of values, is required.
In particular, computation in a 16-bit brain floating-point format (bfloat16) composed of one sign bit, 8 exponent bits, and 7 mantissa bits has attracted attention as an operation having high energy efficiency while maintaining the training precision of a deep neural network (Reference Document 1).
Therefore, most commercial processors (e.g., TPUv2 from Google, Armv8-A from ARM, Nervana from Intel, etc.) support deep neural network training by utilizing Brain Floating-Point (BFP) multiplication and 32-bit floating-point (FP32) accumulation (Reference Document 2).
Also, for training a deep neural network, a deep neural network (DNN) accelerator must repeat processes of reading weights and neuron values stored in a memory, performing operations thereon, and then storing the results of the operation in the memory. Due thereto, a problem may arise in that the amount of power consumed by the memory is increased.
Meanwhile, recently, as a method for reducing power consumption by memory, Computing-in-Memory (CIM) has been highlighted. Computing-in-Memory (CIM) is characterized in that computation is performed in or near a memory, thus reducing the number of accesses to memory or enabling memory access with high energy efficiency.
Therefore, existing processors which utilize the characteristics of CIM can achieve the highest level of energy efficiency by reducing power consumption required for memory access (Reference Document 3 and Reference Document 4).
However, most existing CIM processors are limited in that they are specialized for fixed-point computation (operations) and do not support floating-point computation.
The reason for this is that fixed-point computation uniformly represents a given range using a predefined number of bits and the fixed position of a decimal point, whereas floating-point computation includes a sign bit, exponent bits, and mantissa bits, and dynamically represents a given range depending on the exponent. That is, in the case of floating-point computation, it is very difficult to simultaneously optimize an exponent computation and a mantissa computation using a CIM processor due to the heterogeneity thereof because the exponent computation requires only simple addition or subtraction and the mantissa computation additionally requires complicated operations such as multiplication, bit shifting, or normalization.
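For illustration only, the following Python sketch decomposes a single floating-point multiplication into the simple exponent addition and the more involved mantissa multiplication and normalization described above; the bit widths follow the bfloat16 format, and the function and constant names are hypothetical rather than part of any described apparatus.

BIAS = 127          # bfloat16 exponent bias (same as FP32)
MANT_BITS = 7       # explicit mantissa bits in bfloat16

def fp_multiply(sign_a, exp_a, mant_a, sign_b, exp_b, mant_b):
    """Multiply two bfloat16-style operands given as (sign, biased exponent,
    mantissa without the hidden leading 1); truncation is used instead of
    rounding for brevity."""
    sign = sign_a ^ sign_b                         # sign: one XOR
    exp = exp_a + exp_b - BIAS                     # exponent: simple addition
    # mantissa: multiply the significands with the hidden leading 1 restored
    sig = ((1 << MANT_BITS) | mant_a) * ((1 << MANT_BITS) | mant_b)
    # normalization: the significand product lies in [1, 4), so at most one right shift
    if sig >> (2 * MANT_BITS + 1):
        sig >>= 1
        exp += 1
    mant = (sig >> MANT_BITS) & ((1 << MANT_BITS) - 1)
    return sign, exp, mant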
In practice, when a floating-point multiplier-accumulator is implemented through a conventionally proposed CIM processor, a delay time ranging from several hundreds to several thousands of cycles is incurred, and thus the floating-point multiplier-accumulator is not suitable for the high-speed and energy-efficient computation required by a deep neural network (Reference Document 5).
That is, only a small number of operation logic circuits can be integrated into a CIM processor due to its limited area. Here, since existing CIM processors adopt a homogeneous floating-point CIM architecture, a great speed reduction occurs when a complicated mantissa computation is divided up and performed by the simple CIM logic. For example, since a conventional CIM processor performs one Multiply-and-Accumulate (MAC) operation at a processing speed that is at least 5000 times slower than a floating-point system performing brain floating-point multiplication and 32-bit floating-point accumulation, a problem arises in that it is impossible to utilize a conventional CIM processor in practice.
Recently, in edge devices for providing user-customized functions, the necessity of training deep neural networks has come to the fore, whereby it is essential to extend the range of application of CIM processors to encompass floating-point computations in order to implement a deep neural network (DNN) training processor having higher energy efficiency.
Accordingly, the present invention has been made keeping in mind the above problems occurring in the prior art, and an object of the present invention is to provide a floating-point computation apparatus and method using Computing-in-Memory (CIM), which calculate data represented in a floating-point format using Computing-in-Memory so that an exponent computation and a mantissa computation are separated from each other and so that only the exponent computation is performed by a CIM processor and the mantissa computation is performed by a mantissa processing unit, thus avoiding a processing delay occurring due to the use of Computing-in-Memory for the mantissa computation.
Another object of the present invention is to provide a floating-point computation apparatus and method using computing-in-memory, which can promptly perform floating-point computation by solving a processing delay occurring due to the need to use CIM for a mantissa computation, and can improve the energy efficiency of a deep-neural network (DNN) accelerator by reducing the amount of power consumed by memory.
A further object of the present invention is to provide a floating-point computation apparatus and method using computing-in-memory, which precharge only one local bitline by supporting in-memory AND/NOR operations during the operation of a CIM processor, and which reuse a charge if the previous value of a global bitline is identical to the current value of the global bitline by adopting a hierarchical bitline structure, thus minimizing precharging, with the result that the amount of power consumed in order to access an exponent stored in memory can be reduced.
Yet another object of the present invention is to provide a floating-point computation apparatus and method using computing-in-memory, which derive sparsity patterns of input neurons that are input to a CIM processor and a mantissa processing unit and thereafter skip computation on an input neuron, the sparsity pattern of which has a value of ‘0’, thus accelerating the entire DNN computation.
Still another object of the present invention is to provide a floating-point computation apparatus and method using computing-in-memory, which skip a normalization process in an intermediate stage occurring during a mantissa computation process and perform normalization only in a final stage, thereby reducing power consumption and speed reduction attributable to communication between a CIM processor for performing an exponent computation and a mantissa processing unit for performing a mantissa computation, and which shorten the principal path of the mantissa processing unit, thereby reducing the amount of space and power consumed by the mantissa processing unit without decreasing computation precision.
In accordance with an aspect of the present invention to accomplish the above objects, there is provided a floating-point computation apparatus for performing a Multiply-and-Accumulation (MAC) operation on a plurality of pieces of input neuron data represented in a floating-point format, the floating-point computation apparatus including a data preprocessing unit configured to separate and extract an exponent and a mantissa from each of the pieces of input neuron data; an exponent processing unit configured to perform Computing-in-Memory (CIM) on input neuron exponents, which are exponents separated and extracted from the pieces of input neuron data; and a mantissa processing unit configured to perform a high-speed computation on input neuron mantissas, which are mantissas separated and extracted from the pieces of input neuron data, wherein the exponent processing unit determines a mantissa shift size for a mantissa computation and transfers the mantissa shift size to the mantissa processing unit, and wherein the mantissa processing unit normalizes a result of the mantissa computation and thereafter transfers a normalization value generated as a result of normalization to the exponent processing unit.
In accordance with another aspect of the present invention to accomplish the above objects, there is provided a floating-point computation method for performing a multiply-and-accumulation operation on a plurality of pieces of input neuron data represented in a floating-point format using a floating-point computation apparatus that includes an exponent processing unit for an exponent computation in the floating-point format and a mantissa processing unit for a mantissa computation in the floating point format, the floating-point computation method including a data preprocessing operation of separating and extracting an exponent and a mantissa from each of the pieces of input neuron data; an exponent computation operation of performing, by the exponent processing unit, computing-in-memory (CIM) on input neuron exponents, which are exponents separated and extracted in the data preprocessing operation; and a mantissa computation operation of performing, by the mantissa processing unit, a high-speed computation on input neuron mantissas, which are mantissas separated and extracted in the data preprocessing operation, wherein the exponent computation operation includes determining a mantissa shift size for the mantissa computation and transferring the mantissa shift size to the mantissa processing unit, and wherein the mantissa computation operation includes normalizing a result of the mantissa computation and thereafter transferring a normalization value generated as a result of normalization to the exponent processing unit.
The above and other objects, features and advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Hereinafter, embodiments of the present invention will be described in detail with reference to the attached drawings. The present invention will be described in detail such that those skilled in the art to which the present invention pertains can easily practice the present invention. The present invention may be embodied in various different forms, and is not limited to the following embodiments. Meanwhile, in the drawings, parts irrelevant to the description of the invention will be omitted so as to clearly describe the present invention. It should be noted that the same or similar reference numerals are used to designate the same or similar components throughout the drawings. Descriptions of known configurations which allow those skilled in the art to easily understand the configurations will be omitted below.
In the specification and the accompanying claims, when a certain element is referred to as “comprising” or “including” a component, it does not preclude other components, but may further include other components unless the context clearly indicates otherwise.
The data preprocessing unit 100 separates and extracts an exponent part and a mantissa part from each of the pieces of input neuron data. That is, the data preprocessing unit 100 generates at least one input neuron data pair by pairing the plurality of pieces of input neuron data that are sequentially received for a Multiply-and-Accumulate (MAC) operation depending on the sequence thereof, and separates and extracts an exponent part and a mantissa part from each of arbitrary first and second input neuron data, forming the corresponding input neuron data pair, in each preset cycle. Also, the data preprocessing unit 100 transfers the separated and extracted exponent parts (hereinafter referred to as ‘first and second input neuron exponents’) to the exponent processing unit 200, and transfers the separated and extracted mantissa parts (hereinafter referred to as ‘first and second input neuron mantissas’) to the mantissa processing unit 300.
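A minimal software sketch of this preprocessing step is given below, assuming that the input neuron data arrive as raw 16-bit bfloat16 words; the helper names are hypothetical and serve only to illustrate the pairing and the separation of exponent and mantissa parts.

from typing import List, Tuple

def split_bfloat16(word: int) -> Tuple[int, int, int]:
    """Separate one bfloat16 word into its sign, exponent, and mantissa fields."""
    sign = (word >> 15) & 0x1
    exponent = (word >> 7) & 0xFF        # 8 exponent bits
    mantissa = word & 0x7F               # 7 mantissa bits
    return sign, exponent, mantissa

def preprocess(stream: List[int]):
    """Pair sequentially received inputs, then route exponent parts to the
    exponent processing unit and mantissa parts to the mantissa processing unit."""
    exponent_pairs, mantissa_pairs = [], []
    for first, second in zip(stream[0::2], stream[1::2]):
        s1, e1, m1 = split_bfloat16(first)
        s2, e2, m2 = split_bfloat16(second)
        exponent_pairs.append((e1, e2))              # first/second input neuron exponents
        mantissa_pairs.append(((s1, m1), (s2, m2)))  # first/second input neuron mantissas
    return exponent_pairs, mantissa_pairs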
The exponent processing unit 200 calculates the exponents separated and extracted from the pieces of input neuron data (hereinafter referred to as ‘input neuron exponents’) so that Computing-in-Memory (CIM) is performed on the first and second input neuron exponents transferred from the data preprocessing unit 100.
The mantissa processing unit 300 calculates the mantissas separated and extracted from the pieces of input neuron data (hereinafter referred to as ‘input neuron mantissas’) so that high-speed digital computation is performed on the first and second input neuron mantissas transferred from the data preprocessing unit 100.
Meanwhile, the exponent processing unit 200 determines a mantissa shift size required for the mantissa computation and transfers the mantissa shift size to the mantissa processing unit 300. The mantissa processing unit 300 must normalize the results of the mantissa computation and transmit a normalization value generated as a result of the normalization to the exponent processing unit 200.
Further, the exponent processing unit 200 determines the final exponent value based on the normalization value, receives the results of the mantissa computation from the mantissa processing unit 300, and outputs the final computation result.
As illustrated in
The input neuron exponent memory 210 stores input neuron exponents that are transferred from the data preprocessing unit 100 in each preset operation cycle. Here, the input neuron exponent memory 210 sequentially stores pairs of the first and second input neuron exponents transferred from the data preprocessing unit 100.
Each of the one or more exponent computation memories (exponent computation memory #1 220 and exponent computation memory #2 240) sequentially performs Computing-in-Memory (CIM) on the first and second input neuron exponent pairs received from the input neuron exponent memory 210, wherein computing-in-memory is performed in a bitwise manner and the results thereof are output. Although, in
Further, each of the one or more exponent computation memories (exponent computation memory #1 220 and exponent computation memory #2 240) may be any one of a weight exponent computation memory, which stores the exponent of a weight generated in a DNN training process (hereinafter referred to as a ‘weight exponent’) and performs computing-in-memory on the corresponding input neuron exponent and the weight exponent, and an output neuron exponent computation memory, which stores the exponent of output neuron data generated in the DNN training process (hereinafter referred to as an ‘output neuron exponent’) and performs computing-in-memory on the input neuron exponent and the output neuron exponent.
The exponent peripheral circuit 230 processes the results of computing-in-memory transferred from the exponent computation memories 220 and 240 and then outputs the final results. That is, the exponent peripheral circuit 230 sequentially calculates the sums of the first and second input neuron exponent pairs transferred from the exponent computation memory 220 or 240, sequentially compares the sums of the first and second input neuron exponent pairs with each other, determines the difference between the sums to be the mantissa shift size, and updates and stores a maximum exponent value.
For example, when an input neuron exponent pair (A1, B1) is input to the exponent computation memory 220 at an arbitrary time T and another input neuron exponent pair (A2, B2) is sequentially input to the exponent computation memory 220 at time (T+1), corresponding to the subsequent operation cycle, the exponent computation memory 220 sequentially calculates the sum S1 of the input neuron exponent pair (A1, B1) and the sum S2 of the input neuron exponent pair (A2, B2) and transfers the calculated sums S1 and S2 to the exponent peripheral circuit 230. The exponent peripheral circuit 230 compares the values S1 and S2 with each other, determines the difference between the values S1 and S2 to be the mantissa shift size, and updates and stores, as the maximum exponent value to be used as the comparison value at time (T+2) corresponding to the subsequent operation cycle, the larger of the two values S1 and S2.
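The comparison behavior described in this example can be modeled in software as follows; this is a behavioral sketch with hypothetical names, not a description of the circuit itself, and the values S1 = 10 and S2 = 7 are chosen arbitrarily.

class ExponentPeripheral:
    """Keeps the running maximum exponent sum and reports, for each new sum,
    how far the corresponding mantissa must be shifted to align with it."""

    def __init__(self):
        self.max_exponent = None

    def step(self, exp_sum: int) -> int:
        if self.max_exponent is None:                # first sum, e.g., S1
            self.max_exponent = exp_sum
            return 0
        shift = abs(exp_sum - self.max_exponent)     # difference = mantissa shift size
        self.max_exponent = max(self.max_exponent, exp_sum)
        return shift

peripheral = ExponentPeripheral()
peripheral.step(10)          # S1 = 10 at time T: shift 0, maximum becomes 10
print(peripheral.step(7))    # S2 = 7 at time T+1: shift 3, maximum stays 10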
Meanwhile, the exponent peripheral circuit 230 may be shared by the one or more exponent computation memories (exponent computation memory #1 220 and exponent computation memory #2 240).
The CIM local arrays 221 each include a plurality of memory cells, and are arranged in an a×b arrangement to perform local CIM. The architecture of the CIM local arrays 221 is illustrated in
The normal input/output interface 222 provides an interface for reading/writing data from/to each of the plurality of CIM local arrays. Here, the normal input/output interface 222 provides an interface for inputting input neuron exponents to be stored in the CIM local arrays 221 for Computing-in-Memory (CIM).
The global bitlines/global bitline bars 223 form paths for moving respective results of computing-in-memory by the plurality of CIM local arrays 221 to the exponent peripheral circuit 230. In this way, in order to move the results of computing-in-memory through the global bitlines/global bitline bars 223, a large amount of energy for charging the global bitlines/global bitline bars 223 is required, but the amount of energy may be reduced by reusing global bitline charge, which will be described later.
The wordline driver 224 generates a wordline driving signal to be transferred to the CIM local arrays 221. Here, the wordline driver 224 generates the wordline driving signal with reference to an input weight index. That is, the wordline driver 224 generates the wordline driving signal for selecting an operating memory cell from among the plurality of memory cells included in each CIM local array 221 in such a way as to generate a high wordline voltage in a write mode and a low wordline voltage in a read mode.
In particular, the wordline driver 224 must output the low wordline voltage VWL at a suitably low value in order to operate the memory cells in the read mode. The reason for this is that, when the low wordline voltage VWL is excessively low, a second input neuron exponent (e.g., a weight exponent) stored in the memory cells of the CIM local arrays 221 for computing-in-memory is not reflected, and when the low wordline voltage VWL is excessively high, a first input neuron exponent, precharged in local bitlines and local bitline bars as will be described later, is not reflected. Therefore, it is preferable that the wordline driver 224 determine the low wordline voltage VWL to be within the range represented by the following Equation (1) and then output it. Here, the second input neuron exponent is one of the operands for computing-in-memory.
Vth≤VWL≤VNML+Vth  (1)
Here, VNML is a low noise margin for first and second drivers, which will be described later, and Vth is the threshold voltage of an NMOS access transistor in each memory cell.
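For reference, the following short sketch checks the condition of Equation (1) and picks a value inside the permitted range; the midpoint policy and the example voltages are illustrative assumptions only.

def choose_low_wordline_voltage(v_th: float, v_nml: float) -> float:
    """Return a low wordline voltage satisfying Equation (1); the midpoint is
    only an illustrative policy, not a prescribed design value."""
    lower, upper = v_th, v_nml + v_th        # Vth <= VWL <= VNML + Vth
    v_wl = (lower + upper) / 2.0
    assert lower <= v_wl <= upper
    return v_wl

print(choose_low_wordline_voltage(v_th=0.4, v_nml=0.2))   # 0.5 V, for example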
The input neuron decoder 225 decodes the exponent value of an input neuron or an error neuron. In particular, the input neuron decoder 225 analyzes the first and second input neuron exponents, which are the targets of computing-in-memory, and performs control such that operations are performed by selecting the bitline in which the first input neuron exponent is to be charged and the memory cell in which the second input neuron exponent is to be stored.
A first input neuron exponent, which is the other one of the operands for computing-in-memory, is precharged in the local bitline/local bitline bar 11.
The VDD precharger 12 precharges the local bitline/the local bitline bar 11 based on the bit value of the first input neuron exponent. Here, the VDD precharger 12 receives the bit of the first input neuron exponent and precharges the local bitline/local bitline bar 11 in such a way that, when the corresponding bit is ‘0’, the local bitline is precharged to ‘0’ and the local bitline bar is precharged to ‘1’, and when the corresponding bit is ‘1’, the local bitline is precharged to ‘1’ and the local bitline bar is precharged to ‘0’. In a normal data read mode, both the local bitline and the local bitline bar must be precharged to ‘1’, but the present invention is advantageous in that, as described above, only one of the two bitlines needs to be precharged depending on the bit value of the input neuron exponent, thus reducing power consumption.
Each memory cell 13 stores the second input neuron exponent in a bitwise manner, performs computing-in-memory on the second input neuron exponent and the first input neuron exponent precharged in the local bitline/local bitline bar 11, and then determines the bit values of the local bitline/local bitline bar 11. For this operation, each memory cell 13 may be implemented in a 6T SRAM bit cell structure using six transistors.
Further, each memory cell 13 may be operated in one of a read mode and a write mode in response to the wordline driving signal. For example, in the write mode, the memory cell 13 stores the second input neuron exponent, which is transferred through the input/output interface 222, in a bitwise manner, and in the read mode, the memory cell performs computing-in-memory on the first input neuron exponent, which is precharged in the local bitline/the local bitline bar 11, and the second input neuron exponent, which is stored in a bitwise manner, and then determines the bit values of the local bitline/the local bitline bar 11. In this case, the value of the local bitline is determined by performing an AND operation on the first input neuron exponent and the second input neuron exponent in a bitwise manner, and the value of the local bitline bar is determined by performing a NOR operation on the first input neuron exponent and the second input neuron exponent in a bitwise manner. A truth table indicating the results of such computing-in-memory is illustrated in
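The per-bit behavior described in this paragraph can be modeled in software as follows; the sketch collapses the precharge and read phases into two hypothetical functions and verifies the AND/NOR results over all bit combinations.

def precharge(a_bit: int):
    """VDD precharger 12: charge only one of the two lines, based on the bit."""
    local_bitline = 1 if a_bit else 0
    local_bitline_bar = 0 if a_bit else 1
    return local_bitline, local_bitline_bar

def cim_read(a_bit: int, stored_bit: int):
    """One read cycle: the stored cell bit either keeps or discharges each line,
    so the bitline resolves to AND and the bitline bar resolves to NOR."""
    lbl, lblb = precharge(a_bit)
    local_bitline = lbl & stored_bit             # a AND b
    local_bitline_bar = lblb & (1 - stored_bit)  # a NOR b
    return local_bitline, local_bitline_bar

# Truth check over all bit combinations: (AND, NOR) for (0,0), (0,1), (1,0), (1,1)
assert [cim_read(a, b) for a in (0, 1) for b in (0, 1)] == [(0, 1), (0, 0), (0, 0), (1, 0)]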
Each of the first driver 14 and the second driver 15 drives the bit values of the local bitline/local bitline bar 11 to a global bitline/global bitline bar 223 in response to a global bitline enable signal received from outside.
In this way, the exponent computation memory 220 according to the present invention adopts a hierarchical bitline structure, and thus the first and second drivers 14 and 15 charge and discharge the global bitline/global bitline bar 223 based on the values of the local bitline/local bitline bar 11. That is, in the exponent computation memory 220 according to the present invention, computation between the plurality of CIM local arrays 221 is determined depending on a previous global bitline value and a current global bitline value, and a truth table for such global bitline computation is exemplified in
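The charge reuse implied by this hierarchical structure can be modeled as skipping the global-bitline transition whenever the newly driven value equals the previously driven one; the following behavioral sketch, with hypothetical names, counts transitions as a proxy for charging energy.

class GlobalBitline:
    """The global line is re-driven only when its value actually changes, so
    repeated identical values reuse the existing charge."""

    def __init__(self):
        self.value = 0            # previously driven value
        self.transitions = 0      # proxy for charge/discharge energy spent

    def drive(self, new_value: int) -> int:
        if new_value != self.value:
            self.transitions += 1
            self.value = new_value
        return self.value

gbl = GlobalBitline()
for v in [1, 1, 1, 0, 0, 1]:
    gbl.drive(v)
print(gbl.transitions)            # 3 transitions instead of 6 drive events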
Meanwhile, each of the CIM local arrays 221 sequentially performs a precharge process using the VDD precharger 12, a computing-in-memory process on each of the plurality of memory cells 13, and a driving process using each of the first and second drivers 14 and 15, and adopts a computation pipelining structure for operating different CIM local arrays in each cycle in order to prevent an operating speed from decreasing due to the sequential performance of the processes. That is, each CIM local array 221 adopts the pipelining structure in which the precharge process, the Computing-in-Memory (CIM) process, and the driving process can be pipelined to overlap each other between adjacent CIM local arrays. Therefore, a precharge process for an arbitrary n-th CIM local array, a CIM process for an (n+1)-th CIM local array, and a driving process for an (n+2)-th CIM local array may be pipelined to overlap each other. The pipelined operation between the CIM local arrays is illustrated in
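The overlapped schedule described above can be written down as a simple three-stage pipeline; the sketch below illustrates only the timing relationship between neighboring CIM local arrays, not the circuit.

STAGES = ["precharge", "computing-in-memory", "drive"]

def pipeline_schedule(num_arrays: int):
    """Yield (cycle, local array index, stage) for the overlapped schedule:
    a different local array occupies each stage in every cycle."""
    for cycle in range(num_arrays + len(STAGES) - 1):
        for stage_index, stage in enumerate(STAGES):
            array = cycle - stage_index
            if 0 <= array < num_arrays:
                yield cycle, array, stage

for cycle, array, stage in pipeline_schedule(4):
    print(f"cycle {cycle}: local array {array} -> {stage}")
# In cycle 2, for example, array 2 is precharged, array 1 computes in memory,
# and array 0 drives its result onto the global bitline.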
When both Computing-in-Memory (CIM) and the exponent adder 20 are activated, the exponent adder 20 receives, as inputs, the results of CIM transferred through the global bitline/global bitline bar 223, performs addition on the exponents, and calculates the sums of first and second input neuron exponent pairs.
The exponent comparator 30 sequentially compares the sums of the first and second input neuron exponent pairs received from the exponent adder 20 with each other, determines the difference between the sums to be a mantissa shift size, and updates and stores a maximum exponent value based on the comparison results. Here, the process for determining the mantissa shift size and updating and storing the maximum exponent value has been described above with reference to
For this operation, the exponent comparator 30 may include a floating-point exception handler 31 for receiving the sums of first and second input neuron exponent pairs from the exponent adder 20 and performing exception handling in floating-point multiplication; a register 32 for storing the maximum exponent value, which is the maximum value, among the sums of first and second input neuron exponent pairs calculated during a period ranging to a previous operation cycle; a subtractor 33 for obtaining the difference between the sum of the first and second input neuron exponent pairs, output from the floating-point exception handler 31, and the maximum value stored in the register 32; and a comparator 34 for updating the maximum exponent value stored in the register 32 based on the results of subtraction by the subtractor 33, determining the mantissa shift size, and transferring the maximum exponent value and the mantissa shift size to the mantissa processing unit 300 illustrated in
Here, the register 32 may determine a final exponent value by updating the maximum exponent value based on a normalization value received from the mantissa processing unit 300. That is, the register 32 may update the maximum exponent value based on the normalization value received as a result of the normalization performed only once in the last mantissa computation because preliminary normalization processing, which will be described later, is applied to the intermediate stage of the mantissa processing unit 300.
The input neuron mantissa memory 310 stores input neuron mantissas that are transferred from the data preprocessing unit 100 in each preset operation cycle. Here, the input neuron mantissa memory 310 sequentially stores pairs of first and second input neuron mantissas transferred from the data preprocessing unit 100.
The weight mantissa memory 320 separates only a mantissa part of a weight generated in a process of training the deep neural network (hereinafter referred to as a ‘weight mantissa part’), and separately stores the weight mantissa part.
The output neuron mantissa memory 330 separates only a mantissa part of output neuron data (hereinafter referred to as an ‘output neuron mantissa part’) generated in the process of training the deep neural network, and separately stores the output neuron mantissa part.
The plurality of mantissa computation units 340 are connected in parallel to each other, and are configured to sequentially calculate the first and second input neuron mantissa pairs and to normalize final calculation results, wherein each of the mantissa computation units 340 may calculate mantissas, received from at least one of the input neuron mantissa memory 310, the weight mantissa memory 320, and the output neuron mantissa memory 330, at high speed.
Also, each of the mantissa computation units 340 transfers the normalization value, generated as a result of normalization, to the exponent processing unit 200, thus allowing the exponent processing unit 200 to determine a final exponent value.
Here, each of the mantissa computation units 340 performs a normalization process once only after the final mantissa computation has been performed, rather than performing normalization every time addition is performed. For this, in a mantissa computation in an intermediate stage, the mantissa computation unit 340 replaces the normalization process with preliminary normalization which stores only a mantissa overflow and an accumulated value of addition results. The reason for this is to improve processing speed and reduce power consumption by reducing traffic between the exponent processing unit 200 and the mantissa processing unit 300 and simplifying a complicated normalization process.
For example, in the case of a multiply-and-accumulate operation in which 20 operands are calculated in such a way that two respective operands are paired and multiplied and the results of multiplication are accumulated, a normalization process for nine pairs, among 10 pairs, is replaced with a preliminary normalization scheme in which, after mantissa addition, accumulated mantissas are represented by a mantissa overflow counter and a mantissa accumulation value, and exponents are represented by continuously storing the maximum values obtained as results of comparisons, and the last pair, that is, the tenth pair, is calculated such that, only when the last pair is added to an existing accumulated value, a mantissa normalization and rounding-off operation and an exponent update operation depending thereon are performed once.
In this case, the precision of the preliminary normalization scheme is determined by the limited range of representation of the accumulated mantissa values and the mantissa overflow count values, and this range is in turn determined by the assigned bit widths. If the bit width of the overflow counter is insufficient, an addition result that exceeds the range representable with the current exponent (overflow) cannot be represented, thus greatly deteriorating precision. Meanwhile, if the bit width assigned to the accumulated mantissa value is insufficient, an addition result that is much smaller than the values representable with the current exponent (underflow) has a small portion continuously discarded, thus deteriorating the precision of computation. In the case of a 32-bit floating-point accumulation operation, when an accumulated mantissa value of 21 or more bits and an overflow counter value of three or more bits are used, preliminary normalization may be performed without causing a computation error.
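A behavioral sketch of this preliminary normalization scheme is shown below; the 21-bit accumulated mantissa and 3-bit overflow counter follow the widths mentioned above, while the class and attribute names are hypothetical.

ACC_BITS = 21        # accumulated mantissa width (from the text above)
OVF_BITS = 3         # overflow counter width (from the text above)

class PreliminaryAccumulator:
    """Intermediate additions only update an accumulated mantissa and an
    overflow counter; full normalization and rounding happen once at the end."""

    def __init__(self):
        self.acc = 0         # accumulated mantissa value
        self.overflow = 0    # mantissa overflow counter

    def add(self, aligned_mantissa: int):
        self.acc += aligned_mantissa
        while self.acc >> ACC_BITS:              # count overflows instead of
            self.acc -= 1 << ACC_BITS            # renormalizing every cycle
            self.overflow += 1
        assert self.overflow < (1 << OVF_BITS), "overflow counter saturated"

    def finalize(self) -> int:
        """Recombine counter and accumulator for the single final normalization."""
        return (self.overflow << ACC_BITS) | self.acc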
The multiplier 341 performs multiplication on pairs of first and second input neuron mantissas, and stores the results of the multiplication.
The shifter 342 performs shifting on the multiplication results based on the mantissa shift size transferred from the exponent processing unit 200.
The mantissa adder 343 performs addition on the one or more shifted multiplication results.
The overflow counter 344 counts a mantissa overflow occurring as a result of the addition.
The register 345 accumulates and stores the results of the addition.
The normalization processor 346 normalizes the results of the mantissa computation. Here, the normalization processor 346 is operated only once, after the final mantissa computation has been performed.
That is, the mantissa computation unit 340 sequentially performs the mantissa computation on all of the first and second input neuron mantissa pairs stored in the input neuron mantissa memory 310, and performs a normalization process once only for the final mantissa computation result.
For this operation, the overflow counter 344 and the register 345 transfer the mantissa overflow value and the accumulated stored value of the addition results, which are generated in the intermediate operation stage of the mantissa computation, to the shifter 342 so as to perform an operation in the subsequent stage, and transfer a mantissa overflow value and the addition result, which are generated during a mantissa computation in the final stage, to the normalization processor 346.
Due thereto, the normalization processor 346 performs normalization only on the mantissa overflow value and the addition result, which are generated during the mantissa computation in the final stage.
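Tying the components 341 to 346 together, one possible software model of a single mantissa computation unit pass is sketched below; it parallels the accumulator sketch given earlier, omits sign handling for brevity, and is an illustration rather than a description of the actual datapath.

MANT_BITS = 7        # bfloat16 mantissa width, assumed for illustration
ACC_BITS = 21        # accumulated mantissa width, as discussed above

def mantissa_mac(mantissa_pairs, shift_sizes):
    """Process every pair with the multiplier, shifter, adder, overflow counter,
    and register, then normalize once at the end (signs omitted for brevity)."""
    acc, overflow = 0, 0
    for (m1, m2), shift in zip(mantissa_pairs, shift_sizes):
        product = ((1 << MANT_BITS) | m1) * ((1 << MANT_BITS) | m2)   # multiplier 341
        aligned = product >> shift                                    # shifter 342
        acc += aligned                                                # mantissa adder 343
        overflow += acc >> ACC_BITS                                   # overflow counter 344
        acc &= (1 << ACC_BITS) - 1                                    # register 345
    total = (overflow << ACC_BITS) | acc
    # normalization processor 346: one shift back into 1.xxxx form at the end
    norm_shift = max(0, total.bit_length() - (2 * MANT_BITS + 1))
    return total >> norm_shift, norm_shift      # normalized mantissa, normalization value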
First, at step S100, the data preprocessing unit 100 performs preprocessing for a multiply-and-accumulation (MAC) operation on a plurality of pieces of input neuron data represented in a floating-point format. That is, at step S100, the data preprocessing unit 100 separates and extracts exponents and mantissas from the pieces of input neuron data.
For this, at steps S110 and S120, the data preprocessing unit 100 generates at least one input neuron data pair by pairing two or more pieces of input neuron data that are sequentially input for the MAC operation depending on the sequence thereof.
At step S130, the data preprocessing unit 100 separates and extracts an exponent and a mantissa from each of arbitrary first and second input neuron data, forming the corresponding input neuron data pair, in each preset operation cycle.
At step S140, the data preprocessing unit 100 transfers the separated and extracted exponents (hereinafter referred to as ‘first and second input neuron exponents’) to the exponent processing unit 200, and transfers the separated and extracted mantissas (hereinafter referred to as ‘first and second input neuron mantissas’) to the mantissa processing unit 300.
Here, since the detailed operation of the data preprocessing unit 100 for performing step S100 (data preprocessing) is identical to that described above with reference to
At step S200, exponents and mantissas preprocessed at step S100 are separated and stored. That is, at step S200, the exponent processing unit 200 and the mantissa processing unit 300 store the exponents and the mantissas, respectively, received from the data preprocessing unit 100.
At step S300, the floating-point computation apparatus checks the type of computation, and proceeds to step S400 when the type of computation is an exponent computation, otherwise proceeds to step S500.
At step S400, the exponent processing unit 200 performs Computing-in-Memory (CIM) on the exponents separated and extracted at step S100 (hereinafter referred to as ‘input neuron exponents’).
For this operation, at step S410, the exponent processing unit 200 sequentially processes the first and second input neuron exponent pairs transferred in each operation cycle, performing computing-in-memory in a bitwise manner to calculate the sums of the first and second input neuron exponent pairs.
At step S420, the exponent processing unit 200 sequentially compares the sums of the first and second input neuron exponent pairs, which are sequentially calculated at step S410, with each other, and determines the difference between the sums to be a mantissa shift size. At step S430, the exponent processing unit 200 transfers the determined mantissa shift size to the mantissa processing unit 300.
At step S440, the exponent processing unit 200 determines a larger one of the sums of the first and second input neuron exponent pairs as a result of the sequential comparison to be the maximum exponent value.
At step S450, the exponent processing unit 200 determines whether a normalization value is received from the mantissa processing unit 300, and repeats steps S410 to S440 until the normalization value is received.
Meanwhile, at step S460, when a normalization value is received from the mantissa processing unit 300, the exponent processing unit 200 determines a final exponent based on the received normalization value at step S470.
Here, since the detailed operation of the exponent processing unit 200 for performing step S400 (exponent computation) is identical to that described above with reference to
At step S500, the mantissa processing unit 300 performs high-speed computation on the mantissas separated and extracted at step S100 (hereinafter referred to as ‘input neuron mantissas’).
For this operation, at step S510, the mantissa processing unit 300 sequentially processes the first and second input neuron mantissa pairs generated in each operation cycle, performing multiplication on the first and second input neuron mantissa pairs.
At step S520, when a mantissa shift size is received from the exponent processing unit 200, the mantissa processing unit 300 performs shifting on the results of the multiplication, calculated at step S510, at step S530. That is, the decimal point of the mantissa is shifted by the mantissa shift size.
At step S540, the mantissa processing unit 300 performs addition on the one or more shifted multiplication results.
At step S550, the mantissa processing unit 300 counts a mantissa overflow value generated as a result of the addition at step S540, and at step S560, the mantissa processing unit 300 accumulates and stores the results of the addition at step S540.
At step S570, the mantissa processing unit 300 performs a final operation determination step of determining whether a mantissa computation has been finally performed on all of first and second input neuron mantissa pairs received at step S140, and repeats steps S510 to S560 until the final operation step is completed, thus sequentially performing the mantissa computation on all of the first and second input neuron mantissa pairs received at step S140.
Here, at step S530, preliminary normalization may be performed on the multiplication results of step S510 in the intermediate operation stages (i.e., the operation stages before the final operation stage), based on the mantissa overflow value counted at step S550 and the accumulated and stored addition results of step S560.
Meanwhile, when the final operation is determined at step S570, the mantissa processing unit 300 normalizes the mantissa computation result in the final computation stage at step S580, and the mantissa processing unit 300 outputs a normalization value, generated as a result of the normalization, to the exponent processing unit 200 at step S590.
Here, since the detailed operation of the mantissa processing unit 300 for performing step S500 (mantissa computation) is identical to that described above with reference to
As described above, the present invention is advantageous in that the conventional inefficient repetitive computation process attributable to performing Computing-in-Memory (CIM) on a mantissa part may be eliminated by separating the exponent part and the mantissa part from each other for computation, thus achieving a delay time of less than 2 cycles, with the result that processing speed may be remarkably improved compared to a conventional computation architecture having a delay time of 5000 or more cycles.
Further, the present invention is characterized in that it can not only reduce power consumption and speed reduction attributable to communication between an exponent processing unit and a mantissa processing unit by replacing a normalization process with preliminary normalization, but can also decrease consumption of space and power by mantissa computation units without decreasing computation precision by shortening the principal path of the mantissa computation units. For example, in the computation apparatus for DNN training, space and power consumption may be reduced on average by 10 to 20%, and in particular, in a system supporting brain floating-point multiplication and 32-bit floating-point accumulation for a ResNet-18 neural network, total power consumption by the computation units may be reduced by 14.4%, and total space consumption by the computation units may be reduced by 11.7%.
Further, the floating-point computation apparatus according to the present invention is characterized in that the amount of power required in order to access the exponent stored in memory may be greatly reduced owing to the reuse of charge by minimizing precharging of local bitlines and global bitlines. For example, in the computation apparatus for DNN training, total power consumption by memory may be reduced on average by 40 to 50%, and in particular, in a system supporting brain floating-point multiplication and 32-bit floating-point accumulation for the ResNet-18 neural network, total power consumption by memory may be reduced by 46.4%.
As described above, a floating-point computation apparatus and method using Computing-in-Memory (CIM) according to the present invention are advantageous in that numbers represented in a floating-point format may be calculated using CIM such that an exponent computation and a mantissa computation are separated from each other and such that only the exponent computation is performed by a CIM processor and the mantissa computation is performed by a mantissa processing unit, thus avoiding a processing delay occurring due to the use of Computing-in-Memory for the mantissa computation.
Further, the present invention is advantageous in that floating-point computation may be promptly performed by solving a processing delay occurring due to the need to use CIM for a mantissa computation, and the energy efficiency of a deep-neural network (DNN) accelerator may be improved by reducing the amount of power consumed by memory.
Furthermore, the present invention is advantageous in that only one local bitline may be precharged by supporting in-memory AND/NOR operations during the operation of a CIM processor, and charge may be reused if the previous value of a global bitline is identical to the current value of the global bitline by adopting a hierarchical bitline structure, thus minimizing precharging, with the result that the amount of power consumed in order to access an exponent stored in memory may be greatly reduced.
Furthermore, the present invention is advantageous in that sparsity patterns of input neurons that are input to a CIM processor and a mantissa processing unit may be derived, and thereafter computation on an input neuron, the sparsity pattern of which has a value of ‘0’, may be skipped, thus accelerating the entire DNN computation.
Furthermore, the present invention is advantageous in that a normalization process in an intermediate stage occurring during a mantissa computation process is skipped and normalization is performed only in a final stage, thereby reducing power consumption and speed reduction attributable to communication between a CIM processor for performing an exponent computation and a mantissa processing unit for performing a mantissa computation, and in that the principal path of the mantissa processing unit is shortened, thereby reducing consumption of space and power by the mantissa processing unit without decreasing computation precision.
Although the preferred embodiments of the present invention have been disclosed in the foregoing descriptions, those skilled in the art will appreciate that the present invention is not limited to the embodiments, and that various modifications, additions and substitutions are possible, without departing from the scope and spirit of the invention as disclosed in the accompanying claims.