The invention relates to applications of floating-point operations, and more particularly, to a floating-point calculation method and an associated arithmetic unit.
With the rapid development of machine learning techniques, the workload of floating-point operations keeps growing. Hence, how to compress the large amount of floating-point data to increase operation speed and reduce power consumption has become an active topic of study in the field. Existing floating-point techniques mostly adopt uniform coding and operations, which leads to overdesign: storage space is wasted on unnecessary data, which in turn increases transmission time and power consumption.
In view of the above, there is a need for a novel floating-point calculation method and an associated hardware architecture to solve the above-mentioned problem encountered in related art techniques.
According to the above requirements, one of the purposes of the present invention is to provide an efficient floating-point coding and calculation method to solve the problems encountered in conventional floating-point operations, without greatly increasing the cost, and thereby improve the operation speed and reduce the power consumption.
An embodiment of the present invention provides a floating-point calculation method applicable to multiplication between a first register and a second register. The first register stores a first floating point number, and the second register stores a second floating point number. The first register comprises a first exponent bit(s) storing a first exponent, and a first mantissa bit(s) storing a first mantissa. The second register comprises a second exponent bit(s) storing a second exponent, and a second mantissa bit(s) storing a second mantissa. The method comprises using an arithmetic unit to perform following steps: comparing the first exponent with an exponent threshold, wherein when the first exponent is not smaller than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result; and when the first exponent is smaller than the exponent threshold, the first mantissa is multiplied by the second mantissa after at least one bit of the first mantissa is discarded, to generate the mantissa operation result; adding the first exponent to the second exponent to generate an exponent operation result; and generating a calculated floating point number according to the mantissa operation result and the exponent operation result.
In addition to the above method, another embodiment of the present invention provides an arithmetic unit coupled to a first register and a second register. The first register stores a first floating point number, and the second register stores a second floating point number. The first register comprises a first exponent bit(s) storing a first exponent, and a first mantissa bit(s) storing a first mantissa. The second register comprises a second exponent bit(s) storing a second exponent, and a second mantissa bit(s) storing a second mantissa. When performing multiplication between the first register and the second register, the arithmetic unit performs the following steps: comparing the first exponent with an exponent threshold, wherein when the first exponent is not smaller than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result; and when the first exponent is smaller than the exponent threshold, the first mantissa is multiplied by the second mantissa after at least one bit of the first mantissa is discarded, to generate the mantissa operation result; adding the first exponent to the second exponent to generate an exponent operation result; and generating a calculated floating point number according to the mantissa operation result and the exponent operation result.
Another embodiment of the present invention provides an arithmetic device comprising a first register, a second register and an arithmetic unit. The arithmetic unit is coupled to the first register and the second register, and the first register stores a first floating point number and the second register stores a second floating point number. The first register comprises a first exponent bit(s) storing a first exponent, and a first mantissa bit(s) storing a first mantissa. The second register comprises a second exponent bit(s) storing a second exponent, and a second mantissa bit(s) storing a second mantissa; wherein when performing multiplication between the first register and the second register, the arithmetic unit performs following steps: comparing the first exponent with an exponent threshold, wherein when the first exponent is not smaller than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result; when the first exponent is smaller than the exponent threshold, the first mantissa is multiplied by the second mantissa after at least one bit of the first mantissa is discarded, to generate the mantissa operation result; adding the first exponent to the second exponent to generate an exponent operation result; and generating a calculated floating point number according to the mantissa operation result and the exponent operation result.
Selectively, according to an embodiment of the present invention, the exponent threshold is stored in a third register, and the arithmetic unit accesses the third register when performing multiplication between the first register and the second register.
Selectively, according to an embodiment of the present invention, the first register further comprises a first sign bit storing a first sign, the second register further comprises a second sign bit storing a second sign. The floating-point calculation method further comprises: performing an XOR operation upon the first sign and the second sign to generate a sign operation result; and generating the calculated floating point number according to the mantissa operation result, the sign operation result and the exponent operation result.
Selectively, according to an embodiment of the present invention, when the first exponent is smaller than the exponent threshold, at least one bit of the first mantissa is temporarily stored but not used in arithmetic operations.
Selectively, according to an embodiment of the present invention, the exponent threshold is dynamically adjustable.
Selectively, according to an embodiment of the present invention, the exponent threshold is dynamically adjusted according to temperature of the arithmetic unit and/or types of tasks to be processed by the arithmetic unit.
Selectively, according to an embodiment of the present invention, the exponent threshold is within a dynamically adjustable range, and the arithmetic unit starts training with an exponent threshold with a value of 1. The arithmetic unit determines whether a criterion is met, namely whether an operation precision is higher than a precision threshold. If the criterion is met, the value of the exponent threshold is increased until the operation precision is no longer higher than the precision threshold, and the dynamically adjustable range is represented by the exponent thresholds that meet the criterion.
Selectively, according to an embodiment of the present invention, the first register is coupled to a memory arranged to store a first exponent. When the first exponent is smaller than the exponent threshold, at least one bit of the first mantissa is discarded without being stored in the memory.
Selectively, according to an embodiment of the present invention, when the first exponent is smaller than the exponent threshold, at least one bit of the first mantissa is in a Don’t Care state.
Selectively, according to an embodiment of the present invention, when the first exponent is smaller than the exponent threshold, the first floating point number is decoded into $(-1)^{\text{Sign1}} \times 2^{\text{Exponent1}}$, where Sign1 denotes the first sign, and Exponent1 denotes the first exponent.
Selectively, according to an embodiment of the present invention, when the second exponent is smaller than the exponent threshold, the second floating point number is decoded into $(-1)^{\text{Sign2}} \times 2^{\text{Exponent2}}$, where Sign2 denotes the second sign, and Exponent2 denotes the second exponent.
Selectively, according to an embodiment of the present invention, the arithmetic unit is further used to access a memory that is arranged to store a plurality of groups of batch normalization coefficients corresponding to a plurality of candidate thresholds respectively, and the exponent threshold is selected from the candidate thresholds. A batch normalization coefficient is a kind of coefficient for adjusting the average and standard deviation of numerical values in artificial intelligence (AI) operations. Generally, a piece of numerical data of a feature map corresponds to a set of specific batch normalization coefficients. According to this embodiment, the operation process of a piece of numerical data of a feature map may vary with the exponent threshold, and the way the mantissa is discarded may also differ. Taking the above factors into account, the present invention correspondingly provides a plurality of groups of batch normalization coefficients.
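The per-threshold coefficient groups described above can be pictured as a lookup table keyed by candidate threshold. The following is only a minimal sketch under stated assumptions: the names `BNCoefficients`, `BN_TABLE` and `batch_norm`, the candidate thresholds 1, 110 and 118, and every coefficient value are hypothetical placeholders, and the normalization formula is the commonly used one rather than one given in this specification.

```python
import math
from typing import NamedTuple


class BNCoefficients(NamedTuple):
    """One group of batch normalization coefficients (hypothetical layout)."""
    gamma: float     # scale
    beta: float      # shift
    mean: float      # running average
    variance: float  # running variance


# Placeholder values only: one coefficient group per candidate exponent threshold.
BN_TABLE = {
    1: BNCoefficients(1.0, 0.0, 0.0, 1.0),
    110: BNCoefficients(1.02, 0.01, 0.0, 0.98),
    118: BNCoefficients(1.05, 0.02, 0.0, 0.95),
}


def batch_norm(x: float, threshold: int, eps: float = 1e-5) -> float:
    """Normalize x with the coefficient group that matches the active threshold."""
    c = BN_TABLE[threshold]  # raises KeyError if threshold is not a candidate
    return c.gamma * (x - c.mean) / math.sqrt(c.variance + eps) + c.beta
```

Switching the exponent threshold then only requires selecting a different key; the numerical data path itself is unchanged.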
In view of the above, once the exponent value of a floating point number is smaller than the threshold value, the present invention may discard the mantissa to save storage space. Alternatively, the present invention may store the mantissa without involving it in decoding or operations, to save the time and effort for data transmission and reduce the power consumption of the operations. In addition, through the adjustability of the threshold, the corresponding electronic product can flexibly make a tradeoff between a high-performance mode and a low-power mode, so that the present invention can reduce power consumption and increase processing speed while keeping the required operation precision.
The present disclosure is particularly described by following examples that are mainly for illustrative purposes. For those who are familiar with the technologies, various modifications and embellishments can be made without departing from the spirit and scope of the present disclosure, and thus the scope of the present disclosure shall be subject to the content of the attached claims. In the entire specification and claims, unless clearly specified, terms such as “a/an” and “the” can be used to describe “one or at least one” assembly or component. In addition, unless the plural use is obviously excluded in the context, singular terms may also be used to present plural assemblies or components. Unless otherwise specified, the terms used in the entire specification and claims generally have the common meaning as those used in this field. Certain terms used to describe the disclosure will be discussed below or elsewhere in this specification, so as to provide additional guidance for practitioners. The examples throughout the entire specification as well as the terms discussed herein are only for illustrative purposes, and are not meant to limit the scope and meanings of the disclosure or any illustrative term. Similarly, the present disclosure is not limited to the embodiments provided in this specification.
The terms “substantially”, “around”, “about” or “approximately” used herein may generally mean that the error of a given value or range is within 20%, preferably within 10%. In addition, the quantity provided herein can be approximate, which means that unless otherwise stated, it can be expressed by the terms “about”, “nearly”, etc. When a quantity, concentration, or other value or parameter has a specified range, a preferred range, or upper and lower boundaries listed in a table, it shall be regarded as a particular disclosure of all possible combinations of ranges constructed by those upper and lower limits or ideal values, regardless of whether such ranges have been separately disclosed. For example, if a disclosed length ranges from X cm to Y cm, it should be understood that the length is H cm, where H can be any real number between X and Y.
In addition, the term “electrical coupling” or “electrical connection” may include direct and indirect means of electrical connection. For example, if the first device is described as electrically coupled to the second device, it means that the first device can be directly connected to the second device, or indirectly connected to the second device through other devices or means of connection. In addition, if the transmission and provision of electric signals are described, those who are familiar with the art should understand that the transmission of electric signals may be accompanied by attenuation or other non-ideal changes. However, unless the source and receiver of the transmission of electric signals are specifically stated, they should be regarded as the same signal in essence. For example, if the electrical signal S is transmitted from the terminal A of the electronic circuit to the terminal B of the electronic circuit, which may cause voltage drop across the source and drain terminals of the transistor switch and/or possible stray capacitance, but the purpose of this design is to achieve some specific technical effects without deliberately using attenuation or other non-ideal changes during transmission, the electrical signals S at the terminal A and the terminal B of the electronic circuit should be substantially regarded as the same signal.
The terms “comprising”, “having” and “involving” used herein are open-ended terms, which can mean “comprising but not limited to”. In addition, the scope of any embodiment or claim of the present invention does not necessarily achieve all the purposes, advantages or features disclosed in the present invention. In addition, the abstract and title are only used to assist the search of patent documents, and are not used to limit the scope of claims of the present invention.
Please refer to
where Sign denotes the sign of the floating point number and Exponent denotes the exponent of the floating point number. In general, the leftmost bit of the register is allocated as a sign bit to store the sign, while the remaining bits (e.g., the remaining 7 – 63 bits) are allocated as exponent bits and mantissa bits to store the exponent and mantissa respectively. In the example of
Next, please refer to
In another example, the decimal value “-0.002” is converted into a binary floating point number “10111011000000110001001001101111”, in which the first bit from the most significant bit stores “1” to indicate the sign, while the second to ninth bits store the exponent and the remaining bits store the mantissa. When the value of the second to ninth bits “01110110” is smaller than the exponent threshold, the mantissa “00000110001001001101111” is regarded as insignificant and thus will not be stored, so that the 10th to 32nd bits will be empty in this situation. As a result, when this floating point number is operated with other floating point numbers in a follow-up process, the mantissa will not participate in operations. In other words, when the exponent of a floating point number is smaller than the exponent threshold, it can be determined that the value of the floating point number is small enough. In this way, under the situation where the mantissa of the floating point number is ignored, the floating point number can be decoded as follows: $(-1)^{\text{Sign}} \times 2^{\text{Exponent}}$
where not all bits of the mantissa have to be involved in calculations or be sent to a register, thus saving power consumption and transmission time. In some cases, the mantissa may not even be stored in the memory, which saves more storage space. In another embodiment, at least one bit of the mantissa does not participate in calculation and is not transmitted into a register and/or a memory, so as to further save storage space.
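The decoding rule above can be checked against the “-0.002” example with a short script. This is a sketch only, assuming IEEE 754 single precision with the usual exponent bias of 127 and ignoring zero, subnormal, infinity and NaN encodings; the helper names `split_fields` and `decode` and the threshold value 120 are hypothetical and do not come from the specification.

```python
import struct


def split_fields(x: float) -> tuple:
    """Return (sign, biased exponent, 23-bit mantissa) of x as a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def decode(sign: int, exponent: int, mantissa: int, threshold: int) -> float:
    """Decode a number, ignoring the mantissa when the exponent is below threshold."""
    scale = 2.0 ** (exponent - 127)  # assumed bias of 127 (single precision)
    if exponent < threshold:
        # Mantissa is not read at all: (-1)^Sign * 2^Exponent
        return (-1.0) ** sign * scale
    # Normal case: implicit leading 1 plus the stored 23-bit fraction
    return (-1.0) ** sign * (1 + mantissa / 2**23) * scale


sign, exponent, mantissa = split_fields(-0.002)
# exponent is 0b01110110 = 118, below the assumed threshold of 120
approx = decode(sign, exponent, mantissa, threshold=120)
```

With these assumptions, `approx` evaluates to -2^-9 (about -0.00195), which shows the size of the error introduced when the mantissa of a small value is ignored.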
In yet another example, the decimal value “0.003” is converted into a binary floating point number, i.e., “00111011010001001001101110100110”, in which the first bit from the most significant bit stores “0” to indicate the sign, the second to ninth bits store the exponent, and the remaining bits store the mantissa. When the value of the second to ninth bits “01110110” is smaller than the exponent threshold, the mantissa “10001001001101110100110” is negligible, but is still stored in the 10th to 32nd bits and marked as “Don’t care”. In this way, when this floating point number is operated with other floating point numbers, the mantissa will not participate in the calculation. The difference between this example and the previous example is that the mantissa in this example remains stored without being decoded or operated on, which saves operational power consumption and data transmission time. Similarly, in the example of
Please refer to
When processing the multiplication operation between the first register 111 and the second register 112, the arithmetic unit 110 compares the first exponent with the exponent threshold through the comparison logic 144. When the first exponent is not smaller than the exponent threshold, which indicates that the value of the first floating point number is relatively large and thus the significant digits of the mantissa cannot be ignored, the multiplication logic 143 multiplies the first mantissa by the second mantissa to generate the mantissa operation result (i.e., the output of the multiplication logic 143). If the first exponent is smaller than the exponent threshold, which indicates that the value of the first floating point number is relatively small, the significant digits of the mantissa can be ignored. In that case, the first mantissa is multiplied by the second mantissa after at least one bit (i.e., one or more bits) of the first mantissa is discarded, to obtain the mantissa operation result. This step may comprise discarding just one bit, several bits, or even all bits (i.e., ignoring the whole first mantissa, which is equivalent to directly generating the mantissa operation result according to the second mantissa). Preferably, the whole first mantissa is discarded, which saves the most power. However, if higher precision is required, power consumption can still be reduced by discarding only one bit. In addition, the XOR operation between the first sign and the second sign can be performed by the XOR logic 141 to generate a sign operation result (i.e., the output of the XOR logic 141), and the first exponent can be added to the second exponent by the addition logic 142 to generate an exponent operation result (i.e., the output of the addition logic 142).
Finally, a calculated floating point number is generated according to the mantissa operation result, the sign operation result and the exponent operation result, and serves as the final operation result. When the first exponent is smaller than the exponent threshold, the first floating point number is decoded into $(-1)^{\text{Sign1}} \times 2^{\text{Exponent1}}$, where Sign1 denotes the first sign, and Exponent1 denotes the first exponent. Similarly, in addition to comparing the first exponent with the exponent threshold, this embodiment can further compare the second exponent with the exponent threshold. When the second exponent is smaller than the exponent threshold, the second floating point number is decoded into $(-1)^{\text{Sign2}} \times 2^{\text{Exponent2}}$, where Sign2 denotes the second sign and Exponent2 denotes the second exponent. In this embodiment, the illustrated XOR logic 141, addition logic 142, multiplication logic 143 and comparison logic 144 are merely for illustrative purposes. The exact implementation may be based on actual needs and can differ from what is shown in this embodiment; the present invention covers all such adjustments of detail without additional restrictions. In an example of the present invention, the multiplication logic 143 of a single-precision floating-point arithmetic unit may interpret the mantissa in the form of “1.Mantissa”, where “1” on the left of the decimal point is an integer, and “Mantissa” on the right of the decimal point denotes the mantissa. In addition, the addition logic 142 of the single-precision floating-point arithmetic unit interprets Exponent as “Exponent - 127” (referred to as “Exponent minus 127”), and then performs the addition operation, but the present invention is not limited thereto. Although the above mostly relates to simplification of the storing and transmission of the first mantissa, the same concept can also be applied to the second mantissa.
For example, the roles of the above-illustrated first and second mantissas can be interchanged, or the simplification of the storing and transmission can be performed on both of the first and second mantissas.
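The data path described above (comparison, XOR, addition and multiplication) can be modeled in software as follows. This is a behavioral sketch rather than the hardware of this embodiment: it assumes single precision with the “1.Mantissa” and “Exponent - 127” interpretations, ignores zero, subnormal, infinity and NaN inputs, and returns an unrounded Python float instead of a re-encoded register value. The names `to_fields`, `multiply` and the `keep_bits` parameter are hypothetical.

```python
import struct

MANTISSA_BITS = 23
BIAS = 127


def to_fields(x: float) -> tuple:
    """Split x (as a 32-bit float) into (sign, biased exponent, mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> MANTISSA_BITS) & 0xFF, bits & 0x7FFFFF


def multiply(a: float, b: float, threshold: int, keep_bits: int = 0) -> float:
    """Multiply a by b, truncating a's mantissa when its exponent is small."""
    s1, e1, m1 = to_fields(a)
    s2, e2, m2 = to_fields(b)

    # Comparison logic: below the threshold, discard the low mantissa bits.
    # keep_bits=0 discards the whole first mantissa.
    if e1 < threshold:
        drop = MANTISSA_BITS - keep_bits
        m1 = (m1 >> drop) << drop

    sign = s1 ^ s2                        # XOR logic
    exponent = (e1 - BIAS) + (e2 - BIAS)  # addition logic
    # Multiplication logic on "1.Mantissa" significands
    significand = (1 + m1 / 2**MANTISSA_BITS) * (1 + m2 / 2**MANTISSA_BITS)

    return (-1.0) ** sign * significand * 2.0 ** exponent
```

For instance, `multiply(-0.002, 3.0, threshold=120)` drops the first mantissa entirely because the exponent 118 is below 120, so the result is -1.5 × 2^-8 instead of approximately -0.006. Applying the same truncation to the second operand would follow the same pattern.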
According to different embodiments of the present invention, the exponent threshold may be a fixed value, or dynamically adjustable. With the design of an adjustable threshold, the desired precision of floating-point operations can be selected. For example, if the threshold is large, there will be more mantissas that are not decoded, and thus the power consumption of data transmission and operation can be greatly reduced. The exponent threshold can be dynamically adjusted according to the temperature of the arithmetic unit 110 and/or the type of the tasks to be processed by the arithmetic unit 110. For example, when the current temperature of the arithmetic device 100 is too high and needs to be cooled down, the exponent threshold can be tuned up so that the arithmetic unit 110 can operate in a low power consumption and low temperature mode. In addition, when the arithmetic device 100 is a mobile device and does not have much power left, the exponent threshold can also be tuned up to extend the standby time of the mobile device. In addition, if the arithmetic unit 110 performs operations that require good precision, the exponent threshold can be tuned down so that more mantissas can be decoded, thereby improving the precision.
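One possible way to express such an adjustment policy in software is sketched below. The function name `adjust_threshold`, its parameters, the step size of 1, and all numeric limits (85 °C, 15% battery, threshold bounds 1 to 127) are assumptions chosen for illustration; the specification does not fix any of them.

```python
def adjust_threshold(
    threshold: int,
    temperature_c: float,
    battery_pct: float,
    needs_high_precision: bool,
    t_min: int = 1,                 # assumed lower bound of the threshold
    t_max: int = 127,               # assumed upper bound of the threshold
    hot_c: float = 85.0,            # assumed "too hot" limit
    low_battery_pct: float = 15.0,  # assumed "low power" limit
) -> int:
    """Return the next exponent threshold for the current operating conditions."""
    if needs_high_precision:
        # Precision-critical task: decode more mantissas.
        return max(t_min, threshold - 1)
    if temperature_c > hot_c or battery_pct < low_battery_pct:
        # Overheating or low battery: ignore more mantissas to save power.
        return min(t_max, threshold + 1)
    return threshold
```

In this sketch a precision requirement takes priority over power saving; a real product could order or weight these conditions differently.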
Selectively, according to the embodiment of the present invention, the exponent threshold is in a dynamically adjustable range. The arithmetic unit 110 starts training with an exponent threshold with a value of 1, and determines whether a criterion is met, namely whether the operation precision is higher than the precision threshold. If the criterion is met, the arithmetic unit 110 tunes up the value of the exponent threshold until the operation precision is no longer higher than the precision threshold, and the dynamically adjustable range may be formed by the exponent thresholds that meet the criterion. The invention ignores the mantissas of the floating point numbers with small values, and decodes the mantissas of those with large values. Compared with the conventional techniques, the present invention can avoid over-designing the hardware architecture, so that the hardware architecture design can be simplified, thus saving the power consumption and time of data storage and data transmission.
As can be seen from the above embodiments, since the arithmetic device 100 can be applied in different scenarios, how to properly select the exponent threshold is very important, which yields the optimal tradeoff between precision, power consumption and processing speed. If the present invention is applied to the artificial intelligence (AI) model, an appropriate exponent threshold can be calculated according to the current requirements of the arithmetic device 100. Please refer to
Step S502: Set an initial value of the exponent threshold to 1.
Step S504: Apply an exponent threshold to the AI model.
Step S506: Retrain the AI model according to the exponent threshold.
Step S508: Determine whether the decline of the precision of the floating-point operation has reached the maximum acceptable degree of the AI model; if not, the flow enters Step S510; if yes, the flow enters Step S512.
Step S510: Tune up the exponent threshold.
Step S512: The training is completed.
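The training flow can be sketched as a simple search loop. The sketch assumes that the threshold keeps being tuned up while the precision decline is still below the acceptable limit, consistent with the range-search embodiment described earlier, and that the flow returns to Step S504 after each increase. The callback `retrain_and_measure_drop`, which stands in for Steps S504 and S506, and the cap `max_threshold` are hypothetical.

```python
from typing import Callable


def search_threshold(
    retrain_and_measure_drop: Callable[[int], float],
    max_acceptable_drop: float,
    max_threshold: int = 255,  # assumed cap: largest 8-bit exponent value
) -> int:
    """Raise the exponent threshold until the precision decline hits its limit."""
    threshold = 1  # S502: initial value
    while True:
        # S504 + S506: apply the threshold and retrain; the callback
        # returns the measured decline in precision for this threshold.
        drop = retrain_and_measure_drop(threshold)
        # S508: stop once the decline reaches the acceptable maximum.
        if drop >= max_acceptable_drop or threshold >= max_threshold:
            return threshold  # S512: training completed
        threshold += 1  # S510: tune up the threshold
```

A real retraining callback would run the AI model; for a quick check, any monotonic function of the threshold can stand in for it.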
To sum up,
Please refer to
Step S602: Determine whether the processing chip needs to reduce the power consumption; if yes, the flow enters Step S604; if not, the flow enters Step S608.
Step S604: Determine whether the decline of the precision of the floating-point operation has reached the maximum acceptable degree of the AI model; if not, the flow enters Step S606; if yes, the flow enters Step S608.
Step S606: Tune up the exponent threshold.
Step S608: The process ends.
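Steps S602 to S608 amount to one guarded increment of the threshold, which might be written as below. The function name `reduce_power_step` and its parameters are hypothetical, and the sketch assumes a single pass per invocation with the precision decline supplied by the caller.

```python
def reduce_power_step(
    threshold: int,
    needs_power_reduction: bool,
    precision_drop: float,
    max_acceptable_drop: float,
) -> int:
    """Run one pass of the power-saving flow and return the new threshold."""
    if not needs_power_reduction:              # S602: no need to save power
        return threshold                       # S608
    if precision_drop >= max_acceptable_drop:  # S604: precision limit reached
        return threshold                       # S608
    return threshold + 1                       # S606: tune up the threshold
```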
To sum up,
Please refer to
Step S702: Determine whether the calculation precision of the processing chip needs to be improved; if yes, the flow enters Step S704; if not, the flow enters Step S708.
Step S704: Determine whether the exponent threshold is 1 (i.e., the minimum value of the exponent threshold); if not, the flow enters Step S706; if yes, the flow enters Step S708.
Step S706: Tune down the exponent threshold.
Step S708: The process ends.
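The opposite adjustment in Steps S702 to S708 can be sketched the same way. The name `improve_precision_step` is hypothetical; the minimum value of 1 comes from Step S704.

```python
def improve_precision_step(
    threshold: int,
    needs_more_precision: bool,
    min_threshold: int = 1,  # minimum value named in Step S704
) -> int:
    """Run one pass of the precision-improving flow and return the new threshold."""
    if not needs_more_precision:    # S702: precision is already sufficient
        return threshold            # S708
    if threshold == min_threshold:  # S704: cannot go any lower
        return threshold            # S708
    return threshold - 1            # S706: tune down the threshold
```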
To sum up,
Please refer to
Step S802: Compare a first exponent with an exponent threshold, wherein when the first exponent is not smaller than the exponent threshold, the first mantissa is multiplied by the second mantissa to generate a mantissa operation result; and when the first exponent is smaller than the exponent threshold, the first mantissa is multiplied by the second mantissa after discarding at least one bit to generate a mantissa operation result.
Step S804: Perform an XOR operation upon a first sign and a second sign to generate a sign operation result.
Step S806: Add the first exponent to a second exponent to generate an exponent operation result; and
Step S808: Generate a calculated floating point number according to the mantissa operation result, the sign operation result and the exponent operation result.
Since those skilled in the art can easily understand the details of each step in
In view of the above, once the exponent value of a floating point number is smaller than the threshold value, the present invention may discard the mantissa (i.e., the mantissa will not be stored in a memory) to save storage space, or only store the mantissa without involving it in decoding and operation, so as to save the time and effort for data transmission and the operational power consumption. In addition, through the adjustability of the threshold (see the optimization flow in
Number | Date | Country | Kind |
---|---|---|---|
112103523 | Feb 2023 | TW | national |
Number | Date | Country | |
---|---|---|---|
63305711 | Feb 2022 | US |