The following is related generally to the field of microprocessors and, more specifically, to microprocessor-based devices for performing floating-point arithmetic.
Computer systems frequently include a floating-point unit, or FPU, often referred to as a math coprocessor. In general-purpose computer architectures, one or more FPUs may be integrated as execution units within the central processing unit. An important category of floating-point calculations is the calculation of dot-products (or inner-products) of vectors, in which a pair of vectors are multiplied component by component and the results then added up to provide a scalar output result. An important application of dot-products is in artificial neural networks. Artificial neural networks are finding increasing usage in artificial intelligence applications and fields such as image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, expert systems, autonomous (self-driving) vehicles, data mining, and many other applications. An artificial neural network is formed of a large number of layers through which an initial input is propagated. At each layer, the input will be a vector of values that is multiplied with a vector of weights as a dot-product to provide an output for the layer. Such artificial neural networks can have very large numbers of layers (network depth) and involve large numbers of dot-products within each layer (network width), so that propagating an initial input through a network is extremely computationally intensive. When training an artificial neural network (i.e., the process of determining a network's weight values), a number of iterations typically must be run repeatedly through the network to determine accurate weight values. Given the increasing importance of artificial neural networks, the ability to efficiently compute large numbers of dot-products is of great importance.
When computing a dot-product of floating-point vectors, the components of the vectors are individually multiplied and summed. To properly align the accumulated sum of the dot-product, the maximum exponent of the individual products needs to be determined, as each product's mantissa must be right-shifted by the difference between the maximum exponent and that product's exponent. This process can be quite time consuming, requiring several processing cycles and slowing down the dot-product computation. Given the extremely large numbers of dot-product computations involved in both the training and inferencing phases for artificial neural networks, the ability to more rapidly compute dot-products is of increasing importance.
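This conventional alignment can be sketched as follows (a simplified Python model of the fixed-point summation, not the disclosed hardware; the function name and the (mantissa, exponent) pair representation are illustrative):

```python
def dot_product_fixed_point(products):
    """products: list of (mantissa, exponent) pairs representing a_i * b_i.

    Aligns all mantissas to the maximum exponent before summing, as in the
    conventional scheme described above."""
    max_exp = max(exp for _, exp in products)
    total = 0
    for mant, exp in products:
        # Right-shift each mantissa by its distance from the maximum exponent.
        total += mant >> (max_exp - exp)
    return total, max_exp
```

Finding `max_exp` requires examining every product before any addition can complete, which is the serialization the disclosed accumulator avoids.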
According to one aspect of the present disclosure, there is a method of calculating a floating-point dot-product performed by a processor, comprising receiving a sequence of first floating-point numbers of a first operand at a floating-point unit (FPU) processor, the sequence of the first floating-point numbers having a sign, a mantissa value and an exponent value; receiving a sequence of second floating-point numbers of a second operand at the floating-point unit (FPU) processor, the sequence of the second floating-point numbers having a sign, a mantissa value and an exponent value; storing the sequence of the first and the second floating-point numbers in one of a memory or a register; determining, by the FPU, the floating-point dot-product of the sequence of the first floating-point numbers and the sequence of the second floating-point numbers, by: adding the exponent values of the sequence of the first and the second floating-point numbers to determine an exponent product, the exponent product having a high exponent and a low exponent; calculating a shift amount as a one's complement of the low exponent; multiplying the mantissas of the sequence of the first and second floating-point numbers to determine a product value of the mantissas; right shifting the product value of the mantissas by the shift amount to generate a shifted product; selecting one or more first segments of an accumulator based on the high exponent, and adding the one or more first selected segments to the shifted product to generate a sum; and writing the generated sum into the selected one or more first segments of the accumulator.
Optionally, in any of the two preceding aspects, wherein the accumulator includes one or more second segments consisting of the leftmost bits to the left of the one or more first segments; the accumulator includes one or more third segments consisting of the rightmost bits to the right of the one or more first segments; a second accumulator register is divided into segments, in which each segment's value is one more than the value present in the corresponding segment of the accumulator; and a third accumulator register is divided into segments, in which each segment's value is one less than the value present in the corresponding segment of the accumulator.
Optionally, in any of the two preceding aspects, wherein each segment in the accumulator includes first and second flag register bits, the first and second flag register bits identify a state of each segment in the accumulator, and each of the first and second flag register bits is updated when bits in a corresponding segment change.
Optionally, in any of the two preceding aspects, wherein the first flag register bit indicates that a segment includes all one bits and the second flag register bit indicates that the segment includes all zero bits.
Optionally, in any of the two preceding aspects, the method further comprising: updating the one or more second segments when the sum written into the one or more first segments of the accumulator is positive with a carry-out, wherein j+1 of the one or more second segments are loaded from a corresponding j+1 of the one or more corresponding second accumulator segments, j+1 of the one or more corresponding third accumulator segments are loaded from a corresponding j+1 of the one or more second segments of the accumulator, j+1 of the one or more corresponding second accumulator register segments are incremented, and j represents a number of consecutive segments set with the first flag bits immediately to the left of the first segments.
Optionally, in any of the two preceding aspects, the method further comprising updating the one or more second segments when the sum written into the one or more first segments of the accumulator is negative, wherein j+1 of the one or more second segments are loaded from a corresponding j+1 of the one or more corresponding third accumulator segments, j+1 of the one or more corresponding second accumulator segments are loaded from a corresponding j+1 of the one or more second segments of the accumulator, j+1 of the one or more corresponding third accumulator register segments are decremented, and j represents a number of consecutive segments set with the second flag bits immediately to the left of the first segments.
Optionally, in any of the two preceding aspects, wherein the one or more third segments consisting of the rightmost bits to the right of the one or more first segments remain unchanged, either by selecting the output of the one or more third segments as the next input to the corresponding one or more third segments or by clock-gating the segments off.
Optionally, in any of the two preceding aspects, wherein the floating-point dot-product is a 2's complement format of the values of the registers of the accumulator for all segments.
Optionally, in any of the two preceding aspects, wherein when the floating-point dot-product is positive, the floating-point dot-product is a sign-magnitude format of the values of the registers of the accumulator for all segments, excluding the most significant bit of the registers; when the floating-point dot-product is negative, the floating-point dot-product is a sign-magnitude format of a 1's complement of the values of the registers of the third accumulator for all segments, excluding the most significant bit of the registers; and the sign bit of the floating-point dot-product is the most significant bit of the registers of the accumulator regardless of whether the most significant bit of the registers of the accumulator was positive or negative.
According to one further aspect of the present disclosure, there is a microprocessor, comprising a first input register configured to hold a sequence of first floating-point numbers of a first operand, the sequence of the first floating-point numbers having a sign, a mantissa value and an exponent value; a second input register configured to hold a sequence of second floating-point numbers of a second operand, the sequence of the second floating-point numbers having a sign, a mantissa value and an exponent value; and a floating-point unit connected to the first and second input registers and configured to compute the floating-point dot-product of the sequence of the first floating-point numbers and the sequence of the second floating-point numbers, the floating-point unit comprising an adder adding the exponent values of the sequence of the first and the second floating-point numbers to determine an exponent product, the exponent product having a high exponent and a low exponent; a set of inverters, one for each bit of the low exponent, which calculate a shift amount as a one's complement of the low exponent; a multiplier multiplying the mantissas of the sequence of the first and second floating-point numbers to determine a product value of the mantissas; a shifter right shifting the product value of the mantissas by the shift amount to generate a shifted product; a multiplexor selecting one or more first segments of an accumulator based on the high exponent, and an adder adding the one or more first selected segments to the shifted product to generate a sum; and a set of multiplexors, one for each segment of the accumulator, which write the generated sum into the selected one or more first segments of the accumulator.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background.
Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures, in which like references indicate like elements.
The disclosed technology generally relates to microprocessor-based devices for performing floating-point arithmetic. More specifically, the disclosure relates to an accumulator, such as a Kulisch Accumulator, which allows for removal of sources of rounding errors in a sum of products. In one embodiment, this is achieved using an accumulator which contains an entire floating-point range, in which a single product is added to the accumulator every cycle. In particular, a very wide fixed-point accumulator is used whose bits cover the full exponent range of a product. For single precision, the weights of the bits can range from 2^(−126−23) = 2^−149 to 2^127. Multiplying two such single-precision numbers produces a result with a range from 2^−298 to 2^254, which is a range of 552 bits. In one embodiment, the fixed-point result is in sign-magnitude format.
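The full-range accumulation idea can be modeled in a few lines of Python (a software sketch under the assumption that the inputs are single-precision representable, so every product is a multiple of 2^−298; the function name and scaling constant are illustrative):

```python
from fractions import Fraction

SCALE = 298  # LSB weight is 2**-298 for single-precision products


def accumulate_products(pairs):
    """Exactly accumulate products of float pairs in a fixed-point integer.

    Models a full-range (Kulisch-style) accumulator: no rounding occurs
    during accumulation, only at final read-out (not shown here)."""
    acc = 0
    for a, b in pairs:
        # Fraction(a) * Fraction(b) is exact; scaling by 2**SCALE yields an
        # integer because every single-precision product is a multiple of
        # 2**-298 (an assumption on the inputs of this sketch).
        prod = Fraction(a) * Fraction(b) * (2 ** SCALE)
        assert prod.denominator == 1
        acc += prod.numerator
    return acc  # exact sum, scaled by 2**SCALE
```

Because Python integers are arbitrary precision, the wide fixed-point register is modeled by a single int; the hardware described below instead segments this register to make per-cycle updates tractable.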
It is understood that the present embodiments of the disclosure may be implemented in many different forms and that the scope of the claims should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the inventive concepts to those skilled in the art. Indeed, the disclosure is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the disclosure as defined by the appended claims. Furthermore, in the following detailed description of the present embodiments of the disclosure, numerous specific details are set forth in order to provide a thorough understanding. However, it will be clear to those of ordinary skill in the art that the present embodiments of the disclosure may be practiced without such specific details.
The processing block 140 includes combinatorial logic 143 that is configured to execute instructions and registers 141 in which the combinatorial logic stores instructions and data while executing these instructions. In the simplified representation of
The following considers the calculation of floating-point dot-products, such as in the FPU 147 of
The dot product is a basic computation of linear algebra and is commonly used in deep learning and machine learning. In a single layer of a basic neural network, each neuron takes a result of a dot product as input, then uses its preset threshold to determine the output. Artificial neural networks are finding increasing usage in artificial intelligence applications and fields such as image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, expert systems, autonomous (self-driving) vehicles, data mining, and many other applications. An artificial neural network is formed of a large number of layers through which an initial input is propagated. At each layer, the input will be a vector of values that is multiplied with a vector of weights as a dot-product to provide an output for the layer. Such artificial neural networks can have very large numbers of layers (network depth) and involve large numbers of dot-products within each layer (network width), so that propagating an input through a network is extremely computationally intensive. In the training of an artificial neural network (i.e., the process of determining a network's weight values), a number of inputs typically must be run repeatedly through the network to determine accurate weight values. Given the increasing importance of artificial neural networks, the ability to efficiently compute large numbers of dot-products is of great importance.
In the Kulisch long accumulator (also referred to commonly as a superaccumulator), the length of the accumulator is chosen such that every bit of information of the input format can be represented (e.g., binary64). This covers the range from the minimum representable FP value to the maximum value, independently of the sign. For example, the depicted Kulisch accumulator utilizes an accumulator of 4288 bits to handle the accumulation of products of 64-bit FP values. The addition is performed without loss of information by accumulating every FP input number in the long accumulator. The accumulator produces the exact result of a very large number of FP values of arbitrary magnitude. However, this technique may suffer from very large memory overhead.
In the depicted architecture, there are N = ⌈w_a/b⌉ words, where b is the number of bits per word. The shift operation is then decomposed into two steps: selecting the words in which to send a mantissa, and shifting within a word. If b is chosen as a power of two, b = 2^k, the intra-word shift distance is simply obtained as the k lower bits of the exponent, while the word address is obtained as the w′_e − k leading bits. The shifted mantissa will typically be spread across multiple words. More specifically, after a shift of maximum size b − 1, the shifted mantissa is of size w′_f − 1 + b and is spread over
words. Using this technique, the two steps of the shift can be executed in parallel, and accumulating an input only requires adding it to S words (in the depicted example, S = 3, such that the shifted mantissa spans three words). However, there are a few disadvantages. There is a carry propagation from one accumulator word to the next, which potentially requires updating all the words above the S target words. Additionally, there is a sign management issue. As input mantissas may be signed, they need to be either added to or subtracted from the accumulator. In the case of subtraction, there is an issue of borrow propagation.
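The word-level decomposition described above can be sketched as follows (Python model; the width k = 5 and the chunking helper are assumptions for illustration):

```python
def place_mantissa(mantissa, exponent, k=5):
    """Decompose a long shift into word select + intra-word shift.

    With b = 2**k bits per word, the k low exponent bits give the shift
    within a word, and the remaining leading bits give the word address."""
    b = 1 << k
    word_addr = exponent >> k        # leading exponent bits: first word
    intra = exponent & (b - 1)       # low k bits: shift inside the word
    value = mantissa << intra        # mantissa aligned within its words
    words = []                       # S consecutive b-bit words
    while value:
        words.append(value & ((1 << b) - 1))
        value >>= b
    return word_addr, words
```

The word address and intra-word shift are extracted independently, which is what lets the hardware perform the two steps of the shift in parallel.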
In one embodiment, the accumulator of
The S words of the shifted mantissa are sent to the stages to the right (numbered −S+1 to 0). The pipeline also transmits the mantissa sign, as well as the stage where the mantissa needs to be added. The stages between stage 0 and stage N−S are identical. The S mantissa words are added in the current sub-adder whose stage matches, plus the carry from the previous stage. The result of the sum is computed over b + log2(S+2) bits, where the b lower bits of the result are kept in the current sub-adder and the log2(S+2) higher bits are the carry sent to the next stage. On both ends of the pipeline, there are S−1 stages that perform the same computation but with fewer terms to sum, as illustrated.
While the pipelined accumulator pipelines the carry-propagation, this requires N additional ‘null’ cycles after the end of an accumulation with zero input to propagate the carries. For example, the N for a double-precision calculation is sixty-four (64).
In one embodiment, and with reference to
Turning to
Initialization begins at step 802 where an index value i is initialized to −2.
Initialization of the next segment (segment i, where i is initially set to −2 as in step 802) of the accumulator registers for the ACC registers 960, the ACC+1 registers 958 and the ACC−1 registers 962 is then performed at step 804. For example, the ACC+1 registers 958 in segment i are initialized to 0x0000_0001, the ACC registers 960 in segment i are initialized to 0x0000_0000 and the ACC−1 registers 962 in segment i are initialized to 0xFFFF_FFFF. At step 806, the ACC register "All ones" and "All zeros" flags for segment i are initialized (as described above), and the index value i is incremented by 1 (i=i+1) at step 808.
At step 810, the index value i is checked to determine if it is equal to 17, which indicates that all 19 segments (segments 16 to −2) have been initialized. If i=17, then initialization of segments 16 to −2 has been completed, and the process proceeds to step 812, where the accumulator registers ACC, ACC+1 and ACC−1 are clocked. Otherwise, if the index value i<17, then not all of the segments have been initialized and the process returns to step 804 for continued processing.
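The initialization loop of steps 802 through 812 can be modeled as follows (a behavioral Python sketch; the dictionary representation of segments is an illustration, not the register file itself):

```python
SEGMENTS = range(-2, 17)   # 19 segments, indices -2 .. 16


def init_accumulator():
    """Reset the three accumulator register banks and per-segment flags."""
    acc   = {i: 0x0000_0000 for i in SEGMENTS}   # ACC registers 960
    acc_p = {i: 0x0000_0001 for i in SEGMENTS}   # ACC+1 registers 958
    acc_m = {i: 0xFFFF_FFFF for i in SEGMENTS}   # ACC-1 registers 962
    # "All ones" / "All zeros" flags derived from each ACC segment value.
    all_ones  = {i: acc[i] == 0xFFFF_FFFF for i in SEGMENTS}
    all_zeros = {i: acc[i] == 0x0000_0000 for i in SEGMENTS}
    return acc, acc_p, acc_m, all_ones, all_zeros
```

After reset, every ACC+1 segment is one more than its ACC segment and every ACC−1 segment is one less (modulo 2^32), which is the invariant the update logic later relies on.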
Turning now to
Source operands A and B (also referred to herein as Src0 and Src1) are received and stored in registers or memory. These inputs may be for any number of different operations, including use in high performance computing (HPC), CPUs, GPUs, deep learning or neural networks, artificial intelligence, etc. The exponents AExp and BExp for each source operand may be sent to the multiplier, where they may be added component by component. The mantissas AMant and BMant of each source operand may be sent to a mantissa multiplier for component-wise multiplication. The exponent product for each of the components is then split into a lower part and a higher part. For example, the exponents may be M bits, and the M-bit exponent product is split into a high exponent (Exp_Hi) and a low exponent (Exp_Lo). Additionally, the signs ASign and BSign are extracted from the source operands.
In one embodiment, each of the sign (s), exponent (Exp) and mantissa (Mant) operations may be performed concurrently, as separated by the dashed lines representing component paths for determining the sign, exponent and mantissa. For a single-precision accumulator, used for purposes of discussion throughout this disclosure, the sign is bit 31, the exponent is bits 30-23 and the mantissa is bits 22-0. Beginning with the exponents AExp and BExp in stage 930 (
In step 830, each exponent AExp and BExp is then zero-extended by one bit, and the exponents are added with a 9-bit carry propagate adder (CPA) 932 (i.e., zero-extended AExp bits [8:0] + zero-extended BExp bits [8:0] = sum of the exponents, which is 9 bits [8:0]) with a carry-in (Cin) bit of '1.' From the 9-bit sum output from the CPA 932, the low and high bits are extracted. Specifically, bits [4:0] of the sum are extracted as the low exponent bits (Exp_Lo[4:0]), which may be used to calculate a shift amount from the exponent sum, at step 832. Bits [8:5] of the sum are extracted as the high exponent bits (Exp_Hi[8:5]), which may be used to calculate a segment start, at step 834. The embodiments in the examples assume that the product of the exponents (ProductExp[8:0]) = (Exp_Hi[8:5], Exp_Lo[4:0]); that is, ProductExp is the concatenation of Exp_Hi and Exp_Lo. However, it is appreciated that these techniques are applicable to any combination of widths of Exp_Hi and Exp_Lo which total the width of the product exponent. Once the high and low exponents are extracted, a shift amount may be calculated by inverting (INV), with INV 938, the low exponent (NOT(Exp_Lo[4:0])), which is the equivalent of subtracting the low exponent from 31. The high exponent will be used in step 834 to calculate a starting segment.
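The exponent path of steps 830 through 834 can be modeled as follows (Python sketch with the single-precision widths used in the discussion and a carry-in of 1 as described for CPA 932; the function name is illustrative):

```python
def exponent_path(a_exp, b_exp):
    """Model of the 9-bit exponent adder and the Exp_Hi/Exp_Lo split."""
    # Zero-extend both 8-bit exponents to 9 bits and add with Cin = 1.
    product_exp = (a_exp + b_exp + 1) & 0x1FF
    exp_hi = (product_exp >> 5) & 0xF   # bits [8:5]: segment start
    exp_lo = product_exp & 0x1F         # bits [4:0]
    # One's complement of the low bits is equivalent to 31 - exp_lo.
    shift_amount = (~exp_lo) & 0x1F
    return exp_hi, exp_lo, shift_amount
```

For the operand-3 worked example later in this description (AExp = BExp = 0x80), this yields Exp_Hi = 8, Exp_Lo = 1 and a shift amount of 30.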
For the mantissa bits 23:0, a non-zero exponent represents a normal floating-point value and gets a leading mantissa bit of '1.' A zero exponent, which represents a denormal floating-point value, gets a leading mantissa bit of '0.' This is expressed as: Mantissa[23]=OR(exponent[7:0]).
At step 824, a multiplier, such as a 24×24 bit unsigned multiplier 934, multiplies two FP significands (AMant*BMant) to form a 48-bit unsigned product 943 (product[47:0]). The one's complement of the lower 5 bits of the product exponent is used to right-shift the 48-bit unsigned product by the shift amount calculated in step 832. The leftmost 4 bits of the product exponent are decoded to form 1-hot control bits (control signals) into multiplexers (MUXes) and shifters, such as a barrel shifter (not shown). In the present example, the maximum shift amount of the significand is 31 bits. Thus, the total output width of the shifter is 48 bits + 31 bits = 79 bits. Specifically, the 79-bit shifted product 945 is created by right shifting the signed mantissa by the shift amount (Shifted_Product[78:0] = Signed_Mant[47:0] >> Shift_Amount[4:0]).
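The significand path (leading-bit insertion, 24×24 multiply and right shift) can be sketched as follows (Python model; representing the 79-bit window by pre-shifting the product left by 31 is an interpretation of the shifter described above, and the names are illustrative):

```python
def mantissa_path(a_exp, a_frac23, b_exp, b_frac23, shift_amount):
    """Model of the significand multiply and shift (steps 824 and 832).

    a_frac23/b_frac23 are the 23 stored fraction bits; the implicit
    leading bit is 1 for a normal (nonzero exponent), 0 for a denormal."""
    a_sig = ((1 if a_exp else 0) << 23) | a_frac23
    b_sig = ((1 if b_exp else 0) << 23) | b_frac23
    product = a_sig * b_sig                    # 48-bit unsigned product
    # Right shift within a 79-bit window: the product starts at bits
    # [78:31] and moves down by shift_amount (0..31).
    shifted = (product << 31) >> shift_amount  # 79-bit shifted product
    return product, shifted
```

With the operand-3 significands used later (stored fraction 0x000001, giving a 24-bit significand of 0x800001), this reproduces the mantissa product 0x4000_0100_0001 of that example.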
Although not shown in
At step 836, three consecutive segments of the accumulator to be added to the 79-bit signed and shifted product 947 are selected using word selector 942. As shown in
The 96-bit signed and shifted sum 948 is written into the selected three consecutive ACC segments 952 (or ‘Mid’ segment) of the accumulator, as shown in
The sum being written into the 3 segments of the accumulator ACC from the 96-bit signed and shifted sum 948 is represented as the following: 1) ACC[ExpHi[3:0]][31:0]=Sum[95:64], 2) ACC_In[ExpHi[3:0]-1][31:0]=Sum[63:32], and 3) ACC_In[ExpHi[3:0]-2][31:0]=Sum[31:0]. Segments written to the right (right segment 956) of the mid segment 952 are written with the previous value of the ACC (these bits will be unchanged in the next clock cycle).
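The write of the 96-bit sum into three consecutive segments can be expressed compactly (Python sketch; the dict-of-segments representation and function name are illustrative):

```python
def add_to_segments(acc, start_hi, shifted_sum_96):
    """Write the 96-bit signed sum into three consecutive ACC segments.

    start_hi: Exp_Hi[3:0], the index of the leftmost selected segment.
    Mapping follows the text: ACC[n] = Sum[95:64], ACC[n-1] = Sum[63:32],
    ACC[n-2] = Sum[31:0]."""
    m = 0xFFFF_FFFF
    acc[start_hi]     = (shifted_sum_96 >> 64) & m
    acc[start_hi - 1] = (shifted_sum_96 >> 32) & m
    acc[start_hi - 2] = shifted_sum_96 & m
    return acc
```

Segments to the right of the written ones simply hold their previous values, so they need no write at all.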
For segments to the left (left segment 954) of the mid segment 952, ACC, ACC+1, or ACC−1 values are selected to be written back to the accumulator. The values ACC, ACC+1 and ACC−1 are maintained in registers 960, 958 and 962, respectively, as illustrated in
In the illustrated example of
Continuing with reference to
At step 840, if the 96-bit signed and shifted sum 948 is positive with a carry-out, then the process proceeds to step 852. At step 842, if the 96-bit signed and shifted sum 948 is negative, then the process proceeds to step 846; otherwise, if the 96-bit signed and shifted sum 948 is positive with no carry-out, then the process proceeds to step 844.
At step 852, for all consecutive segments with the ACC "All Ones" flag set (starting at segment i and incrementing i until and including the first segment whose "All Ones" flag is NOT set, at step 854), perform the following operations: copy the ACC+1 segments 958 into the ACC segments 960, copy the ACC segments 960 to the ACC−1 segments 962, increment the ACC+1 segments 958 and calculate new 'All ones' and 'All zeros' flags based on the value of the ACC register segments.
Continuing at step 856, the index value i is continually incremented by 1 (i=i+1). If i<17, then the ACC, ACC+1, and ACC−1 register segments corresponding to each index i are left unchanged, either by loading each register segment from itself or by clock-gating each segment OFF. When i=17, the process proceeds to step 858 to process more operands.
At step 846, for all consecutive segments with the ACC "All Zeros" flag set (starting at segment i and incrementing i until and including the first segment whose "All Zeros" flag is NOT set, at step 848), perform the following operations: copy the ACC−1 segments 962 into the ACC segments 960, copy the ACC segments 960 to the ACC+1 segments 958, decrement the ACC−1 segments 962 and calculate new 'All ones' and 'All zeros' flags based on the value of the ACC register segments.
Continuing at step 850, the index value i is continually incremented by 1 (i=i+1). If i<17, then the ACC, ACC+1, and ACC−1 register segments corresponding to each index i are left unchanged, either by loading each register segment from itself or by clock-gating each segment OFF. When i=17, the process proceeds to step 858 to process more operands.
At step 844, the index value i is continually incremented by 1 (i=i+1). If i<17, then the ACC, ACC+1, and ACC−1 register segments corresponding to each index i are left unchanged, either by loading each register segment from itself or by clock-gating each segment OFF. When i=17, the process proceeds to step 858 to process more operands.
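The left-segment update of steps 840 through 856 can be modeled behaviorally (Python sketch; the flags are recomputed from the ACC values rather than held in flag registers, and the helper and variable names are illustrative):

```python
MASK32 = 0xFFFF_FFFF


def update_left_segments(acc, acc_p, acc_m, start, negative, carry_out):
    """Resolve carry/borrow into the segments left of the written ones.

    acc, acc_p, acc_m map segment index -> 32-bit value (ACC, ACC+1, ACC-1).
    start: first segment to the left of the written 'mid' segments."""
    i = start
    if not negative and carry_out:
        # A carry turns consecutive "All ones" segments into zero, which is
        # exactly the precomputed ACC+1 value; the first segment whose
        # "All ones" flag is clear absorbs the carry and ends the chain.
        while i in acc:
            was_all_ones = acc[i] == MASK32
            acc_m[i] = acc[i]
            acc[i] = acc_p[i]
            acc_p[i] = (acc[i] + 1) & MASK32
            i += 1
            if not was_all_ones:
                break
    elif negative:
        # A borrow turns consecutive "All zeros" segments into all ones,
        # which is exactly the precomputed ACC-1 value.
        while i in acc:
            was_all_zeros = acc[i] == 0
            acc_p[i] = acc[i]
            acc[i] = acc_m[i]
            acc_m[i] = (acc[i] - 1) & MASK32
            i += 1
            if not was_all_zeros:
                break
    # Positive with no carry-out: all left segments are left unchanged.
    return acc, acc_p, acc_m
```

Because the incremented and decremented values are precomputed in ACC+1 and ACC−1, each affected segment is updated by a register copy rather than by a ripple-carry addition across the whole accumulator.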
When the process reaches step 858, the ALU 147 determines whether any additional operands need to be multiplied and added to the accumulator. If there are additional operands, then the process returns to step 822 and restarts the accumulate process. If no further operands remain, then the process proceeds to step 860 to determine the output format: either two's complement output 862 or sign-magnitude output 864. If the two's complement output 862 is selected, the output may be taken directly from the ACC register 960 (the 556 bits from the accumulator as shown in
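The sign-magnitude read-out exploits the identity that, for a negative two's-complement value x, |x| = NOT(x − 1); holding x − 1 in the ACC−1 registers means the magnitude needs only inverters rather than a slow full-width increment. A minimal Python model (treating the concatenated registers as wide integers is an assumption of this sketch):

```python
def to_sign_magnitude(acc_value, acc_minus_1, width=556):
    """Form the sign-magnitude output without an increment.

    acc_value:   accumulator contents as an unsigned 'width'-bit integer
                 (two's-complement encoding of the dot product).
    acc_minus_1: shadow value acc_value - 1, as maintained by the ACC-1
                 registers (assumption of this simplified model)."""
    mask = (1 << width) - 1
    sign = (acc_value >> (width - 1)) & 1   # MSB is the sign bit
    if sign:
        # For negative x: |x| = ~(x - 1), a pure bitwise inversion.
        magnitude = acc_minus_1 ^ mask
    else:
        magnitude = acc_value
    return sign, magnitude
```

The inversion is a single gate delay per bit, whereas incrementing a 556-bit register would require a long carry chain.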
Accordingly, using the accumulator of the disclosed embodiments: (1) a portion of the accumulator register may be selected such that it is added to an incoming product, in contrast to conventional techniques of shifting a product and adding it to the accumulator register; (2) a single adder (CPA) may be employed instead of multiple adders; (3) ACC+1 and ACC−1 registers are used in addition to the ACC registers, which allows for fast updating of carry and borrow to the left of the accumulated sum; and (4) a sign-magnitude output may be formed by inverting the ACC−1 register instead of requiring an increment of the ACC registers, which is time consuming.
With reference to
Turning back to
With reference to
Turning to operand 2, Src0 and Src1 equal 0x00000001 with a positive sign of 0, an exponent of 0 and a mantissa of 0x000001 (blocks 1004 and 1006 of operand 2). The product of Src0_Mant and Src1_Mant is calculated as 0x0000_0000_0001, and the product exponent is 0x003 with Exp_Hi equal to 0 and Exp_Lo equal to 3 (block 1010 of operand 2). The shift amount is calculated as 28 (31−Exp_Lo, which is Exp_Lo inverted) (block 1012 of operand 2), and the start segment is 0 such that the three segment values n, n−1 and n−2 are 0x0000_0000, 0x0000_0000 and 0x0010 (block 1014 of operand 2). The three segment values are added beginning at segment 0 (block 1054) and continuing into segments −1 and −2. Since the sum is positive (sign=0), everything to the left (left segment) of the three segments (mid segment, block 1053) is selected and copied from the previous accumulator values (copied from block 1052).
Turning to operand 3, Src0 equals 0x40000001 with a negative sign of 1, an exponent of 0x80 and a mantissa of 0x800001 (block 1004 of operand 3), and Src1 equals 0x40000001 with a positive sign of 0, an exponent of 0x80 and a mantissa of 0x800001 (block 1006 of operand 3). The 10-bit 2's complement product exponent (biased) and the product exponent (unbiased) are also calculated in blocks 1008A and 1008B. The product of Src0_Mant and Src1_Mant is calculated as 0x4000_0100_0001 (block 1008 of operand 3), and the product exponent is 0x101 with Exp_Hi equal to 8 and Exp_Lo equal to 1 (block 1010 of operand 3). The shift amount (block 1012 of operand 3) is calculated as 30 (31−Exp_Lo, which is Exp_Lo inverted), and the start segment is 8 such that the three segment values n, n+1 and n+2 are 0xFFFF_FFFE, 0xFFFF_FCFF and 0xFFFC (since the sum is negative, the 2's complement of the mantissa product is taken to get the segment values) (block 1014 of operand 3). However, in this case, the sum is negative (as indicated by the sign=1) such that the 'n+1' ACC−1 segments are selected and copied into the segments to the left (left segment) of the three selected registers (block 1057 of operand 3), where 'n' represents the number of consecutive 'All zeros' segments immediately to the left of the added segments. For example, taking the three consecutive segments in block 1057, the values n, n+1 and n+2 are added to get the new value to write into the accumulator in the mid segment (block 1057).
The values of the ACC registers in block 1054 are copied to the ACC registers (block 1056), and the flags are updated to "All ones." The values of the ACC+1 and ACC−1 registers are respectively written into ACC+1 (values become all zeros) and ACC−1 (values become 0xFFFF_FFFE) in block 1056.
In one other embodiment, if the sum is positive with a carry-out, the 'n+1' ACC+1 segments to the left (left segment in block 1056) are selected for copying into the left segment registers of block 1058, where 'n' represents the number of consecutive 'All ones' segments immediately to the left of the added segments (block 1059).
It is appreciated that any number of segments and bits may be used for any floating-point format and that the depicted embodiment is but one example embodiment.
Any number of independent parameters may be selected to determine the configuration of a particular embodiment. For example, a designer may select the independent parameters to determine the configuration. More specifically, the configuration of the embodiment may be determined by the floating-point format selected, the total number of bits per product segment, and the number of bits in the overflow segment:
Number of bits in exponent, nBE=exponent width of the floating-point format;
Number of bits in mantissa, nBM=mantissa width of the floating-point format;
Number of bits in normal segment, nBNS=segment width in bits (limited to a power of 2); and
Number of bits in overflow segment, nBOS.
A number of dependent parameters are calculated based on the values of the independent parameters:
Number of bits in product, nBP=2*nBM;
Number of bits in product exponent, nBPE=1+nBE;
Number of normal product segments, nPS=2^nBPE/nBNS;
Number of bits in normal product segments, nBNPS=2^nBPE;
Maximum number of bits in additional segments due to denormals=nBSD=nBP−1;
Number of bits in exponent low, nBEL=log2(nBNS); number of bits in exponent high, nBEH=log2(nPS); maximum shift count, nMAXSHFT=nBP+nBNS−1; and number of segments selected to be added to product, nSEG=Ceiling(nMAXSHFT/nBNS).
In the depicted embodiment, the independent parameters are: nBE=8, nBM=24, nBNS=32, and nBOS=32.
In the depicted embodiment, the dependent parameters are: nBP=2*nBM=2*24=48, nBPE=1+nBE=1+8=9, nPS=2^nBPE/nBNS=2^9/32=512/32=16, nBNPS=2^nBPE=2^9=512, nBSD=nBP−1=48−1=47, nBEL=log2(nBNS)=log2(32)=5, nBEH=log2(nPS)=log2(16)=4, nMAXSHFT=nBP+nBNS−1=48+32−1=79 and nSEG=Ceiling(nMAXSHFT/nBNS)=Ceiling(79/32)=Ceiling(2.46875)=3.
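The parameter derivation above can be checked mechanically (Python sketch reproducing the depicted single-precision values; reading the formulas as nPS = 2^nBPE/nBNS and nBNPS = 2^nBPE follows from the worked numbers 512/32 = 16 and 2^9 = 512):

```python
import math


def derive_parameters(nBE, nBM, nBNS, nBOS):
    """Dependent parameters from the independent ones.

    nBOS (overflow segment width) is independent and not used in these
    derivations. Single-precision example: nBE=8, nBM=24, nBNS=32, nBOS=32."""
    nBP = 2 * nBM                       # bits in product
    nBPE = 1 + nBE                      # bits in product exponent
    nPS = 2 ** nBPE // nBNS             # number of normal product segments
    nBNPS = 2 ** nBPE                   # bits across normal product segments
    nBSD = nBP - 1                      # extra bits due to denormals
    nBEL = int(math.log2(nBNS))         # bits in exponent low
    nBEH = int(math.log2(nPS))          # bits in exponent high
    nMAXSHFT = nBP + nBNS - 1           # maximum shift count
    nSEG = math.ceil(nMAXSHFT / nBNS)   # segments added per product
    return dict(nBP=nBP, nBPE=nBPE, nPS=nPS, nBNPS=nBNPS, nBSD=nBSD,
                nBEL=nBEL, nBEH=nBEH, nMAXSHFT=nMAXSHFT, nSEG=nSEG)
```

Running this with the depicted independent parameters reproduces every dependent value listed above, including nSEG = 3 selected segments per product.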
The network system may comprise a computing system 1101 equipped with one or more input/output devices, such as network interfaces, storage interfaces, and the like. The computing system 1101 may include a central processing unit (CPU) 1110, a memory 1120, a mass storage device 1130, and an I/O interface 1160 connected to a bus 1170, where the CPU can include a microprocessor such as described above with respect to
The CPU 1110 may comprise any type of electronic data processor, including the microprocessor 1120 of
The mass storage device 1130 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 1170. The mass storage device 1130 may comprise, for example, one or more of a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
The computing system 1101 also includes one or more network interfaces 1150, which may comprise wired links, such as an Ethernet cable or the like, and/or wireless links to access nodes or one or more networks 1180. The network interface 1150 allows the computing system 1101 to communicate with remote units via the network 1180. For example, the network interface 1150 may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, the computing system 1101 is coupled to a local-area network or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, remote storage facilities, or the like. In one embodiment, the network interface 1150 may be used to receive and/or transmit interest packets and/or data packets in an ICN. Herein, the term “network interface” will be understood to include a port.
The components depicted in the computing system of
The technology described herein can be implemented using hardware, firmware, software, or a combination of these. Depending on the embodiment, the elements of the embodiments described above can include hardware only or a combination of hardware and software (including firmware). For example, logic elements programmed by firmware to perform the functions described herein are one example of elements of the described FPU. An FPU can include a processor, FPGA, ASIC, integrated circuit, or other type of circuit. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated or transitory signals.
Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/storage devices, peripherals and/or communication interfaces.
It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The description of the present disclosure has been presented for purposes of illustration and description but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.
For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
This application is a continuation of, and claims priority to, PCT Patent Application No. PCT/US2020/044664, entitled “SINGLE-CYCLE KULISCH ACCUMULATOR”, filed Jul. 31, 2020, which claims priority to U.S. Provisional Application No. 63/032,572, entitled “SINGLE-CYCLE KULISCH ACCUMULATOR”, filed May 30, 2020, which applications are incorporated by reference herein in their entirety.
Number | Date | Country
---|---|---
63032572 | May 2020 | US
 | Number | Date | Country
---|---|---|---
Parent | PCT/US2020/044664 | Jul 2020 | US
Child | 18071426 | | US