The invention relates to a vector processor, and more particularly, to a vector processor configured to perform vector reduction and element reduction.
Single instruction multiple data (SIMD) is widely used for parallel data processing in vector processors. In general, vector processors may use vector reduction and element reduction to reduce vector data to scalar values. However, when vector reduction and element reduction are implemented in a fully pipelined manner in the prior art, the doubling of computational logic and the extensive wiring needed for element data shuffling bloat the circuit area, which may increase power dissipation and create congestion and timing problems. Moreover, when the vector processor is configured for floating point reduction, dot product, or a larger vector register length (VLEN) or data path length (DLEN) such as 512, 1024, or 2048 bits, the above problems are exacerbated.
The invention provides a vector processor and a vector and element reduction method thereof that may flexibly adjust the number of iterations based on optimized hardware performance indicators or software performance indicators.
An embodiment of the invention provides a vector processor. The vector processor includes a vector register file, a first lane, and a second lane. The first lane is coupled to the vector register file to load a first operand and a first part of a second operand based on a first state parameter, and the first lane performs a first reduction operation on the first operand and the first part of the second operand to generate a first part of a first reduction result. The second lane is coupled to the vector register file to load a second part of the second operand based on the first state parameter, and the second lane uses the second part of the second operand as a second part of the first reduction result. One of the first lane or the second lane performs a second reduction operation on the first part and the second part of the first reduction result based on a second state parameter to generate a second reduction result.
An embodiment of the invention provides a vector reduction method. The vector reduction method includes: loading a first operand and a first part of a second operand based on a first state parameter, and performing a first reduction operation on the first operand and the first part of the second operand to generate a first part of a first reduction result; loading a second part of the second operand based on the first state parameter, and using the second part of the second operand as a second part of the first reduction result; and performing a second reduction operation on the first part and the second part of the first reduction result based on a second state parameter to generate a second reduction result.
An embodiment of the invention provides a vector processor. The vector processor includes a vector register file and a first lane. The first lane is coupled to the vector register file to load a first operand and a second operand based on a first state parameter and performs a first reduction operation on the first operand and the second operand to generate a first reduction result, and performs a second reduction operation on a first part and a second part of the first reduction result based on a second state parameter to generate a second reduction result.
An embodiment of the invention provides an element reduction method. The element reduction method includes: loading a first operand and a second operand based on a first state parameter and performing a first reduction operation on the first operand and the second operand to generate a first reduction result, and performing a second reduction operation on a first part and a second part of the first reduction result based on a second state parameter to generate a second reduction result.
Based on the above, in some embodiments of the invention, the vector processor may execute different steps in the reduction operation with the same circuit based on the state parameters, thereby saving circuit area and improving reduction operation performance. Moreover, the vector processor may perform the vector reduction operation and the element reduction operation with the same circuit structure, so as to further save circuit area.
In order to make the aforementioned features and advantages of the disclosure more comprehensible, embodiments accompanied with figures are described in detail below.
The term “coupled to (or connected to)” used in the entire text of the specification of the present application (including claims) may refer to any direct or indirect connecting means. For example, if the text describes that a first device is coupled to (or connected to) a second device, then it should be understood that the first device may be directly connected to the second device, or the first device may be indirectly connected to the second device via other devices or certain connecting means. Moreover, when applicable, devices/components/steps having the same reference numerals in figures and embodiments represent the same or similar parts. Elements/components/steps having the same reference numerals or having the same terminology in different embodiments may be cross-referenced.
In the present embodiment, the operand VS1[E0] (Element 0) in the operand VS1[E*] read from the vector register file 110 needs to be reduced, and the part of the operand VS1[E*] except the operand VS1[E0] is masked (no reduction operation is needed, that is, inactive elements) and filled with the inactive value INAV, thus an operand adjVS1[E*] (adjusted first operand) is generated. In particular, VS1[E*] represents all elements in the operand VS1, and the operand VS1[E0] represents the 0th element of VS1. It should be noted that the operation of the elements filled with the inactive value INAV is an invalid operation, so in fact, although the reduction operation is still performed, the result is equivalent to no reduction operation.
A plurality of multiplexers MUX2 may select the elements that do not need to be masked (a reduction operation is needed) in an operand VS2[E*] read from the vector register file 110 based on a mask-bit VM[*], and replace the elements in the operand VS2[E*] that need to be masked (a reduction operation is not needed, that is, inactive elements) with the inactive value INAV, thus an operand adjVS2[E*] (adjusted second operand) is generated. In particular, the mask-bit VM[*] represents all mask bits.
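As an illustration only, the generation of the adjusted operands described above may be sketched as follows. The sketch assumes a SUM reduction, so that the inactive value INAV is taken to be 0 (the identity of addition), and assumes that a mask bit of 1 marks an active element; the mask polarity and the function name `adjust_operands` are hypothetical and are not taken from the description.

```python
INAV = 0  # assumption: identity of SUM, so inactive elements leave the result unchanged


def adjust_operands(vs1, vs2, vm, inav=INAV):
    """Return (adjVS1, adjVS2) as lists of elements."""
    # adjVS1 keeps only element 0 of VS1; every other position is filled with INAV.
    adj_vs1 = [vs1[0]] + [inav] * (len(vs1) - 1)
    # adjVS2 keeps the active elements of VS2 (mask bit 1, assumed polarity)
    # and replaces the inactive elements with INAV.
    adj_vs2 = [e if m else inav for e, m in zip(vs2, vm)]
    return adj_vs1, adj_vs2


adj_vs1, adj_vs2 = adjust_operands([10, 99, 99, 99], [1, 2, 3, 4], [1, 0, 1, 1])
print(adj_vs1, adj_vs2)
```

Because the masked positions hold the identity value, reducing them is harmless: the sum over both adjusted operands equals VS1[E0] plus the active elements of VS2.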
Then, the multiplexer MUX3 may select the operand adjVS1[E*] based on the state parameter STATE corresponding to the Initial State 202 as the input source SRC1, and the multiplexer MUX4 may select the operand adjVS2[E*] based on the state parameter STATE corresponding to the Initial State 202 as the input source SRC2. The arithmetic logic unit ALU is coupled to the output of the multiplexer MUX3 and the output of the multiplexer MUX4. The arithmetic logic unit ALU performs arithmetic logic operations on the input source SRC1 and the input source SRC2 to generate a lane output LCO[E*].
Regarding the arithmetic logic operations performed by the arithmetic logic unit ALU on the input source SRC1 and the input source SRC2 in the Initial State 202, please refer to
For example, in the Initial State 202, the vector processor 10 may load the operand adjVS1[L0] to the register ACC[L0] and load the operand adjVS2[L0] (not shown) to VN[L0], and use the accumulation result of the operand adjVS1[L0] and the operand adjVS2[L0] as the lane output LCO[L0]. The vector processor 10 loads the inactive value INAV to the register ACC[L1] via the operand adjVS1[L1] and loads the operand adjVS2[L1] to the register VN[L1], and uses the accumulation result (that is, the operand adjVS2[L1]) of the inactive value INAV and the operand adjVS2[L1] as the lane output LCO[L1]. The vector processor 10 loads the inactive value INAV to the register ACC[L2] via the operand adjVS1[L2] and loads the operand adjVS2[L2] to the register VN[L2], and uses the accumulation result of the inactive value INAV and the operand adjVS2[L2] as the lane output LCO[L2]. The vector processor 10 loads the inactive value INAV to the register ACC[L3] via the operand adjVS1[L3] and loads the operand adjVS2[L3] to the register VN[L3], and uses the accumulation result of the inactive value INAV and the operand adjVS2[L3] as the lane output LCO[L3]. In an embodiment, the lane output LCO[L0] to the lane output LCO[L3] are, for example, 64 bits respectively, for a total of 256 bits.
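The per-lane accumulation of the Initial State described above may be sketched as follows. This is a minimal model under stated assumptions: four 64-bit lanes, one 64-bit element per lane, a SUM reduction with INAV taken to be 0, and wrap-around at the lane width; the name `initial_state` is hypothetical.

```python
LANE_BITS = 64
LANE_MASK = (1 << LANE_BITS) - 1
INAV = 0  # assumption: identity of SUM


def initial_state(adj_vs1, adj_vs2):
    """Model ACC[Li] + VN[Li] -> LCO[Li] for every lane in parallel."""
    lco = []
    for acc, vn in zip(adj_vs1, adj_vs2):
        # ACC[Li] is loaded from adjVS1[Li], VN[Li] from adjVS2[Li].
        lco.append((acc + vn) & LANE_MASK)
    return lco


# Lane 0 carries VS1[E0]; lanes 1 to 3 carry INAV on the ACC side.
lco = initial_state([7, INAV, INAV, INAV], [1, 2, 3, 4])
print(lco)
```

In lanes 1 to 3 the lane output equals the corresponding element of adjVS2, which is why those lanes effectively pass the second operand through.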
Returning to
In particular, LMUL is the vector length multiplier. When the vector length multiplier LMUL is 1, one instruction may operate on one vector register, and when the vector length multiplier LMUL is greater than 1, one instruction may operate on LMUL vector registers. The vector length multiplier LMUL combines a plurality of vector registers into one vector register group. For example, if the vector length multiplier LMUL is 4 in the vector reduction operation, the operand adjVS2[E*], i.e. one vector register group, consists of 4 vector registers. VLEN is the vector register length, that is, the width of each vector register in the vector register file 110, for example, 256 bits. The vector register length VLEN is equal to the sum of the widths of the vector register bank 111, the vector register bank 112, the vector register bank 113, and the vector register bank 114. DLEN is the data path length, that is, the data width of one operation, for example, 256 bits. In an example of the invention, the vector register length VLEN is equal to the data path length DLEN, but the vector register length VLEN may also differ from the data path length DLEN, and the invention is not limited thereto.
Specifically, referring to
Please refer to
For example, in
After the Lanes Reduction State 204 in step S220 is completed, the vector processor 10 may determine whether the element length ELEN is smaller than the length of a single lane, and based on the determination result, decide whether to perform one of a normal reduction operation or a fast reduction operation on the reduced single lane output LCO_L0. When the element length ELEN is less than the length of a single lane, the state parameter STATE is changed to the Single Lane Reduction State 205 in step S230 to perform one of a normal reduction operation or a fast reduction operation on the reduced single lane output LCO_L0. When the element length ELEN is equal to the length of a single lane, the state parameter STATE is changed to the Idle/Complete State 201 without performing any reduction operation on the reduced single lane output LCO_L0, and the value of the reduced single lane output LCO_L0 is used as the result of the vector reduction operation.
In an embodiment, the length of a single lane is, for example, 64 bits. When the element length ELEN is less than 64 bits, the vector processor 10 enters the Single Lane Reduction State 205 in step S230 to perform one of a normal reduction operation or a fast reduction operation, and when the element length ELEN is equal to 64 bits, the vector processor 10 enters the Idle/Complete State 201 without performing any of a normal reduction operation or a fast reduction operation. It should be mentioned that, in the Single Lane Reduction State 205 in step S230, based on design requirements, the vector processor 10 may perform the normal reduction operation via the multiplexer MUX3, the multiplexer MUX4, the multiplexer MUX5, and the arithmetic logic unit ALU in the lane 121, or may perform the fast reduction operation via the multiplexer MUX3, the multiplexer MUX4, the multiplexer MUX5, the arithmetic logic unit ALU, and the fast reduction circuit 310 in the lane 121. The selection of the normal reduction operation and the fast reduction operation may be realized by the operator OP for the multiplexer MUX5. For example, when the operator OP is arithmetic logic reduction such as SUM reduction, the normal reduction operation is selected, and when the operator OP is bitwise logic reduction, such as OR reduction, a fast reduction operation is selected, but the invention is not limited thereto.
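The decision made after the Lanes Reduction State may be sketched as follows. The sketch assumes a 64-bit lane and models the operator-based selection as a simple lookup; the state names as strings, the operator sets, and the function name `after_lanes_reduction` are illustrative assumptions, and the description notes that the selection is not limited to this mapping.

```python
LANE_BITS = 64  # length of a single lane in the described embodiment

# Assumed grouping: bitwise logic operators take the fast path,
# all other (arithmetic) operators take the normal path.
FAST_OPS = {"AND", "OR", "XOR"}


def after_lanes_reduction(elen, op):
    """Return (next_state, reduction_mode) once LCO_L0 has been produced."""
    if elen == LANE_BITS:
        # LCO_L0 already holds the final result; no further reduction.
        return ("IDLE_COMPLETE", None)
    if elen < LANE_BITS:
        mode = "fast" if op in FAST_OPS else "normal"
        return ("SINGLE_LANE_REDUCTION", mode)
    raise ValueError("ELEN larger than a single lane is not modelled here")


print(after_lanes_reduction(64, "SUM"))
print(after_lanes_reduction(8, "SUM"))
print(after_lanes_reduction(16, "OR"))
```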
When the element length ELEN is 8 bits, the vector processor 10 uses the bytes B6, B4, B2, and B0 in the lane output LCO_L0 (second reduction result) as the input source SRC1, and uses the bytes B7, B5, B3, and B1 in the lane output LCO_L0 as the input source SRC2. Specifically, the multiplexer MUX3 may select the even-numbered part EVEN of the lane output LCO_L0 as the input source SRC1 based on the state parameter STATE corresponding to the normal reduction operation, and the multiplexer MUX4 may select the odd-numbered part ODD of the lane output LCO_L0 as the input source SRC2 based on the state parameter STATE corresponding to the normal reduction operation. In an embodiment, the arithmetic logic unit ALU may add 4 groups of 8′b0 to the input source SRC1 and the input source SRC2 respectively, and perform 8 sets of accumulation operations with an operation width SIMD_SIZE of 8 bits on the input source SRC1 and the input source SRC2, thereby generating the bytes HW3, HW2, HW1, and HW0, wherein the bytes HW3, HW2, HW1, and HW0 are all 16-bit. In another embodiment (not shown), the input source SRC1 and the input source SRC2 are accumulated in 4 sets with an operation width SIMD_SIZE of 8 bits, and 8′b0 are added to the accumulation result respectively to perform zero-extension to generate the bytes HW3, HW2, HW1, and HW0, wherein the bytes HW3, HW2, HW1, and HW0 are all 16 bits. Please note that the accumulation result is located in the low-order bits of a byte, and the zero-extension is to pad 0s to the high-order bits of a byte. For example, the accumulation result of the byte HW3 is located in the lower 8 bits of the 16 bits, and the filled 8′b0 are located in the upper 8 bits of the 16 bits. The following is the same, and is not repeated herein. It is worth mentioning that, in the present embodiment, when the SIMD_ALU performs a sum operation on 8 bits, the operation result may only be stored in 8 bits and is not carried into the 9th bit. That is to say, since the carry out of each 8-bit sum is discarded, the zero-extension in the input source or in the accumulation result does not affect the final result.
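The equivalence of the two zero-extension variants for the first 8-bit step may be checked with the following sketch. It assumes a 64-bit lane whose byte B0 occupies the least significant bits; the helper names `pack` and `simd_add` are hypothetical and only model a SIMD adder that discards the carry out of each field.

```python
def pack(fields, width):
    """Pack a list of fields (index 0 = least significant) into one integer."""
    out = 0
    for i, f in enumerate(fields):
        out |= (f & ((1 << width) - 1)) << (i * width)
    return out


def simd_add(a, b, width, total_bits=64):
    """Add a and b field by field; the carry out of each field is discarded."""
    mask = (1 << width) - 1
    out = 0
    for i in range(total_bits // width):
        s = ((a >> (i * width)) & mask) + ((b >> (i * width)) & mask)
        out |= (s & mask) << (i * width)
    return out


b = [0xFF, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08]  # B0..B7
even, odd = b[0::2], b[1::2]                            # EVEN and ODD parts

# Variant A: zero-extend each source byte to 16 bits, then 8 adds of 8 bits.
variant_a = simd_add(pack(even, 16), pack(odd, 16), 8)

# Variant B: 4 adds of 8 bits, then zero-extend each sum to 16 bits.
variant_b = pack([(e + o) & 0xFF for e, o in zip(even, odd)], 16)

print(hex(variant_a), hex(variant_b))
```

In both variants HW0 holds (0xFF + 0x02) modulo 256, that is 0x01, in its lower 8 bits and zeros in its upper 8 bits, so the two orderings yield the same HW3 to HW0.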
Next, the vector processor 10 uses the bytes HW2 and HW0 as the input source SRC1, and uses the bytes HW3 and HW1 as the input source SRC2. Specifically, the multiplexer MUX3 may select the bytes HW2 and HW0 as the input source SRC1 based on the state parameter STATE corresponding to the normal reduction operation, and the multiplexer MUX4 may select the bytes HW3 and HW1 as the input source SRC2 based on the state parameter STATE corresponding to the normal reduction operation. The arithmetic logic unit ALU may add 2 groups of 16′b0 to the input source SRC1 and the input source SRC2 respectively, and performs 4 sets of accumulation operations with an operation width SIMD_SIZE of 16 bits on the input source SRC1 and the input source SRC2, thereby generating the bytes W1 and W0, wherein the bytes W1 and W0 are both 32-bit. In another embodiment (not shown), the input source SRC1 and the input source SRC2 are accumulated in 2 sets with an operation width SIMD_SIZE of 16 bits, and 16′b0 are added to the accumulation result to perform zero-extension to generate the bytes W1 and W0, wherein the bytes W1 and W0 are both 32 bits.
Next, the vector processor 10 uses the byte W0 as the input source SRC1, and uses the byte W1 as the input source SRC2. Specifically, the multiplexer MUX3 may select the byte W0 as the input source SRC1 based on the state parameter STATE corresponding to the normal reduction operation, and the multiplexer MUX4 may select the byte W1 as the input source SRC2 based on the state parameter STATE corresponding to the normal reduction operation. The arithmetic logic unit ALU may add 1 group of 32′b0 to the input source SRC1 and the input source SRC2 respectively, and performs 2 sets of accumulation operations with an operation width SIMD_SIZE of 32 bits on the input source SRC1 and the input source SRC2, thereby generating a byte DW0 (that is, double-word), wherein the byte DW0 is 64-bit. In another embodiment (not shown), the input source SRC1 and the input source SRC2 are accumulated in 1 set with an operation width SIMD_SIZE of 32 bits and 32′b0 are added to the accumulation result to perform zero-extension to generate the byte DW0, wherein the byte DW0 is 64 bits and used as the normal reduction output NOUT of the normal reduction operation (that is, the normal reduction result, corresponding to the result of the lane output LCO[E*] in the Single Lane Reduction State 205). When the element length ELEN is 16 bits, the vector processor 10 uses the bytes HW2 and HW0 in the lane output LCO_L0 (second reduction result) as the input source SRC1, and uses the bytes HW3 and HW1 in the lane output LCO_L0 as the input source SRC2. For the subsequent process, please refer to the related content that the element length ELEN is 8 bits, which is not repeated herein. Similarly, when the element length ELEN is 32 bits, the vector processor 10 uses the byte W0 in the lane output LCO_L0 (second reduction result) as the input source SRC1, and uses the byte W1 in the lane output LCO_L0 as the input source SRC2. 
For the subsequent process, please refer to the related content that the element length ELEN is 8 bits, which is not repeated herein. Compared to
In an embodiment, the fast reduction circuit 310 may divide the lane output LCO_L0 into 8 bytes such as bytes B7 to B0, and each of the bytes B7 to B0 consists of 8 bits, wherein the bytes B7, B5, B3, and B1 belong to an odd-numbered part ODD, and the bytes B6, B4, B2, and B0 belong to an even-numbered part EVEN. The difference between
Referring to
Accordingly, in the same cycle, the fast reduction circuit 310 provides the bytes HW3, HW2, HW1, and HW0 to the multiplexer MUX7 as data HW. Moreover, the fast reduction circuit 310 accumulates the byte B7 and the byte B6, and adds 16 0s to the accumulation result to perform zero-extension (that is, 16′b0 in
When the element length ELEN=8 or 16, the multiplexer MUX7 selects the data W′ to be loaded into the bytes W1 and W0 respectively. When the element length ELEN=32, the multiplexer MUX7 selects the data HW to be loaded into the bytes W1 and W0 respectively. In the same cycle, the fast reduction circuit 310 accumulates the byte W1 and the byte W0, and adds 32 zeros to the accumulation result to perform zero-extension (that is, 32′b0 in
In other words, in the fast reduction operation, the fast reduction circuit 310 uses a plurality of (smaller width) arithmetic logic units ALUs and multiplexers, so that all accumulation operations and selection operations may be completed in one cycle. Compared with the normal reduction operation, the fast reduction circuit 310 does not need a plurality of additional cycles to perform the iteration operations, thus improving the efficiency of the reduction operation.
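A behavioural sketch of such a single-cycle fast reduction is given below for an OR reduction. It is an assumption-laden model rather than the circuit itself: the bypass in the first stage is inferred by analogy with the selection described for the multiplexer MUX7, the byte B0 is assumed to occupy the least significant bits, and the names `fields` and `fast_reduce_or` are hypothetical.

```python
def fields(lane, width, total_bits=64):
    """Split a lane into fields of the given width (index 0 = least significant)."""
    mask = (1 << width) - 1
    return [(lane >> (i * width)) & mask for i in range(total_bits // width)]


def fast_reduce_or(lane, elen):
    """OR-reduce all ELEN-bit elements of a 64-bit lane as one combinational chain."""
    b = fields(lane, 8)
    # Stage 1: fold byte pairs (only used when ELEN is 8 bits).
    hw_folded = [b[2 * i] | b[2 * i + 1] for i in range(4)]
    hw = hw_folded if elen == 8 else fields(lane, 16)
    # Stage 2: fold half-word pairs (used when ELEN is 8 or 16 bits);
    # otherwise the raw 32-bit words of the lane bypass this stage.
    w_folded = [hw[2 * i] | hw[2 * i + 1] for i in range(2)]
    w = w_folded if elen in (8, 16) else fields(lane, 32)
    # Stage 3: fold the two words into the final result.
    return w[0] | w[1]


lane = 0x8040201008040201
print(hex(fast_reduce_or(lane, 8)))
print(hex(fast_reduce_or(lane, 16)))
print(hex(fast_reduce_or(lane, 32)))
```

All three stages are evaluated in one pass with no state carried between iterations, which mirrors the single-cycle behaviour attributed to the fast reduction circuit.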
Returning to
In
Regarding the arithmetic logic operations performed by the arithmetic logic unit ALU on the input source SRC1 and the input source SRC2 in the Initial State 802, please refer to
It should be noted that the difference between
For example, when the sub-element length SELEN is 8 bits and the element length ELEN is 16 bits, the vector processor 10 may use the bytes B6, B4, B2, B0 in the lane output LCO[LM](first reduction result) as the input source SRC1 and use the bytes B7, B5, B3, and B1 in the lane output LCO[LM] as the input source SRC2. Specifically, the multiplexer MUX3 may select the even-numbered part EVEN of the lane output LCO[LM] as the input source SRC1 based on the state parameter STATE corresponding to the normal reduction operation, and the multiplexer MUX4 may select the odd-numbered part ODD of the lane output LCO[LM] as the input source SRC2 based on the state parameter STATE corresponding to the normal reduction operation. In an embodiment, the arithmetic logic unit ALU may add 4 groups of 8′b0 to the input source SRC1 and the input source SRC2 respectively, and performs 8 sets of accumulation with an operation width SIMD_SIZE of 8 bits on the input source SRC1 and the input source SRC2, thereby generating the bytes HW3, HW2, HW1, and HW0, wherein the bytes HW3, HW2, HW1, and HW0 are all 16-bit and used as the normal reduction output NOUT (that is, normal reduction result, corresponding to the lane output LCO[LM]).
If the sub-element length SELEN is 8 bits and the element length ELEN is 64 bits, following the above, after the bytes HW3, HW2, HW1, and HW0 are generated, the vector processor 10 uses the bytes HW2 and HW0 as the input source SRC1, and uses the bytes HW3 and HW1 as the input source SRC2. Specifically, the multiplexer MUX3 may select the bytes HW2 and HW0 as the input source SRC1 based on the state parameter STATE corresponding to the normal reduction operation, and the multiplexer MUX4 may select the bytes HW3 and HW1 as the input source SRC2 based on the state parameter STATE corresponding to the normal reduction operation. In an embodiment, the arithmetic logic unit ALU may add 2 groups of 16′b0 to the input source SRC1 and the input source SRC2 respectively, and performs 4 sets of accumulation with an operation width SIMD_SIZE of 16 bits on the input source SRC1 and the input source SRC2, thereby generating the bytes W1 and W0, wherein the bytes W1 and W0 are both 32-bit. Next, the vector processor 10 uses the byte W0 as the input source SRC1, and uses the byte W1 as the input source SRC2. Specifically, the multiplexer MUX3 may select the byte W0 as the input source SRC1 based on the state parameter STATE corresponding to the normal reduction operation, and the multiplexer MUX4 may select the byte W1 as the input source SRC2 based on the state parameter STATE corresponding to the normal reduction operation. In an embodiment, the arithmetic logic unit ALU may add 1 group of 32′b0 to the input source SRC1 and the input source SRC2 respectively, and performs 2 sets of accumulation with an operation width SIMD_SIZE of 32 bits on the input source SRC1 and the input source SRC2, thereby generating the byte DW0, wherein the byte DW0 is 64-bit and the byte DW0 is used as the normal reduction output NOUT (that is, normal reduction result, corresponding to the lane output LCO[LM]). 
Similarly, for the combination of other element lengths ELEN and sub-element lengths SELEN, please refer to the above. The difference between different sub-element lengths SELEN is that the starting position is different, and the difference between different element lengths ELEN is that the end position is different, which is not repeated herein.
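The iterative normal reduction, with the starting position set by the sub-element length SELEN and the end position set by the element length ELEN, may be sketched as follows for a SUM reduction. The sketch assumes a 64-bit lane with B0 in the least significant bits and discards the carry out at the current operation width, as described for the 8-bit step; the function name `normal_reduce` and the returned cycle count are illustrative assumptions.

```python
def normal_reduce(lane, selen, elen, lane_bits=64):
    """Fold even/odd fields until the field width reaches ELEN.

    Returns (packed_result, number_of_iterations).
    """
    width = selen
    mask = (1 << width) - 1
    vals = [(lane >> (i * width)) & mask for i in range(lane_bits // width)]
    cycles = 0
    while width < elen:
        mask = (1 << width) - 1          # operation width SIMD_SIZE of this step
        even, odd = vals[0::2], vals[1::2]  # SRC1 = EVEN part, SRC2 = ODD part
        # Each sum wraps at the operation width and is then treated as
        # zero-extended to twice that width.
        vals = [(e + o) & mask for e, o in zip(even, odd)]
        width *= 2
        cycles += 1
    packed = 0
    for i, v in enumerate(vals):
        packed |= v << (i * width)
    return packed, cycles


lane = 0x0807060504030201  # B0..B7 = 1..8
print(normal_reduce(lane, 8, 16))
print(normal_reduce(lane, 8, 64))
print(normal_reduce(lane, 32, 64))
```

With SELEN of 8 bits, an ELEN of 16 bits stops after one iteration (HW3 to HW0), whereas an ELEN of 64 bits needs three iterations (DW0), which illustrates why the normal reduction takes a variable number of cycles.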
In an embodiment, the fast reduction circuit 910 divides the lane output LCO[LM] into 8 bytes such as bytes B7 to B0, and each of the bytes B7 to B0 consists of 8 bits, wherein the bytes B7, B5, B3, and B1 belong to an odd-numbered part ODD, and the bytes B6, B4, B2, and B0 belong to an even-numbered part EVEN. The difference between
Referring to
Accordingly, in the same cycle, the fast reduction circuit 910 provides the bytes HW3, HW2, HW1, and HW0 to the multiplexer MUX9 as data HW. Moreover, the fast reduction circuit 910 accumulates the byte B7 and the byte B6 with the operation width SIZE of 16 bits, and adds 16 0s to the accumulation result to perform zero-extension (that is, 16′b0 in
When the sub-element length SELEN=8 or 16, the multiplexer MUX9 selects the data W′ to be loaded into the bytes W1 and W0 respectively. When the sub-element length SELEN=32, the multiplexer MUX9 selects the data HW to be loaded into the bytes W1 and W0 respectively. In the same cycle, the fast reduction circuit 910 accumulates the byte W1 and the byte W0 with an operation width SIZE of 32 bits, and adds 32 zeros to the accumulation result to perform zero-extension (that is, 32′b0 in
In the present embodiment, the multiplexer MUX10 receives the data HW′, the data W′, and the data DW0, and the multiplexer MUX10 selects one of the data HW′, the data W′, or the data DW0 as the fast reduction output FOUT (fast reduction result) based on the element length ELEN. Specifically, when the element length ELEN is 16 bits, the multiplexer MUX10 may select the data HW′ as the fast reduction output FOUT. When the element length ELEN is 32 bits, the multiplexer MUX10 may select the data W′ as the fast reduction output FOUT. When the element length ELEN is 64 bits, the multiplexer MUX10 may select the data DW0 as the fast reduction output FOUT.
In other words, in the fast reduction operation, the fast reduction circuit 910 uses a plurality of multiplexers and (smaller width) ALUs, so that all accumulation operations and selection operations may be completed in one cycle. Compared with the normal reduction operation, the fast reduction circuit 910 does not need a plurality of additional cycles to perform the iteration operations, thus improving the efficiency of the reduction operation.
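For the element reduction case, the single-cycle data flow with an input bypass chosen by SELEN and an output selection chosen by ELEN may be modelled as below for an OR reduction. This is a sketch under assumptions: the first-stage bypass is inferred by analogy with the multiplexer MUX9, B0 is assumed to be the least significant byte, and the names `fields`, `pack`, and `fast_element_reduce_or` are hypothetical.

```python
def fields(lane, width, total_bits=64):
    mask = (1 << width) - 1
    return [(lane >> (i * width)) & mask for i in range(total_bits // width)]


def pack(vals, width):
    out = 0
    for i, v in enumerate(vals):
        out |= v << (i * width)
    return out


def fast_element_reduce_or(lane, selen, elen):
    """OR-reduce the SELEN-bit sub-elements inside every ELEN-bit element."""
    if selen >= elen or elen not in (16, 32, 64):
        raise ValueError("expected SELEN < ELEN with ELEN in {16, 32, 64}")
    b = fields(lane, 8)
    hw_folded = [b[2 * i] | b[2 * i + 1] for i in range(4)]      # data HW'
    hw = hw_folded if selen == 8 else fields(lane, 16)
    w_folded = [hw[2 * i] | hw[2 * i + 1] for i in range(2)]     # data W'
    w = w_folded if selen in (8, 16) else fields(lane, 32)
    dw0 = w[0] | w[1]                                            # data DW0
    # Output selection (role of the multiplexer MUX10) based on ELEN.
    if elen == 16:
        return pack(hw_folded, 16)
    if elen == 32:
        return pack(w_folded, 32)
    return dw0


lane = 0x8040201008040201
print(hex(fast_element_reduce_or(lane, 8, 16)))
print(hex(fast_element_reduce_or(lane, 16, 32)))
print(hex(fast_element_reduce_or(lane, 8, 64)))
```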
It is worth mentioning that the arithmetic logic operations in the normal reduction operations of the invention are usually arithmetic operations, such as finding a maximum value MAX, finding a minimum value MIN, and finding an accumulated value SUM. Moreover, arithmetic logic operations in fast reduction operations are usually logic operations, such as logical AND, OR, and XOR.
In other embodiments, the accumulation operation described above may be supplemented with a saturation reduction operation. Specifically, each accumulation operation further checks whether the accumulation result is above the maximum saturation value or below the minimum saturation value. If the accumulation result is greater than the maximum saturation value, the accumulation result is replaced with the maximum saturation value, and if the accumulation result is less than the minimum saturation value, the accumulation result is replaced with the minimum saturation value.
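The saturation check may be sketched as follows. The saturation bounds are assumed to be the representable range of the element width (two's complement when signed); the description does not fix these values, and the function name `saturating_add` is hypothetical.

```python
def saturating_add(a, b, bits=8, signed=True):
    """Add two values and clamp the result to the assumed saturation range."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    total = a + b
    if total > hi:
        return hi  # replace with the maximum saturation value
    if total < lo:
        return lo  # replace with the minimum saturation value
    return total


print(saturating_add(100, 100))          # clamps to the signed 8-bit maximum
print(saturating_add(-100, -100))        # clamps to the signed 8-bit minimum
print(saturating_add(200, 100, signed=False))
```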
Continuing from the above, in the same cycle, the fast reduction circuit provides the bytes HW3_0, HW3_1, HW2_0, HW2_1, HW1_0, HW1_1, HW0_0, HW0_1 to the multiplexer MUX12 as the data HW. The fast reduction circuit folds the bytes HW3_0, HW3_1, HW2_0, HW2_1 and loads them in parallel into a 4-to-2 SIMD carry save adder compressor (4to2CSA1), to compress four input bytes into two output bytes, and 0s are added to perform zero-extension (that is, 16′b0 in
In the same cycle, the multiplexer MUX12 loads one of the data HW or the data W′ into the bytes W1_0, W1_1, W0_0, and W0_1 based on the sub-element length SELEN (equivalent to the element length ELEN). The fast reduction circuit folds the bytes W1_0, W1_1, W0_0, and W0_1 and loads them in parallel into a 4-to-2 SIMD carry save adder compressor (4to2CSA3), to compress four input bytes into two output bytes, and 0s are added to perform zero-extension (that is, 32′b0 in
Then, in the same cycle, the multiplexer MUX13 has different operation modes based on a received control signal RED. Specifically, based on the control signal RED, when the operation is vector reduction, the sub-element length SELEN of the multiplexers MUX11 and MUX12 is equivalent to the element length ELEN, and the multiplexer MUX13 always selects the data DW′ as the output. Moreover, based on the control signal RED, when the operation is element reduction, the multiplexer MUX13 selects one of the data HW′, W′, or DW′ based on the element length ELEN to be loaded into the bytes DW_0 and DW_1. Next, the single-instruction-multiple-data adder SIMD_ADDER accumulates the byte DW_0 and the byte DW_1 based on the element length ELEN to generate the fast reduction output FOUT.
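The role of a 4-to-2 carry save adder compressor followed by a final adder may be illustrated with the following sketch. It shows the general carry-save technique built from two 3-to-2 stages, not the specific circuit of the embodiment; the 16-bit width and the function names `csa_3to2` and `csa_4to2` are assumptions for illustration.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def csa_3to2(a, b, c):
    """3-to-2 carry save stage: a + b + c == s + cy (mod 2**WIDTH)."""
    s = a ^ b ^ c                                         # bitwise sum without carries
    cy = (((a & b) | (a & c) | (b & c)) << 1) & MASK      # carries shifted into place
    return s, cy


def csa_4to2(a, b, c, d):
    """Compress four inputs into two outputs using two 3-to-2 stages."""
    s1, c1 = csa_3to2(a, b, c)
    return csa_3to2(s1, c1, d)


a, b, c, d = 0xFFFF, 0x0001, 0x1234, 0x8000
s, cy = csa_4to2(a, b, c, d)
# A single carry-propagating adder (the role of SIMD_ADDER) finishes the sum.
result = (s + cy) & MASK
print(hex(s), hex(cy), hex(result))
```

Deferring carry propagation to one final adder is what allows several folding levels to be chained within one cycle for arithmetic operators.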
It should be mentioned that, in
In other embodiments, vector reduction operations may also be applied to vector dot product reduction. Specifically, dot product reduction may perform fast element-wise multiplication between source elements and then accumulate the result into a destination scalar element. It should be noted that, in the present embodiment, the dot product is defined, for example, as follows: each element VS1[E*] in the operand VS1 is multiplied by the corresponding element VS2[E*] in the operand VS2 to obtain a product element MUL[E*] (MUL[E*]=VS1[E*]×VS2[E*]); the first element MUL[E0] of the product element is added to the operand VS3[E0] (that is, VD[E0]) to obtain a multiply-accumulate element MAC[E0] (MAC[E0]=VS1[E0]×VS2[E0]+VS3[E0]); and the other elements MUL[E*] of the product element are added to the operand 0 to obtain the multiply-accumulate element MAC[E*] (the value thereof is equivalent to MUL[E*], MAC[E*]=VS1[E*]×VS2[E*]+0). In particular, when the unit vector length multiplier LMUL′ is equal to 1, all multiply-accumulate elements MAC[E*] are directly accumulated (that is, ΣMAC[E*]) after the first iteration is completed. When the unit vector length multiplier LMUL′ is greater than 1, the intermediate value (that is, the multiply-accumulate element MAC[E*]) is loaded to the source input ACC[E*] after each iteration, and in the next iteration, the product of the operand VS1[E*] and the operand VS2[E*] is added to the source input ACC[E*] (that is, MAC[E*]′=VS1[E*]′×VS2[E*]′+ACC[E*]), until all iterations are completed, and then all the elements inside the source input ACC[E*] are accumulated (that is, ΣACC[E*]).
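The multiply-accumulate iterations of the dot product reduction may be sketched as follows. The sketch models each iteration as one register of the group (one entry of `vs1_regs` and `vs2_regs` per iteration, LMUL′ entries in total) and uses unbounded Python integers, so overflow handling is not modelled; the function name `dot_reduce` is hypothetical.

```python
def dot_reduce(vs1_regs, vs2_regs, vs3_e0):
    """Return VS3[E0] + sum over all elements of VS1[E*] x VS2[E*]."""
    n = len(vs1_regs[0])
    # First iteration adds VS3[E0] to element 0 and 0 to every other element,
    # which is modelled by pre-loading ACC[E*] with [VS3[E0], 0, 0, ...].
    acc = [vs3_e0] + [0] * (n - 1)
    for vs1, vs2 in zip(vs1_regs, vs2_regs):
        # MAC[E*] = VS1[E*] x VS2[E*] + ACC[E*], then loaded back into ACC[E*].
        acc = [a * b + c for a, b, c in zip(vs1, vs2, acc)]
    # After the last iteration, all elements of ACC[E*] are accumulated.
    return sum(acc)


print(dot_reduce([[1, 2, 3, 4]], [[5, 6, 7, 8]], 10))          # LMUL' = 1
print(dot_reduce([[1, 2], [3, 4]], [[5, 6], [7, 8]], 10))      # LMUL' = 2
```

Both calls reduce the same eight source values, once in a single iteration and once in two iterations, and therefore produce the same scalar.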
In other embodiments, the vector reduction operation may also be applied to very wide SIMD data paths. For example, the data path length (DLEN) may be 2048 bits, and the number of lanes may be equal to 2048/64=32. In the present embodiment, the number of iterations of the Lanes Reduction State of the vector reduction operation is 5. In other words, compared to reducing 4 lanes to 1 lane in
Based on the above, the vector processor of the invention may execute different steps in the reduction operation with the same circuit based on the state parameters, thereby saving circuit area and improving reduction operation performance. Moreover, the vector processor may perform the vector reduction operation and the element reduction operation with the same circuit structure, so as to further save circuit area. Moreover, in the invention, the number of iterations may also be flexibly adjusted based on the unit vector length multiplier to handle applications with larger data path lengths or vector register lengths, and a normal reduction operation or a fast reduction operation may be implemented when the element length is less than the length of a single lane for flexible design based on actual needs, so as to optimize the hardware performance index or software performance index.
Although the invention has been described with reference to the above embodiments, it will be apparent to one of ordinary skill in the art that modifications to the described embodiments may be made without departing from the spirit of the disclosure. Accordingly, the scope of the disclosure is defined by the attached claims and not by the above detailed descriptions.