The present invention relates to a processor and an arithmetic processing device including the processor.
As an arithmetic processing device, an SIMD (single instruction multiple data) parallel arithmetic processing device has been known that applies a single instruction to a plurality of data columns and processes them in parallel. For example, Japanese Unexamined Patent Application Publication No. H11-296498 discloses a technology for executing reduction operations of a plurality of arithmetic units.
According to an embodiment of the present invention, a processor comprising: a plurality of arithmetic and logic units configured to operate in parallel with one another; and a first reduction circuit including a first adder configured to simultaneously add together a plurality of arithmetic results output from the plurality of arithmetic and logic units is provided.
According to an embodiment of the present invention, an arithmetic processing device comprising: a plurality of processors, each of the plurality of processors including a plurality of arithmetic and logic units configured to operate in parallel with one another and a first reduction circuit including a first adder configured to simultaneously add together a plurality of arithmetic results output from the plurality of arithmetic and logic units is provided.
According to an embodiment of the present invention, there is provided an arithmetic processing method including: performing a plurality of arithmetic operations and/or logic operations in parallel; and simultaneously adding together arithmetic results of the plurality of arithmetic operations and/or logic operations, wherein when the number of the arithmetic results is 2^n (where n is an integer of 2 or greater), simultaneously adding together the arithmetic results is calculating 2^(n−1) addition results by adding together the 2^n arithmetic results and repeating addition until n−1 becomes 0.
An embodiment of the present invention is described in detail below with reference to the drawings. The embodiment to be described hereinafter is an example of an embodiment of the present invention, and the present invention is not limited to the embodiment. It should be noted that in the drawings that are referred to in the present embodiment, identical components or components having similar functions are given identical or similar signs, and a repeated description of them may be omitted.
The technology disclosed in Japanese Unexamined Patent Application Publication No. H11-296498, in which reduction operations of a plurality of arithmetic units are performed in sequence for each arithmetic unit, has been undesirably unable to perform reduction operations successively.
According to an embodiment to be described below, there is provided an arithmetic processing device that can simultaneously perform reduction operations of a plurality of arithmetic units.
The CPU interface 101 controls a connection between a CPU (not illustrated) and the arithmetic sections 103. Specifically, the CPU interface 101 receives, from the CPU, a program and data that indicate a procedure, and transmits the program and the data to the plurality of arithmetic sections 103.
The plurality of arithmetic sections 103 perform processing of data on the basis of the program and data received from the CPU via the CPU interface 101. A configuration of each arithmetic section 103 will be described later.
The memory section 105 includes an arbitration circuit (which corresponds to the after-mentioned arbitration circuit 221) and a memory (which corresponds to the after-mentioned memory 223). The memory is constituted of a RAM and retains arithmetic results yielded by the arithmetic sections 103.
As shown in
Each of the a ALUs of each of the processors #0 to #p-1 includes a multiplier, an adder, a register, a shifter, a saturator, and the like and performs an arithmetic operation and/or a logic operation. Of the a ALUs of each of the processors #0 to #p-1, the ALU #0 is hereinafter referred to as “first-stage ALU” and the ALU #a-1 is hereinafter referred to as “final-stage ALU”. The a ALUs of each processor are configured to operate in parallel with one another, and arithmetic results yielded by each separate ALU are simultaneously output in synchronization with a clock signal.
The first reduction circuit 201 is configured to reduce a plurality of arithmetic results output from the a ALUs. The first reduction circuit 201 includes an adder 203 (hereinafter referred to as “first adder 203”) that is configured to simultaneously add together a plurality of arithmetic results output from the a ALUs. That is, the first adder 203 is configured to simultaneously add up a arithmetic results respectively output from the a ALUs. The a ALUs may simultaneously output arithmetic results, respectively.
As shown in
Next, the arithmetic results obtained in S1 are read out from the data register in synchronization with an (n+1)th clock signal and added together (S2). Specifically, the arithmetic result generated by adding the arithmetic results from the ALUs #0 and #1 and the arithmetic result generated by adding the arithmetic results from the ALUs #2 and #3 are added together. At the same time, the arithmetic result generated by adding the arithmetic results from the ALUs #4 and #5 and the arithmetic result generated by adding the arithmetic results from the ALUs #6 and #7 are added together. The arithmetic results in S1 generated from the arithmetic results yielded by the ALU #8 and the subsequent ALUs are added together in a similar way. That is, in S2, the 32 arithmetic results obtained in S1 are simultaneously added together two by two in one clock cycle, so that sixteen arithmetic results are generated. The sixteen arithmetic results generated in S2 are temporarily stored in a data register (DR).
Next, the arithmetic results obtained in S2 are read out from the data register in synchronization with an (n+2)th clock signal and added together (S3). Specifically, the arithmetic result generated by adding the arithmetic results from the ALUs #0 to #3 and the arithmetic result generated by adding the arithmetic results from the ALUs #4 to #7 are added together. At the same time, the arithmetic result generated by adding the arithmetic results from the ALUs #8 to #11 and the arithmetic result generated by adding the arithmetic results from the ALUs #12 to #15 are added together. The arithmetic results generated from the arithmetic results yielded by the ALU #16 and the subsequent ALUs are added together in a similar way. That is, in S3, the sixteen arithmetic results obtained in S2 are simultaneously added together two by two in one clock cycle, so that eight arithmetic results are generated. The eight arithmetic results generated in S3 are temporarily stored in a data register (DR).
Next, the arithmetic results obtained in S3 are read out from the data register in synchronization with an (n+3)th clock signal and added together (S4). Specifically, the arithmetic result generated by adding the arithmetic results from the ALUs #0 to #7 and the arithmetic result generated by adding the arithmetic results from the ALUs #8 to #15 are added together. At the same time, the arithmetic result generated by adding the arithmetic results from the ALUs #16 to #23 and the arithmetic result generated by adding the arithmetic results from the ALUs #24 to #31 are added together. The arithmetic results generated from the arithmetic results yielded by the ALU #32 and the subsequent ALUs are added together in a similar way. That is, in S4, the eight arithmetic results obtained in S3 are simultaneously added together two by two in one clock cycle, so that four arithmetic results are generated. The four arithmetic results generated in S4 are temporarily stored in a data register (DR).
Next, the arithmetic results obtained in S4 are read out from the data register in synchronization with an (n+4)th clock signal and added together (S5). Specifically, the arithmetic result generated by adding the arithmetic results from the ALUs #0 to #15 and the arithmetic result generated by adding the arithmetic results from the ALUs #16 to #31 are added together. At the same time, the arithmetic result generated by adding the arithmetic results from the ALUs #32 to #47 and the arithmetic result generated by adding the arithmetic results from the ALUs #48 to #63 are added together. That is, in S5, the four arithmetic results obtained in S4 are simultaneously added together two by two in one clock cycle, so that two arithmetic results are generated. The two arithmetic results generated in S5 are temporarily stored in the data register (DR).
Next, the arithmetic results obtained in S5 are read out from the data register in synchronization with an (n+5)th clock signal and added together (S6). Specifically, the arithmetic result generated by adding the arithmetic results from the ALUs #0 to #31 and the arithmetic result generated by adding the arithmetic results from the ALUs #32 to #63 are added together. An arithmetic result generated in S6 is temporarily stored in a data register (DR), and is simultaneously output in synchronization with an (n+6)th clock signal.
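The two-by-two addition in steps S1 to S6 can be illustrated by the following minimal software sketch. It is an illustration only and not the circuit of the embodiment: the function name `tree_reduce` and the sample input values are hypothetical, and the "simultaneous" additions of each step are modeled as one pass over a list.

```python
def tree_reduce(values):
    """Pairwise reduction: returns (total, number of results left after each step)."""
    count = len(values)
    if count == 0 or count & (count - 1):
        raise ValueError("the number of values must be a power of two")
    level = list(values)
    stage_sizes = []
    while len(level) > 1:
        # One step (S1, S2, ...): every adjacent pair is added; in hardware
        # all of these additions would happen in the same clock cycle.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stage_sizes.append(len(level))
    return level[0], stage_sizes


# Hypothetical arithmetic results of ALUs #0 to #63.
alu_outputs = list(range(64))
total, stage_sizes = tree_reduce(alu_outputs)
print(total)        # 2016
print(stage_sizes)  # [32, 16, 8, 4, 2, 1]
```

With 64 inputs the sketch performs six steps, matching the counts of 32, 16, 8, 4, 2, and 1 results described for S1 to S6.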
As mentioned above, the first adder 203 is configured to simultaneously and successively add up a plurality of arithmetic results, output from the plurality of ALUs, whose number corresponds to the number of ALUs. This makes it possible to perform successive reduction operations unlike the conventional technology, in which reduction operations of a plurality of ALUs (arithmetic units) are performed in sequence one by one for each ALU. That is, this makes it possible to perform reduction operations by pipeline processing.
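The pipeline behavior described above can be sketched in software as follows. This is a simplified model under stated assumptions, not the embodiment itself: the names `pairwise` and `pipelined_reduce`, the default width of 8, and the sample batches are all hypothetical, and each list in `regs` stands in for the data register (DR) that follows one addition step.

```python
def pairwise(values):
    """Add adjacent values two by two, halving the count (one addition step)."""
    return [values[i] + values[i + 1] for i in range(0, len(values), 2)]


def pipelined_reduce(batches, width=8):
    """Feed one batch of `width` values per clock; return [(clock, total), ...]."""
    depth = width.bit_length() - 1          # log2(width) addition steps
    if width < 2 or (1 << depth) != width:
        raise ValueError("width must be a power of two and at least 2")
    if any(len(batch) != width for batch in batches):
        raise ValueError("every batch must contain `width` values")
    regs = [None] * depth                   # regs[k]: register after step k+1
    pending = list(batches)
    outputs = []
    clock = 0
    while pending or any(r is not None for r in regs):
        if regs[-1] is not None:            # final register holds one total
            outputs.append((clock, regs[-1][0]))
        # Update from the last step backwards so that each step consumes
        # the value its predecessor stored on the previous clock.
        for k in range(depth - 1, 0, -1):
            regs[k] = None if regs[k - 1] is None else pairwise(regs[k - 1])
        new_batch = pending.pop(0) if pending else None
        regs[0] = None if new_batch is None else pairwise(new_batch)
        clock += 1
    return outputs


results = pipelined_reduce([[1] * 8, [2] * 8, list(range(8))])
print(results)  # [(3, 8), (4, 16), (5, 28)]
```

In this model the first total appears after as many clocks as there are addition steps, and one further total then appears on every clock, which is the successive operation that sequential per-ALU reduction cannot provide.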
It should be noted that the numbers of steps of addition that the first adder 203 executes and clocks that are needed for addition are not limited by the aspect described with reference to
With continued reference to
The first shifter 205 receives an arithmetic result output from the first adder 203 and performs a shift operation on the arithmetic result from the first adder 203 thus received. The arithmetic result shifted by the first shifter 205 may be transmitted to the first rounder 207.
The first rounder 207 performs, on the arithmetic result thus shifted, a rounding process such as nearest neighbor rounding, rounding in a 0 direction, rounding to +∞, or rounding to −∞. The arithmetic result subjected to the rounding process may be transmitted to the first saturator 209. The first saturator 209 performs a saturation process on the arithmetic result subjected to the rounding process thus received.
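The shift, rounding, and saturation processes can be sketched as below. The sketch rests on assumptions that the text does not state: the shift is taken to be an arithmetic right shift, ties in nearest-neighbor rounding are resolved toward +∞, and the output is taken to be a signed 16-bit value. The names `shift_round`, `saturate`, and `post_process` are hypothetical.

```python
def shift_round(value, shift, mode="nearest"):
    """Right-shift `value` by `shift` bits, rounding the discarded bits."""
    if shift <= 0:
        return value
    floor_q = value >> shift             # Python's >> rounds toward -infinity
    rem = value - (floor_q << shift)     # discarded part, 0 <= rem < 2**shift
    if mode == "floor":                  # rounding to -infinity
        return floor_q
    if mode == "ceil":                   # rounding to +infinity
        return floor_q + (1 if rem else 0)
    if mode == "zero":                   # rounding in the 0 direction
        return floor_q + (1 if rem and value < 0 else 0)
    if mode == "nearest":                # ties go toward +infinity (assumption)
        return floor_q + (1 if rem >= (1 << (shift - 1)) else 0)
    raise ValueError(f"unknown rounding mode: {mode}")


def saturate(value, bits=16):
    """Clamp `value` to the range of a signed `bits`-bit integer."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def post_process(value, shift, mode="nearest", bits=16):
    """Shifter, then rounder, then saturator, in the order described above."""
    return saturate(shift_round(value, shift, mode), bits)


print(post_process(100000, 1))       # 32767 (50000 exceeds the 16-bit range)
print(post_process(-5, 1, "zero"))   # -2
```

Splitting the work into `shift_round` and `saturate` mirrors the point that any of the shifter, rounder, and saturator may be omitted.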
Arithmetic results obtained in the first reduction circuits 201 of the processors #0 to #p-1 are simultaneously output from the processors #0 to #p-1, respectively, in synchronization with a clock signal. In a case where the first shifter 205, the first rounder 207, and the first saturator 209 are omitted from each of the processors #0 to #p-1, the arithmetic result yielded by the first adder 203 is output from each of the processors #0 to #p-1 as its arithmetic result. In a case where the first reduction circuit 201 of each of the processors #0 to #p-1 includes the first shifter 205, the first rounder 207, and/or the first saturator 209, an arithmetic result yielded by the first shifter 205, the first rounder 207, or the first saturator 209 may be output from each of the processors #0 to #p-1 as its arithmetic result.
Further, the arithmetic result obtained in the first reduction circuit 201 may be transmitted to ALUs of the corresponding processor as needed.
Arithmetic results that are respectively output from the processors #0 to #p-1 are transmitted to the memory section 220. In so doing, the arithmetic results that are respectively output from the processors #0 to #p-1 are transmitted to the memory section 220 through the second reduction circuit 211, which receives arithmetic results that are respectively output from the p processors #0 to #p-1 and reduces the arithmetic results thus received. Alternatively, the arithmetic results that are respectively output from the processors #0 to #p-1 may be transmitted to the memory section 220 without passing through the second reduction circuit 211.
The second reduction circuit 211 may include a shifter 215 (hereinafter referred to as “second shifter 215”) that is configured to shift the arithmetic result yielded by the second adder 213, a rounder 217 (hereinafter referred to as “second rounder 217”) that is configured to perform a rounding process on the arithmetic result thus shifted, and a saturator 219 (hereinafter referred to as “second saturator 219”) that is configured to perform a saturation process on the arithmetic result subjected to the rounding process.
The second shifter 215 receives the arithmetic result output from the second adder 213 and performs a shift operation on the arithmetic result from the second adder 213 thus received. The arithmetic result shifted by the second shifter 215 may be transmitted to the second rounder 217.
The second rounder 217 performs, on the arithmetic result thus shifted, a rounding process such as nearest neighbor rounding, rounding in a 0 direction, rounding to +∞, or rounding to −∞. The arithmetic result subjected to the rounding process may be transmitted to the second saturator 219. The second saturator 219 performs a saturation process on the arithmetic result subjected to the rounding process thus received.
The arithmetic result obtained in the second reduction circuit 211 is transmitted to the arbitration circuit 221 of the memory section 220 and transmitted to the memory 223 through the arbitration circuit 221. In a case where the second shifter 215, the second rounder 217, and the second saturator 219 are omitted, the second reduction circuit 211 outputs, as its arithmetic result, the arithmetic result yielded by the second adder 213. In a case where the second reduction circuit 211 includes the second shifter 215, the second rounder 217, and/or the second saturator 219, the second reduction circuit 211 may output, as its arithmetic result, an arithmetic result yielded by the second shifter 215, the second rounder 217, or the second saturator 219.
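The two levels of reduction can be sketched as follows. The sketch assumes that the second adder 213 sums the p per-processor results in the same way that the first adder 203 sums the a ALU results; the function names, the flag `use_second_reduction`, and the sample data are hypothetical, and shifting, rounding, and saturation are omitted.

```python
def first_reduction(alu_results):
    """Per-processor reduction: add up the results of that processor's ALUs."""
    return sum(alu_results)


def second_reduction(processor_results):
    """Cross-processor reduction: add up the p per-processor results."""
    return sum(processor_results)


def device_reduce(alu_results_per_processor, use_second_reduction=True):
    """Return the list of values that would be sent on to the memory section."""
    per_processor = [first_reduction(r) for r in alu_results_per_processor]
    if use_second_reduction:
        return [second_reduction(per_processor)]
    # Bypass path: per-processor results are passed on without being combined.
    return per_processor


# Hypothetical data: p = 4 processors with a = 8 ALUs each.
data = [[i] * 8 for i in range(4)]
print(device_reduce(data))                              # [48]
print(device_reduce(data, use_second_reduction=False))  # [0, 8, 16, 24]
```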
Although the foregoing has described, with reference to
It should be noted that the arbitration circuit 221 may acquire arithmetic results retained in the memory 223 and transmit the arithmetic results thus acquired to the processors #0 to #p-1 in sequence.
It should be noted that the present invention is not limited to the embodiment described above but may be altered as appropriate without departing from the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
2016-234306 | Dec 2016 | JP | national |
This application is a U.S. continuation application filed under 35 U.S.C. § 111(a), of International Application No. PCT/JP2017/042227, filed on Nov. 24, 2017, which claims priority to Japanese Patent Application No. 2016-234306, filed on Dec. 1, 2016, the disclosures of which are incorporated by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2017/042227 | Nov 2017 | US |
Child | 16427992 | | US |