This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2017-182797, filed on Sep. 22, 2017, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an arithmetic processing device and an arithmetic processing method.
Among machine learning methods using artificial intelligence, the demand for deep learning in particular has been increasing. In deep learning, various operations including multiplication, product-sum operations, and vector multiplication are executed. Meanwhile, deep learning does not require each individual operation to be as accurate as in ordinary arithmetic processing. For example, in ordinary arithmetic processing, a programmer develops a computer program so that overflow occurs as rarely as possible. In deep learning, by contrast, it may be acceptable for large values to be saturated to some extent. This is because, in deep learning, adjusting a coefficient (weight) during a convolution operation over a plurality of input data pieces is the main processing, and thus, in many cases, extreme data among the input data is not emphasized. Also, since a large number of data pieces are repeatedly used to adjust a coefficient, when the digits are adjusted according to the progress of learning, data that was once saturated may later be reflected in the coefficient adjustment without being saturated.
Therefore, in consideration of such characteristics of deep learning, in order to reduce the chip area of an arithmetic processing device for deep learning and improve the power consumption performance, using an integer operation by a fixed-point number without using a floating-point number may be taken into consideration. This is because a circuit configuration may be more simplified by an integer operation by a fixed-point number, as compared to a floating-point number operation.
However, the fixed-point number has a narrow dynamic range of possible values, and thus the operational accuracy of the fixed-point number may be deteriorated as compared to that of the floating-point number. Accordingly, even in deep learning, consideration must be given to an accuracy that allows values from the smallest to the largest to be expressed, that is, to the valid digits. Thus, techniques that extend the fixed-point number have been suggested.
For example, in a processing by a mixed fixed point, a unified decimal point position is not used for a program in its entirety, but a decimal point position (Q format) suitable for each variable is used. For example, a Q3.12 format defines data of 16 bits including 1 digit for a sign bit, 3 digits for an integer part, and 12 digits below a decimal point. In the mixed fixed point, it is possible to perform a processing by varying a decimal point position for each variable, that is, digits of an integer part and digits below a decimal point.
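As an illustration of the Q3.12 format described above, the following minimal sketch converts between real values and 16-bit fixed-point integers. The function names `to_q` and `from_q` are hypothetical, and clamping out-of-range values to the representable range is an assumption made for the example, not something the Q format itself prescribes.

```python
def to_q(value: float, frac_bits: int = 12, total_bits: int = 16) -> int:
    """Quantize a real number to a signed fixed-point integer.

    With the defaults this is Q3.12: 1 sign bit, 3 integer bits, 12 fraction bits.
    Out-of-range values are clamped (an assumption of this sketch).
    """
    scaled = round(value * (1 << frac_bits))
    hi = (1 << (total_bits - 1)) - 1
    lo = -(1 << (total_bits - 1))
    return max(lo, min(hi, scaled))


def from_q(raw: int, frac_bits: int = 12) -> float:
    """Convert a fixed-point integer back to a real number."""
    return raw / (1 << frac_bits)
```

In a mixed fixed-point program, each variable would carry its own `frac_bits` value, so that the same 16-bit storage covers different value ranges for different variables.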
In another example, in processing by a dynamic fixed point (a dynamic fixed-point number), the value range of a variable is acquired during execution, and the decimal point position is reviewed at a fixed timing. Accordingly, it may be said that the mixed fixed-point operation and the dynamic fixed-point operation add aspects of the floating-point operation to the fixed-point operation, which allows simpler processing than the floating-point operation.
Also, there has been proposed a digital signal processor (DSP) that has a function for a program for executing a processing by a mixed fixed-point operation and a dynamic fixed-point operation. For example, there is a DSP that executes a block-shift designation operation instruction. In the block-shift designation operation instruction, an operation is performed with a bit width larger than a bit width of a variable, and a value from an operation result is extracted by shifting, and stored in a register for the variable. By this instruction, the shift amount S (e.g., −128 to 127) when the value is extracted from the operation result may be designated by an immediate value/general-purpose register. For example, when the DSP executes an instruction of Result=Saturate(((in1(operator)in2)>>S), 16), the operation result is shifted by S bits, and higher-order bits are saturated while lower-order 16 bits are left. When S≥0, the DSP arithmetically shifts the operation result to the right (that is, embeds a sign bit and shifts the operation result to the right), and then, removes the lower-order bits. Meanwhile, when S<0, the DSP arithmetically shifts the operation result to the left (e.g., maintains a sign bit, and shifts the operation result to the left), and removes the lower-order bits in the complement.
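The behavior of Result=Saturate(((in1(operator)in2)>>S), 16) may be modeled roughly as follows. This is a sketch under assumptions: multiplication is chosen as the operator, Python's unbounded integers stand in for the wider intermediate bit width, and the names `saturate` and `block_shift` are hypothetical.

```python
def saturate(value: int, bits: int = 16) -> int:
    """Clamp a signed integer to the range of a two's-complement word of `bits` bits."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def block_shift(in1: int, in2: int, s: int, bits: int = 16) -> int:
    """Multiply, shift by S (right if S >= 0, left if S < 0), then saturate."""
    wide = in1 * in2  # intermediate result held at a larger width
    # Python's >> on a negative int is an arithmetic shift (the sign is preserved).
    shifted = wide >> s if s >= 0 else wide << -s
    return saturate(shifted, bits)
```

For example, `block_shift(1000, 1000, 8)` keeps the product within 16 bits by discarding eight low-order bits, whereas `block_shift(1000, 1000, 0)` saturates to the largest 16-bit value.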
Related technologies are disclosed in, for example, Japanese Laid-Open Patent Publication No. 07-084975.
According to an aspect of the invention, an arithmetic processing device includes a memory, and a processor coupled to the memory and the processor configured to calculate input data of an operation target so as to obtain data of an operation result, generate statistical information data for indicating a bit distribution in the data of the operation result, extract attention area data with a first predetermined size from the statistical information data, based on specified position information, generate higher-order side summary data obtained by summarizing higher-order side data of the statistical information data except the attention area data into a second predetermined size, and generate lower-order side summary data obtained by summarizing lower-order side data of the statistical information data except the attention area data into a third predetermined size.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
In the related art, a scheme is suggested to reduce an operational accuracy deterioration of a fixed-point operation. However, in the related art, an arithmetic processing device that performs an operation was not able to efficiently acquire determination materials used for reducing the operational accuracy deterioration of the fixed-point operation.
Hereinafter, descriptions will be made on an embodiment of a technology capable of improving an accuracy of a fixed-point number, and reducing a physical quantity and a power consumption of a circuit, with reference to the drawings.
In the present embodiment, a processor 10 of an information processing apparatus acquires statistical information related to a distribution of numerical values as an operation execution result and provides the statistical information to, for example, an application program. Here, the statistical information related to the distribution of numerical values refers to, for example, any one of the following (1) to (4), or a combination thereof. The application program executed by the information processing apparatus acquires the statistical information from the processor, thereby optimizing a decimal point position. According to the processing of the application program, the processor executes instructions for a dynamic fixed-point operation or a mixed fixed-point operation.
(1) Distribution of Unsigned Most Significant Bit Positions
The information processing apparatus may obtain a distribution of unsigned most significant bit positions during learning execution so as to immediately determine a proper shift amount in a dynamic fixed-point operation or a mixed fixed-point operation, that is, a proper fixed decimal point position. For example, the information processing apparatus may determine the fixed decimal point position such that the ratio of data to be saturated becomes a specified ratio or less. That is, in an example, the information processing apparatus may determine the fixed decimal point position by prioritizing the condition that data saturates only to a predetermined extent over the condition that data underflows only to a predetermined extent.
The distribution of the unsigned most significant bit positions is accumulated in a predetermined register (also referred to as a statistical information register) within the processor of the information processing apparatus. The processor executes commands such as reading and writing of distribution data from/to the corresponding statistical information register, and clearing of the statistical information register. Thus, the statistical information register accumulates distribution data on the one or more fixed-point numbers that have been command execution targets from the execution of the previous clear command to the present time. The accumulated distribution data is read to a memory by a read command. Instead of the clear command, the processor may execute a command that loads a value into the statistical information register, so that the value 0 is loaded into the statistical information register.
(2) Distribution of Unsigned Least Significant Bit Positions
The distribution of unsigned least significant bit positions is a distribution of the least significant bit positions at which a bit has a value different from the sign bit. For example, when bits are arranged from the most significant bit, bit[39], to the least significant bit, bit[0], the least significant bit position is the bit position having the smallest index k among the bits bit[k] that differ from the sign bit, bit[39]. From the distribution of the unsigned least significant bit positions, the least significant bits included in valid data are identified.
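The following sketch shows one way to compute, in software, the unsigned least significant bit position defined above, together with the analogous most significant position (the highest bit that differs from the sign bit). The 40-bit width follows the bit[39] to bit[0] example; the function names and the use of `None` for the case where no bit differs from the sign bit are assumptions of this sketch.

```python
WIDTH = 40  # bit[39] is the sign bit, as in the example above


def sign_diff_bits(value: int, width: int = WIDTH) -> int:
    """Return a mask with 1 at every bit below the sign bit that differs from the sign bit."""
    raw = value & ((1 << width) - 1)       # two's-complement bit pattern
    sign = (raw >> (width - 1)) & 1
    low_mask = (1 << (width - 1)) - 1      # bits 0 .. width-2
    return (raw ^ (low_mask if sign else 0)) & low_mask


def unsigned_msb_position(value: int, width: int = WIDTH):
    """Highest bit index that differs from the sign bit, or None if there is none."""
    diff = sign_diff_bits(value, width)
    return diff.bit_length() - 1 if diff else None


def unsigned_lsb_position(value: int, width: int = WIDTH):
    """Lowest bit index that differs from the sign bit, or None if there is none."""
    diff = sign_diff_bits(value, width)
    return (diff & -diff).bit_length() - 1 if diff else None
```

For the value 20 (binary 10100), the two positions are 4 and 2; for −4, whose low two bits are 0 while all other bits equal the sign bit, they are 1 and 0.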
(3) Maximum Value of Unsigned Most Significant Bit Positions
A maximum value of unsigned most significant bit positions is the maximum value among most significant bit positions having different values from a sign bit, with respect to one or more fixed point numbers that become command execution targets from the time of execution of a previous-time clear command to the present time. The information processing apparatus may use the maximum value of the unsigned most significant bit positions in determining a proper shift amount in a dynamic fixed-point operation, that is, a proper decimal point position.
The processor executes commands such as reading of the maximum value from the statistical information register, and clearing of the statistical information register. Therefore, in the statistical information register, maximum values from the execution of the previous-time clear command to the present time are accumulated, and the maximum values are read to the memory by a read command.
(4) Minimum Value of Unsigned Least Significant Bit Positions
A minimum value of unsigned least significant bit positions is the minimum value among least significant bit positions having different values from a sign bit, with respect to one or more fixed point numbers from the time of execution of a previous-time clear command to the present time. The information processing apparatus may use the minimum value of the unsigned least significant bit positions in determining a proper shift amount in a dynamic fixed-point operation, that is, a proper decimal point position.
The processor 10 executes commands such as reading and clearing of the minimum value from the statistical information register. Accordingly, in the statistical information register, the minimum values from the execution of the previous-time clear command to the present time are accumulated, and then, are read to the memory by a read command.
(Procedure 1) The information processing apparatus acquires statistical information with a current bit accuracy, and creates any one of the above histograms (1) to (4). For the statistical information in the case of (3) and (4), an OR operation is performed on data of a flag string indicating the collected unsigned bit positions (most significant bit positions, least significant bit positions) to create a frequency distribution with a maximum frequency of 1.
(Procedure 2) The information processing apparatus calculates, with respect to the above statistical information (1), a bit accuracy at which the ratio of the number of overflowing data pieces to the total number of data pieces in the histogram becomes a threshold rmax. Alternatively, the information processing apparatus calculates, with respect to the above statistical information (2), a bit accuracy at which the ratio of the number of underflowing data pieces to the total number of data pieces in the histogram becomes a threshold rmax. In the case of the above statistical information (3) and (4), a bit accuracy is calculated by setting the threshold rmax to 0. That is, the bit accuracy is updated to match the maximum value (minimum value) of the unsigned most significant (least significant) bit positions.
(Procedure 3) An operation in the next period is performed with the calculated bit accuracy.
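Procedure 2 for the statistical information (1) may be sketched as below. The function name `max_bit_for_overflow_ratio` is hypothetical, and the sketch only returns the highest bit position that should remain representable so that the overflow ratio does not exceed rmax; how that position is translated into a decimal point position is left open, since it depends on the data format in use.

```python
def max_bit_for_overflow_ratio(hist: list[int], rmax: float) -> int:
    """Return the highest bit position to keep representable.

    hist[k] is the number of samples whose unsigned most significant bit
    position is k. Samples with a position above the returned value would
    overflow, and their share of the total does not exceed rmax.
    """
    total = sum(hist)
    if total == 0:
        return 0
    overflow = 0
    # Walk down from the top bit, tracking how many samples sit above the candidate.
    for pos in range(len(hist) - 1, -1, -1):
        if (overflow + hist[pos]) / total > rmax:
            return pos
        overflow += hist[pos]
    return 0
```

With rmax set to 0, the function returns the maximum occupied bit position, which corresponds to the handling described for the statistical information based on maximum values.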
The processor 10Z includes a program counter (PC) 111Z, a decoder 112Z, a register file 12Z, a vector operation arithmetic unit 131Z, a scalar operation arithmetic unit (arithmetic logic unit (ALU)) 141Z, and an accumulator 132Z that adds the result of the vector operation arithmetic unit 131Z. Further, the processor 10Z includes a plurality of selectors 101Z that select operation results of, for example, the vector operation arithmetic unit 131Z, the scalar operation arithmetic unit 141Z, and the accumulator 132Z and read results from the data memory 22Z. In the drawing, a plurality of selectors are collectively referred to as the selector 101Z. A plurality of vector operation arithmetic units are collectively referred to as the arithmetic unit 131Z.
The processor 10Z includes a statistical information acquisition unit 102Z that acquires statistical information from data selected by the selector 101Z, and a statistical information storage 105Z that stores the statistical information acquired by the statistical information acquisition unit 102Z. In the drawing, a plurality of statistical information acquisition units are collectively referred to as the statistical information acquisition unit 102Z.
The processor 10Z includes a data converter 103Z that changes a fixed decimal point position of data selected by the selector 101Z. In the drawing, a plurality of data converters are collectively referred to as the data converter 103Z.
Referring to the drawing, an instruction is fetched from an address of the instruction memory 21Z indicated by the program counter 111Z, and the decoder 112Z decodes the fetched instruction. In the drawing, an instruction fetching controller that executes fetching of an instruction is omitted.
When the decoder 112Z decodes the instruction, respective units of the processor 10Z are controlled according to the decoded result. For example, when the decoded result is a vector operation instruction, data of a vector register of the register file 12Z is input to the vector operation arithmetic unit 131Z, and a vector operation is executed. The operation result of the vector operation arithmetic unit 131Z is supplied to the statistical information acquisition unit 102Z and the data converter 103Z through the selector 101Z. The operation result of the vector operation arithmetic unit 131Z is also input to the accumulator 132Z, where the operation results of the vector operation arithmetic unit 131Z are added, for example, in a cascade. The operation result of the accumulator 132Z is supplied to the statistical information acquisition unit 102Z and the data converter 103Z through the selector 101Z.
For example, when, as a result of the decoding, the instruction is a scalar operation instruction, data of a scalar register of the register file 12Z is input to the scalar operation arithmetic unit 141Z. The operation result of the arithmetic unit 141Z is supplied to the statistical information acquisition unit 102Z and the data converter 103Z through the selector 101Z as in the operation result of the accumulator 132Z.
For example, when, as a result of decoding, the instruction is a load instruction, data is read from the data memory 22Z, and supplied to the statistical information acquisition unit 102Z and the data converter 103Z through the selector 101Z. The result of data conversion in the data converter 103Z is stored in the register of the register file 12Z.
When, as a result of decoding, the instruction is an instruction to execute a dynamic fixed-point operation, the decoder 112Z instructs a shift amount to be supplied to the data converter 103Z. The shift amount is acquired from, for example, an operand (immediate value) of an instruction, a register specified by an operand, and the data memory 22Z of an address indicated by an address register specified by an operand, etc., and is supplied to the data converter 103Z.
The data converter 103Z shifts fixed point number data obtained from, for example, the result of the vector operation, the result of the scalar operation, the operation result of the accumulator 132Z, or the result read from the data memory 22Z, by a specified shift amount S. The data converter 103Z executes not only shifting, but also a saturation processing of higher-order bits and a rounding of lower-order bits.
The rounding processor rounds the lower-order S bits as a decimal part. When S is negative, the rounding processor does not perform anything. As for the rounding, for example, rounding to nearest, rounding to 0, rounding to positive infinity, rounding to negative infinity, and random number rounding are exemplified. In the drawing, the shift amount is a shift amount acquired from the instruction by the decoder 112Z, for example, as illustrated in
Then, the data converter 103Z maintains the sign of the higher-order bit at the time of the left shift, and performs a saturation processing on bits other than the sign bit. That is, the data converter 103Z discards a higher-order bit, and embeds 0 into a lower-order bit. At the time of the right shift, the data converter 103Z embeds the sign bit into a higher-order bit (a bit at a lower order than the sign bit). Then, the data converter 103Z outputs the data obtained as described above through rounding, shifting, and saturation processing, with the same bit width (e.g., a register of 16 bits) as, for example, a register of the register file 12Z.
Accordingly, a computer program executed by the processor 10Z specifies a shift amount in an operand of an instruction that executes a dynamic fixed-point operation so that the processor 10Z updates a decimal point position of a fixed-point number by the specified shift amount during program execution.
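The rounding, shifting, and saturation performed by the data converter 103Z may be modeled in software as follows. This is a sketch under assumptions: the function name and the mode strings are hypothetical, ties in rounding to nearest are rounded upward, and random number rounding is omitted.

```python
def shift_round_saturate(value: int, s: int, mode: str = "nearest", bits: int = 16) -> int:
    """Shift `value` by S with rounding of the discarded bits, then saturate to `bits` bits."""
    if s > 0:
        floor_q = value >> s            # arithmetic right shift rounds toward negative infinity
        rem = value - (floor_q << s)    # the discarded lower-order S bits, 0 <= rem < 2**s
        if mode == "neg_inf":
            q = floor_q
        elif mode == "pos_inf":
            q = floor_q + (1 if rem else 0)
        elif mode == "zero":
            q = floor_q + (1 if rem and value < 0 else 0)
        elif mode == "nearest":
            q = floor_q + (1 if rem >= (1 << (s - 1)) else 0)  # ties round upward (assumption)
        else:
            raise ValueError(f"unknown rounding mode: {mode}")
    else:
        q = value << -s                 # zero or negative S: nothing to round
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, q))
```

A program that tracks a per-variable shift amount could call such a routine after every operation, passing the shift amount that the most recent statistical information suggests.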
As a result of decoding, when the instruction is an instruction to acquire statistical information (referred to as an instruction with a statistical information acquisition function), the statistical information is acquired by the statistical information acquisition unit 102Z, and is stored in the statistical information storage 105Z. Here, the statistical information, as described above, is (1) a distribution of unsigned most significant bit positions, (2) a distribution of unsigned least significant bit positions, (3) a maximum value of unsigned most significant bit positions, (4) a minimum value of unsigned least significant bit positions, or a combination thereof.
Hereinafter, among the statistical information acquisition units 102Z, a unit that acquires an unsigned most significant bit position will be referred to as a statistical information acquisition unit 102A. Among the statistical information acquisition units 102Z, a unit that acquires an unsigned least significant bit position will be referred to as a statistical information acquisition unit 102B. Among the statistical information aggregating units 104Z, a unit that counts bit positions acquired by the statistical information acquisition unit 102A and acquires a bit distribution with respect to the bit positions will be referred to as a statistical information aggregating unit 104A. Among the statistical information aggregating units 104Z, a unit that performs an OR operation on bit positions acquired by the statistical information acquisition unit 102B in a previous stage for acquiring a maximum value and a minimum value of bit positions will be referred to as a statistical information aggregating unit 104B.
That is, in this circuit, exclusive OR (EXOR) between the sign bit in[39] and other bits (in[0] to in[38]) is executed. Then, the exclusive OR value by a bit having the same value as the sign bit in[39] is 0, and the exclusive OR value by a bit having a different value from the sign bit in[39] is 1.
Here, for example, when in[0] and in[39] have different values, out[0] of data output by exclusive OR is 1. Meanwhile, the exclusive OR value of in[39] and in[1] is input to out[1] of the output data via an AND gate. To one input of the AND gate, a bit value obtained by inverting the exclusive OR value of in[39] and in[0] is input. Thus, when the exclusive OR value of in[39] and in[0] is 1, regardless of the exclusive OR value of in[39] and in[1], the output of the AND gate is 0.
Similarly, the exclusive OR value of in[39] and in[2] is input to out[2] of output data, via the same AND gate as above. To one input of the AND gate, a bit value obtained by inverting the logical sum (output of an OR gate) of two exclusive OR values, that is, the exclusive OR value of in[39] and in[0] and the exclusive OR value of in[39] and in[1], is input. Thus, when the exclusive OR value of in[39] and in[0] is 1, regardless of the exclusive OR value of in[39] and in[2], the output of the AND gate that outputs a value to out[2] of the output data is 0. Hereinafter, similarly, regardless of the exclusive OR value of in[39] and in[i] (i is 1 or more), the output of the AND gate that outputs a value to out[i] of the output data is 0.
Meanwhile, for example, when in[0] and in[39] have the same value, out[0] of data output by exclusive OR is 0. Thus, an AND gate to which the exclusive OR value of in[39] and in[1] is input outputs 1 or 0 depending on the exclusive OR value of in[39] and in[1]. Hereinafter, similarly, an input with logical NOT of the AND gate, from which out[i] (i is 1 or more) is output, is 0 when all the exclusive ORs of in[39] and in[j] (j is 0 or more, and i−1 or less) are 0. When the exclusive OR value of in[39] and in[i] (i is 1 or more) is 1, 1 is set to out[i]. 0 is set to output data out[i] at the higher order than the corresponding bit. Therefore, by the circuit of
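The gate network described above (an exclusive OR with the sign bit for each input bit, followed by an AND gate that is masked by the OR of all lower-order exclusive OR outputs) may be emulated bit by bit as in the following sketch. The function name `lsb_one_hot` is hypothetical; the 40-bit width follows the in[39] to in[0] description.

```python
def lsb_one_hot(inp: int, width: int = 40) -> int:
    """Return a word with a single 1 at the lowest bit that differs from the sign bit.

    out[i] = (in[i] XOR in[width-1]) AND NOT (OR of the XOR outputs for j < i)
    """
    sign = (inp >> (width - 1)) & 1
    out = 0
    seen = 0  # OR of the XOR outputs at all lower positions
    for i in range(width - 1):
        x = ((inp >> i) & 1) ^ sign
        out |= (x & (seen ^ 1)) << i
        seen |= x
    return out
```

When every bit equals the sign bit, all exclusive OR outputs are 0 and the returned word is 0.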
That is, input data is exemplified as array data of 8 (rows)×40 (bits). The input data of 40 bits in each row is data of an unsigned most significant bit position (output of the statistical information acquisition unit 102A in
The input data may be set as an unsigned least significant bit position by the statistical information acquisition unit 102B (
In this processing, a result obtained through OR operations of each column in an array in[j][i] of input data, with respect to all the rows (j=0, . . . , 7), is input to 40-bit output data out[i] (i=0, . . . , 39). Accordingly, in the pseudo code of
Through the above configuration, the information processing apparatus of Comparative Example accumulates statistical information of each variable of each layer, in a register or a register file, for example, at the time of mini batch execution of deep learning. Then, the information processing apparatus in Comparative Example may update a decimal point position of each variable of each layer based on the accumulated statistical information. That is, the processor 10Z acquires statistical information of a bit distribution. Here, the statistical information includes, at the time of command execution, for example, (1) a distribution of unsigned most significant bit positions, (2) a distribution of unsigned least significant bit positions, (3) a maximum value of unsigned most significant bit positions, (4) a minimum value of unsigned least significant bit positions, or a combination thereof. Accordingly, when the information processing apparatus in Comparative Example executes deep learning, it is possible to implement a dynamic fixed-point operation in a practical time without overhead during the deep learning program for acquiring statistical information of data.
That is, the processor 10Z of the information processing apparatus of Comparative Example executes an instruction with a statistical information acquisition function, and executes an instruction to perform bit-shifting and rounding/saturation on an operation result and to store the result in a register. Accordingly, the information processing apparatus of Comparative Example may reduce the overhead for acquiring statistical information indicating a bit distribution. Further, it is possible to immediately determine a proper bit shift, that is, a decimal point position from the statistical information indicating a bit distribution.
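The two aggregation styles used in Comparative Example, counting per bit position to obtain a distribution and a column-wise OR to obtain maximum or minimum positions, may be sketched as follows. Each row is assumed to be a 40-bit word with at most one bit set, as output by a statistical information acquisition unit; the function names are hypothetical.

```python
def aggregate_counts(rows: list[int], width: int = 40) -> list[int]:
    """For each bit position, count how many rows have that bit set (distribution)."""
    return [sum((row >> i) & 1 for row in rows) for i in range(width)]


def aggregate_or(rows: list[int]) -> int:
    """OR all rows together, giving a flag string with a maximum frequency of 1."""
    out = 0
    for row in rows:
        out |= row
    return out
```

From the OR result, the maximum position can be read as `aggregate_or(rows).bit_length() - 1`. Note that `aggregate_counts` keeps one counter per bit position, 40 in total.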
However, in Comparative Example, the processor 10Z separately counts the statistical information at positions of all the bits (e.g., 40 bits) in the intermediate state of an operation. Accordingly, as illustrated in
[Statistical Information of Embodiment]
Hereinafter, a processor 10 of an information processing apparatus according to a first embodiment will be described (see
Here, the attention area is an area designated, for example, for the processor 10 from an application program by a user. Alternatively, the attention area may be an area designated by the processor 10 through the processing of the application program. As the attention area, for example, the higher and lower (2N−1) bits centered on a high-frequency position of the most significant bit position, within a numerical area (e.g., an area of single-precision 16 bits) that is expressible at the current accuracy, may be designated. As the attention area, for example, the higher and lower (2N−1) bits centered on a high-frequency position of the least significant bit position, within an area expressible at the current accuracy, may also be designated. The attention area of (2N−1) bits will be referred to as a window. When the size of the attention area is set as (2N−1) bits, N is called a window size parameter.
In the present embodiment, the processor 10 extracts information in a 1-bit-to-1-bit correspondence within a window. Meanwhile, the processor 10 detects the presence or absence of a flag at the higher-order bit side (one bit) and the lower-order bit side (one bit), outside the window. Here, as described above, the flag indicates bit 1 indicating (1) an unsigned most significant bit position, (2) an unsigned least significant bit position, (3) a position of a maximum value of unsigned most significant bit positions, or (4) a position of a minimum value of unsigned least significant bit positions, in the statistical information.
The most significant bit in the statistical information is a bit which becomes 1 in a special case where there is no value different from a sign bit in input data as a statistical information acquisition target. The most significant bit in the statistical information stores a sign bit when there is a value different from a sign bit in input data as a statistical information acquisition target. That is, the most significant bit in the statistical information does not have a relationship such as a higher order or a lower order with respect to the attention area. Thus, the processor 10 extracts a sign bit in the input data as the statistical information acquisition target, as it is, and sets the sign bit as the most significant bit of the statistical information. In a special case where there is no value different from a sign bit in the input data as the statistical information acquisition target, the processor 10 sets 1 to the most significant bit of the statistical information.
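The summarization described above may be sketched as follows: the bits inside the window are kept one-to-one, everything above the window collapses into a single flag, everything below collapses into another flag, and the most significant bit of the statistical information is passed through unchanged. The function name `summarize`, the tuple layout of the result, and the requirement that the window lie entirely within bits 0 to 38 are assumptions of this sketch.

```python
def summarize(stat: int, usr: int, n: int, width: int = 40):
    """Summarize a `width`-bit statistical word around a window centered on bit `usr`.

    Returns (top_bit, high_flag, window_bits, low_flag), where window_bits
    holds the 2N-1 bits from usr-(N-1) to usr+(N-1).
    """
    top = (stat >> (width - 1)) & 1            # passed through as is
    body = stat & ((1 << (width - 1)) - 1)     # bits 0 .. width-2
    lo_edge = usr - (n - 1)
    hi_edge = usr + (n - 1)
    if lo_edge < 0 or hi_edge > width - 2:
        raise ValueError("window must lie within the bits below the top bit")
    window = (body >> lo_edge) & ((1 << (2 * n - 1)) - 1)
    high_flag = int((body >> (hi_edge + 1)) != 0)        # any flag above the window
    low_flag = int((body & ((1 << lo_edge) - 1)) != 0)   # any flag below the window
    return top, high_flag, window, low_flag
```

With N=4, the summary holds 7 window bits plus 3 single bits, compared with 40 bits for the unsummarized word.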
[Instruction Format]
Hereinafter, descriptions will be made on an instruction format of an instruction to specify a user-specified bit position at the time of statistical information acquisition according to the present embodiment.
[First Instruction Format]
In an instruction format example 1, a function of designating a user-specified bit position is individually added to, for example, an operation instruction and a load instruction in which statistical information is acquired.
[Instruction Format Example 1.1]
For the command vmul_su vs, vt, vd, imm, usr, vector registers vs and vt are multiplied, the product is shifted by imm bits, rounding and saturation are performed, and the result is stored in a register vd outside the arithmetic unit. Statistical information on the multiplication result before the shift is acquired and accumulated in the statistical information register. When the statistical information is acquired, the (2N−1) bits centered on the usr bit are set as the attention area.
[Instruction Format Example 1.2]
For the command vld_su rs, rt, rd, usr, vector data is loaded from an address obtained by adding address registers rs and rt, and is stored in a vector register rd. Statistical information of the loaded data is acquired and accumulated in the statistical information register. When the statistical information is acquired, the (2N−1) bits centered on the usr bit are set as the attention area.
[Instruction Format Example 1.3]
For the command read_acc_su rd, imm, usr, the data of an accumulator register (40 bits) is shifted by imm bits, rounding and saturation are performed, and the result is stored in a scalar register rd. The processor 10 acquires statistical information from the data of the accumulator register, and accumulates the statistical information in the statistical information register. When the statistical information is acquired, the (2N−1) bits centered on the usr bit are set as the attention area.
The attention area may not be the area of (2N−1) bits centered on the usr bit as long as the attention area is determined by the usr bit. For example, the processor 10 may set, for example, higher-order (2N−1) bits with respect to the usr bit, or lower-order (2N−1) bits with respect to the usr bit, as the attention area.
[Second Instruction Format]
[Third Instruction Format]
An instruction to specify an independent user-specified bit position is added.
[Instruction Format Example 3.1] set_usr usr
The processor 10 stores a value usr (user-specified bit position information) in a designated position holding register 34 (see
By implementing an instruction in the instruction format as described above, the processor 10 accepts designation of a user-specified bit position from an application program. Then, by the processor 10, statistical information after the execution of an operation may be acquired, summarized, and aggregated, and then, accumulated in the statistical information register. Then, for example, by a statistical information register read command, the processor 10 may deliver the statistical information to the application program.
[Circuit Configuration]
The configuration and the operation of the statistical information acquisition unit 102 are the same as those of the statistical information acquisition unit 102Z (102A and 102B) in the Comparative Example, and thus, descriptions thereof will be omitted. The statistical information acquisition unit 102 is an example of a generator that outputs statistical information data indicating a bit distribution in operation result data. Similarly to the statistical information acquisition unit 102Z in Comparative Example as illustrated in
The configuration and the operation of the statistical information aggregating unit 104 are the same as those of the statistical information aggregating unit 104Z (104A and 104B) in Comparative Example, and thus, descriptions thereof will be omitted. The statistical information aggregating unit 104 is an example of a statistical information aggregating unit that outputs statistical information aggregated data obtained by aggregating summary data output by a plurality of above arithmetic units.
The configuration and the operation of the statistical information storage 105 are the same as those of the statistical information storage 105Z (105A) in Comparative Example, and thus, descriptions thereof will be omitted. The statistical information storage 105 is an example of a statistical information storage that stores the statistical information aggregated data.
[Statistical Information Summarizing Unit]
As illustrated in
[Window Bit Extraction Circuit 31]
For example, assuming that N=8 and USR=31, so that the window size is 15, the predetermined bit S = 40 − 8 − 31 = 1, and the barrel shifter 311 performs a logical left shift by one bit on the 39 input bits, which exclude the sign bit. The window bit extraction circuit 31 may then extract the higher-order 15 bits (=2N−1) from the shifted data. Through this configuration, the attention area centered on USR=31, with 7 (=N−1) higher-order bits and 7 lower-order bits, that is, 15 bits in total, is acquired. In the present embodiment, the window size parameter N is not limited to 8. When the USR is designated by a bit number starting from bit 0, 39 bits, not including the sign bit, may be set as the number of input bits (B_WID). The predetermined bit S is calculated as S = number of input bits (39) − (window size parameter N + user-specified position USR). The window bit extraction circuit 31 is an example of an extractor that extracts attention area data with a first predetermined size based on specified position information.
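The shift-then-extract step may be sketched in software as follows. This is only an illustrative model, not the circuit itself: the function name `extract_window` is hypothetical, USR is assumed to be a bit number starting from bit 0, and the general formula S = B_WID − (N + USR) with B_WID = 39 is used. Under that assumption USR=31 gives S=0, whereas the worked example above, which subtracts from 40, gives S=1.

```python
B_WID = 39  # number of input bits, excluding the sign bit (value taken from the text)


def extract_window(stat: int, n: int, usr: int, b_wid: int = B_WID) -> int:
    """Return the (2n-1)-bit attention area centered on the 0-based bit `usr`.

    Hypothetical software model of the barrel shifter 311 followed by
    extraction of the higher-order 2n-1 bits.
    """
    s = b_wid - (n + usr)  # predetermined shift amount S
    if s < 0 or usr < n - 1:
        raise ValueError("attention area does not fit inside the input bits")
    # Logical left shift by S within a b_wid-bit field (higher bits are dropped).
    shifted = (stat << s) & ((1 << b_wid) - 1)
    # Keep only the higher-order 2n-1 bits of the shifted data.
    return shifted >> (b_wid - (2 * n - 1))


if __name__ == "__main__":
    # A single 1 at bit 31 lands in the middle (bit 7) of the 15-bit window.
    print(bin(extract_window(1 << 31, n=8, usr=31)))  # prints 0b10000000
```

With N=8 the window covers bits USR−7 to USR+7, so a 1 located exactly at the USR bit always appears at bit 7 of the extracted 15-bit value.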
[Higher-Order Bit-Side Summary Circuit 32]
As illustrated in
The higher-order side mask bit generator 322, to which the bit width is input from the SUB circuit 321, generates a higher-order side mask pattern in which as many higher-order bits as the input bit width are set to 1. For example, when the SUB circuit 321 outputs a bit width of 3, the higher-order side mask bit generator 322 generates a higher-order side mask pattern in which the three higher-order bits are 1 and the other bits are 0, and outputs the higher-order side mask pattern to the higher-order side mask register 323.
The AND circuit 324 executes an AND operation between the bit string of the input data and the bit string of the higher-order side mask register 323. The OR circuit 325 executes an OR operation across the bits of the bit string resulting from that AND operation. Accordingly, when the portion of the input data selected by the AND circuit 324 and the higher-order side mask register 323 includes a bit of 1, the output of the OR circuit 325 becomes 1. Meanwhile, when that portion includes no bit of 1, that is, when all of its bits are 0, the output of the OR circuit 325 becomes 0.
That is, the higher-order bit-side summary circuit 32 extracts the bit string of the higher-order side summary area from the input data by means of the mask pattern of the higher-order side mask register 323, and summarizes the bit string into one bit by executing an OR operation across the bits of the higher-order side summary area. When at least one bit in the higher-order side summary area is 1, the summarized value becomes 1. Meanwhile, when all the bits in the higher-order side summary area are 0, the summarized value becomes 0. The higher-order bit-side summary circuit 32 is an example of a higher-order side summarizing unit that outputs higher-order side summary data, which is obtained by summarizing higher-order side data, in data other than attention area data, into a second predetermined size. One bit is an example of the second predetermined size, into which the higher-order side summary area is summarized by the higher-order bit-side summary circuit 32. The AND circuit 324 is an example of a circuit that executes an AND operation between higher-order side summary area data as a summarizing target, in statistical information data, and higher-order side mask data generated based on specified position information. The OR circuit 325 is an example of a circuit that executes an OR operation between all the bits of first AND result data which is a result of the AND operation.
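The mask, AND, and OR steps on the higher-order side may be modeled as follows. This is a minimal sketch under stated assumptions: the names `high_mask` and `summarize_high` are hypothetical, the data is treated as a 39-bit field, and the bit width that the SUB circuit 321 would supply is simply passed in as the `width` argument.

```python
B_WID = 39  # assumed width of the data field, excluding the sign bit


def high_mask(width: int, b_wid: int = B_WID) -> int:
    """Mask whose `width` higher-order bits are 1 and whose other bits are 0."""
    return ((1 << width) - 1) << (b_wid - width)


def summarize_high(data: int, width: int, b_wid: int = B_WID) -> int:
    """Summarize the higher-order side summary area into a single bit."""
    masked = data & high_mask(width, b_wid)  # role of the AND circuit 324
    return 1 if masked != 0 else 0           # role of the OR circuit 325 (OR of all bits)


if __name__ == "__main__":
    # With a bit width of 3, only bits 36 to 38 contribute to the summary.
    print(summarize_high(1 << 37, 3), summarize_high(1 << 35, 3))  # prints 1 0
```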
In the mask pattern circuit of
Accordingly, when mid[k] (k=1 to 39) in the decoder of
[Lower-Order Bit-Side Summary Circuit 33]
As illustrated in
The lower-order side mask bit generator 332, to which the bit width is input from the SUB circuit 331, generates a lower-order side mask pattern in which as many lower-order bits as the input bit width are set to 1. For example, when the SUB circuit 331 outputs a bit width of 4, the lower-order side mask bit generator 332 generates a lower-order side mask pattern in which the four lower-order bits are 1 and the other bits are 0, and outputs the lower-order side mask pattern to the lower-order side mask register 333.
The AND circuit 334 executes an AND operation between the bit string of the input data and the bit string of the lower-order side mask register 333. The OR circuit 335 executes an OR operation across the bits of the bit string resulting from that AND operation. Accordingly, when the portion of the input data selected by the AND circuit 334 and the lower-order side mask register 333 includes a bit of 1, the output of the OR circuit 335 becomes 1. Meanwhile, when that portion includes no bit of 1, that is, when all of its bits are 0, the output of the OR circuit 335 becomes 0.
That is, the lower-order bit-side summary circuit 33 extracts the bit string of the lower-order side summary area from the input data by means of the mask pattern of the lower-order side mask register 333, and summarizes the bit string into one bit by executing an OR operation across the bits of the lower-order side summary area. When at least one bit in the lower-order side summary area is 1, the summarized value becomes 1. Meanwhile, when all the bits in the lower-order side summary area are 0, the summarized value becomes 0. The lower-order bit-side summary circuit 33 is an example of a lower-order side summarizing unit that outputs lower-order side summary data, which is obtained by summarizing lower-order side data, in data other than attention area data, into a third predetermined size. One bit is an example of the third predetermined size, into which the lower-order side summary area is summarized by the lower-order bit-side summary circuit 33. The AND circuit 334 is an example of a circuit that executes an AND operation between lower-order side summary area data as a summarizing target, in statistical information data, and lower-order side mask data generated based on specified position information. The OR circuit 335 is an example of a circuit that executes an OR operation between all the bits of second AND result data which is a result of the AND operation.
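The lower-order side can be modeled in the same way. Again this is only a sketch: `low_mask` and `summarize_low` are hypothetical names, and the bit width that the SUB circuit 331 would supply is passed in as `width`.

```python
def low_mask(width: int) -> int:
    """Mask whose `width` lower-order bits are 1 and whose other bits are 0."""
    return (1 << width) - 1


def summarize_low(data: int, width: int) -> int:
    """Summarize the lower-order side summary area into a single bit."""
    masked = data & low_mask(width)  # role of the AND circuit 334
    return 1 if masked != 0 else 0   # role of the OR circuit 335 (OR of all bits)


if __name__ == "__main__":
    # With a bit width of 4, only bits 0 to 3 contribute to the summary.
    print(summarize_low(1 << 3, 4), summarize_low(1 << 4, 4))  # prints 1 0
```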
That is, in the mask pattern circuit of
Accordingly, when mid[k] (k=1 to 39) in the decoder of
That is, in the present embodiment, the statistical information acquisition unit 102 generates statistical information with the same number of bits (e.g., 40 bits) as within an arithmetic circuit. Then, the statistical information summarizing unit 30 keeps the most significant bit (one bit) and the attention area (2N−1 bits, e.g., 15 bits) of the statistical information acquired by the statistical information acquisition unit 102, and summarizes each of the higher-order side summary area and the lower-order side summary area into one bit. Therefore, the statistical information before summarization (e.g., 40 bits) is reduced to, for example, 18-bit summary information (1 + 1 + 15 + 1 bits), that is, summarized statistical information.
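The overall 40-bit to 18-bit reduction may be sketched end to end as follows. The function name `summarize` and the bit order of the packed result (most significant bit, higher-order summary bit, attention area, lower-order summary bit) are assumptions made for illustration only; the text does not specify the layout. USR is assumed to be a bit number starting from bit 0, so the attention area spans bits USR−N+1 to USR+N−1 of the 39 bits below the most significant bit.

```python
B_WID = 39  # bits below the most significant bit of the 40-bit statistical information


def summarize(stat: int, n: int, usr: int, b_wid: int = B_WID) -> int:
    """Pack statistical information into 2n+2 bits (18 bits when n=8).

    Assumed layout, from high to low:
    [MSB | higher-order summary | (2n-1)-bit attention area | lower-order summary]
    """
    low_w = usr - n + 1          # width of the lower-order side summary area
    high_w = b_wid - (usr + n)   # width of the higher-order side summary area
    if low_w < 0 or high_w < 0:
        raise ValueError("attention area does not fit inside the input bits")
    msb = (stat >> b_wid) & 1
    body = stat & ((1 << b_wid) - 1)
    high = 1 if (body >> (usr + n)) != 0 else 0          # OR of the bits above the window
    window = (body >> low_w) & ((1 << (2 * n - 1)) - 1)  # attention area kept as is
    low = 1 if (body & ((1 << low_w) - 1)) != 0 else 0   # OR of the bits below the window
    return (msb << (2 * n + 1)) | (high << (2 * n)) | (window << 1) | low


if __name__ == "__main__":
    # n=8, usr=20: bits 28..38 -> high summary, 13..27 -> window, 0..12 -> low summary.
    for bit in (39, 30, 20, 5):
        print(bit, format(summarize(1 << bit, n=8, usr=20), "018b"))
```

In this model the three areas have widths 11, 15, and 13 for N=8 and USR=20, which together account for all 39 bits below the most significant bit.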
The statistical information aggregating unit 104 aggregates the summarized statistical information in the same procedure as that of Comparative Example. The statistical information storage 105 stores the summarized statistical information in the same procedure as that of Comparative Example. In
An element that is regarded as MAX connected to the statistical information storage 105 is a circuit that selects the bit of a maximum value from a logical sum of accumulated unsigned most significant bit positions, in the statistical information stored in the statistical information storage 105, according to
As clearly seen from
As described above, in the first embodiment, the statistical information is summarized by the statistical information summarizing unit 30. As a result, downstream of the statistical information summarizing unit 30, the wirings are reduced because the bit widths of the statistical information aggregating unit 104 and the statistical information storage 105 are reduced. For example, when the number of bits in the arithmetic circuit is 40 bits and the summarized statistical information has 18 bits, the number of gates and the number of wirings are expected to be almost halved (18/40 = 0.45).
The statistical information, as illustrated in
The statistical information acquisition unit 102 in the first embodiment generates statistical information in which any one of bits is 1, with respect to the operation result by the vector operation arithmetic unit 131, and the scalar operation arithmetic unit 141. Accordingly, the statistical information acquisition unit 102 may faithfully generate statistical information indicating (1) an unsigned most significant bit position, (2) an unsigned least significant bit position, (3) a position of a maximum value of unsigned most significant bit positions, and (4) a position of a minimum value of unsigned least significant bit positions. The statistical information may be summarized because any one of bits in the statistical information is 1, and low-frequency distributions are formed on both sides of the attention area as illustrated in
Since the data in the attention area is shifted to the left by the barrel shifter 311 based on specified position information and is then extracted, the statistical information acquisition unit 102 may acquire the data in the attention area through a simple circuit configuration.
The higher-order bit-side summary circuit 32 executes an AND operation through the AND circuit 324 between a bit string of the higher-order side summary area as a summarizing target and a mask pattern of the higher-order side mask register 323 generated based on the user-specified position USR. Then, the higher-order bit-side summary circuit 32 performs an OR operation through the OR circuit 325, between all the bits of the first AND result data which is a result of the AND operation. Accordingly, the higher-order bit-side summary circuit 32 may simply summarize a bit string in the higher-order side summary area through two logic gates.
The lower-order bit-side summary circuit 33 executes an AND operation through the AND circuit 334 between a bit string of the lower-order side summary area as a summarizing target and a mask pattern of the lower-order side mask register 333 generated based on the user-specified position USR. Then, the lower-order bit-side summary circuit 33 performs an OR operation through the OR circuit 335, between all the bits of the second AND result data which is a result of the AND operation. Accordingly, the lower-order bit-side summary circuit 33 may simply summarize a bit string in the lower-order side summary area through two logic gates.
With reference to
The register 42 extracts the higher-order 2N−1 bits from the statistical information shifted by the barrel shifter 41. The OR circuits 43 and 44 OR the four bits at the higher-order side of the register 42 into one bit, and the four bits at the lower-order side of the register 42 into one bit. When the four higher-order side bits are ORed into one bit, the one bit is an example of the fourth predetermined size. When the four lower-order side bits are ORed into one bit, the one bit is an example of the fifth predetermined size. Through such a configuration, the attention area of the statistical information is summarized from 15 bits to 9 bits (1 + 7 + 1 bits). The number of bits to be summarized at each side of the attention area is not limited to four bits. Because the attention area itself is reduced in this way, the processor 10 in the second embodiment may further reduce the statistical information as compared to that in the first embodiment.
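The 15-bit to 9-bit reduction of the attention area may be modeled as follows. This is a sketch under assumptions: the function name `compress_window` and the packed bit order (higher-side summary bit, middle bits, lower-side summary bit) are hypothetical and are not taken from the text.

```python
def compress_window(window: int, width: int = 15, edge: int = 4) -> int:
    """OR `edge` bits at each end of the attention area into one bit each.

    With width=15 and edge=4 the result has 1 + 7 + 1 = 9 bits.
    """
    mid_w = width - 2 * edge                               # bits kept unchanged in the middle
    high = 1 if (window >> (width - edge)) != 0 else 0     # OR of the higher-side `edge` bits
    low = 1 if (window & ((1 << edge) - 1)) != 0 else 0    # OR of the lower-side `edge` bits
    mid = (window >> edge) & ((1 << mid_w) - 1)
    return (high << (mid_w + 1)) | (mid << 1) | low


if __name__ == "__main__":
    # The center bit of the 15-bit window (bit 7) maps to bit 4 of the 9-bit result.
    print(format(compress_window(1 << 7), "09b"))  # prints 000010000
```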
For example, when the number of bits of the statistical information is reduced from 40 bits to 12 bits, the wirings from the statistical information summarizing unit 30, including the attention area summarizing unit 40, to the statistical information storage 105 may be expected to be reduced to 12/40 = 0.3 of the original, that is, by 70%. One vector unit and one scalar unit are assumed for eight SIMD pieces. It is also assumed that each flip-flop is a D-type flip-flop with a gate count of 10. Under these assumptions, when the number of bits of the statistical information is reduced from 40 bits to 12 bits, it is estimated that the total number of gates is reduced by about 64%.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to an illustrating of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2017-182797 | Sep 2017 | JP | national |