The present application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2013-191570 filed on Sep. 17, 2013, with the Japanese Patent Office, the entire contents of which are incorporated herein by reference.
The disclosures herein relate to a data supply circuit, an arithmetic processing circuit, and a data supply method.
A large number of matrix computations are performed in signal processing for wireless communication. In particular, in LTE (long term evolution)-Advanced, which is expected to be a next-generation high-speed signal processing system for wireless communication, matrix computations account for a significant proportion of the total computation. Because of this, the use of a typical CPU (central processing unit) alone may not be sufficient to complete a desired computation within a desired processing time, since such a CPU is not suited for complex computations such as matrix computation.
In general, a circumstance that requires performing a process with a heavy computational load such as a matrix computation is coped with by employing a dedicated circuit for such a process. The configuration that uses a dedicated circuit, however, cannot cope with even a slight change in the processing method. When universal applicability is taken into account, a SIMD (i.e., single instruction multiple data) architecture is suited to deal with array data as used in matrix computations.
In the SIMD-type architecture, generally, a unit of data may be 32-bit scalar data. In the case of a system in which the SIMD width is four, a vector having a length of 4 in which 4 scalar data are arranged side by side is used, and the four elements of the vector are processed in parallel to perform high-speed computation. Such a SIMD-type architecture generally employs a unit data length of 32 bits, a SIMD width of 4, and a data processing width P of 128 (=4×32), for example.
Processors based on a stream (array) processing architecture that can handle not only scalar data but also a matrix and a vector as a data unit have been under development. In such a processor based on the stream processing architecture, a hardware configuration may be arranged such that the unit data length and SIMD width are treated as variable parameters, thereby making it possible to define instructions for various unit data lengths. In this hardware configuration, a unit data length UL and a SIMD width SIMD define a data processing width P (=UL×SIMD) that varies depending on the computation instruction.
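The relationship P = UL×SIMD may be illustrated by the following minimal Python sketch. The function name and the values used for the variable-width case are assumptions chosen for illustration only; they are not taken from the embodiment.

```python
def data_processing_width(unit_length, simd_width):
    """Return P = UL x SIMD, expressed in the same unit as unit_length."""
    if unit_length <= 0 or simd_width <= 0:
        raise ValueError("unit length and SIMD width must be positive")
    return unit_length * simd_width


# Conventional SIMD case from the text: 32-bit scalar data, SIMD width of 4.
p_fixed = data_processing_width(32, 4)    # 128 bits

# Hypothetical stream-processing instruction: a unit of 4 shorts
# (for example, a 2x2 matrix) processed two at a time.
p_variable = data_processing_width(4, 2)  # 8 shorts
```

In this sketch the width is simply recomputed per instruction, which mirrors the statement that P varies depending on the computation instruction.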
According to an aspect of the embodiment, a data supply circuit includes a buffer configured to store a plurality of data items each having a first width, a memory access unit configured to read source data stored in memory and to store the source data as one or more data items each having the first width in the buffer, and a selection control unit configured to repeat multiple times an operation of reading a data item having a second width shorter than or equal to the first width to read a plurality of data items each having the second width contiguously and sequentially from the buffer and configured to continue to read from a head end of the source data upon a read portion reaching a tail end of the source data.
The object and advantages of the embodiment will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
In the following, embodiments of the invention will be described with reference to the accompanying drawings.
In
The RF unit 10 down-converts the frequency of a radio signal received by an antenna 14, and converts the down-converted analog signal to a digital signal for transmission to a bus 13. The RF unit 10 converts a digital signal supplied through the bus 13 into an analog signal, and up-converts the analog signal into a radio-frequency signal for transmission through the antenna 14.
The dedicated hardware 11 includes a turbo unit for handling error correction codes, a Viterbi unit for performing a Viterbi algorithm, a MIMO (i.e., multiple-input multiple-output) unit for transmitting and receiving data through a plurality of antennas, and so on.
Each of the DSPs 12-1 through 12-3 includes a processor 21, a program memory 35, a peripheral circuit 23, and a data memory 30. The processor 21 includes a CPU 25 and a matrix processing processor 26. Various processes of the wireless communication signal processing such as a searcher process (synchronization), a demodulator process (demodulation), a decoder process (decoding), a codec process (coding), a modulator process (modulation), and the like are assigned to the DSPs 12-1 through 12-3.
The arithmetic processing circuit includes the data memory 30, a data supply circuit 31, an arithmetic data path (i.e., data arithmetic unit) 32, a data store circuit 33, an instruction decoder 34, and an instruction memory 35. The data supply circuit 31 is connected to the data memory 30, and reads data from the data memory 30. The arithmetic data path 32 is connected to the data supply circuit 31, and performs an arithmetic operation with respect to the data supplied from the data supply circuit 31. The data store circuit 33 is connected to the arithmetic data path 32 and to the data memory 30, and writes to the data memory 30 the resultant data of the arithmetic operation supplied from the arithmetic data path 32. The instruction memory 35 stores an instruction series comprised of a plurality of instructions, which are successively supplied to the instruction decoder 34. The instruction decoder 34 decodes supplied instructions to control the data supply circuit 31, the arithmetic data path 32, and the data store circuit 33 according to the decode results, thereby causing access to be made to the data memory 30 and arithmetic operations to be performed by the arithmetic data path 32.
In the example illustrated in
In the arithmetic data path 32, the SIMD width and the arithmetic unit length UL may be variables which can be set. Namely, the SIMD width and the arithmetic unit length UL may be different in arithmetic operations on an instruction-by-instruction basis.
The data length of the source data, i.e., the total length of the source data subjected to arithmetic operations, is referred to as a stream length SLS. When the arithmetic unit is a 2×2 real-number matrix (i.e., the arithmetic unit length UL is 4 shorts) and 1000 matrices are subjected to arithmetic operations, for example, the stream length SLS is 4000 shorts.
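The arithmetic in the above example may be checked with the following short Python sketch. The function and constant names are hypothetical and serve only to restate the example.

```python
def stream_length(unit_length, num_units):
    """Return the stream length SLS as unit length times number of units."""
    return unit_length * num_units


# A 2x2 real-number matrix occupies 4 shorts (one short per element).
UL_2X2_MATRIX = 2 * 2

# 1000 such matrices give a stream length of 4000 shorts.
sls = stream_length(UL_2X2_MATRIX, 1000)
```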
According to the result of decoding the instruction “opecode=mul” fetched from the instruction memory 35 (see
The address at which the storing of the destination data dst starts in the memory 30 is Z. The data length of the destination data dst is 1000 matrices as counted in arithmetic units. The instruction codes “dst addr=Z” and “dst length=1000” indicating these are supplied to the data store circuit 33, which, in response thereto, successively writes 1000 matrices to start address Z and subsequent addresses.
Since the data length of the destination data dst is 1000 matrices, i.e., the data length of arithmetic operation outputs is 1000 matrices, matrix arithmetic operations by the arithmetic data path 32 are performed until 1000 matrices are output. As for the first source data src0, a total data length of 1000 matrices is equal to the data length of arithmetic operation outputs. Accordingly, it suffices for the data supply circuit 31 to successively read matrix data of the first source data src0 from the first matrix to the last matrix and to supply these matrix data to the arithmetic data path 32. As for the second source data src1, a total data length of 20 matrices is shorter than the data length of arithmetic operation outputs. Accordingly, the data supply circuit 31 successively reads matrix data of the second source data src1 from the first matrix to the last matrix, followed by returning to the first matrix to repeat successively reading matrix data from the first matrix to the last matrix. In this manner, the data supply circuit 31 repeats the operation of successively reading 20 matrices to supply the retrieved data to the arithmetic data path 32. When the number of repetitions of reading the second source data src1 reaches 50, the total number of retrieved matrices is 1000, which is equal to 20 matrices multiplied by 50 times. With this, the read operation comes to an end.
As another example, the data length of the first source data src0 may be 1000 matrices, and the data length of the second source data src1 is 20 matrices, with the data length of the destination data dst being 2000 matrices. In this case, the data supply circuit 31 successively reads matrix data of the first source data src0 from the first matrix to the last matrix, followed by returning to the first matrix to repeat successively reading matrix data from the first matrix to the last matrix. When the number of repetitions of reading the first source data src0 reaches 2, the total number of retrieved matrices is 2000, which is equal to 1000 matrices multiplied by 2 times. With this, the read operation comes to an end. When the number of repetitions of reading the second source data src1 reaches 100, the total number of retrieved matrices is 2000, which is equal to 20 matrices multiplied by 100 times. With this, the read operation comes to an end.
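The repeated reading described in the two examples above may be modeled, purely as a behavioral sketch, by the following Python code. Matrices are represented by integer labels, and the helper names `supply` and `repetitions` are assumptions introduced here; the sketch says nothing about how the circuit itself is built.

```python
from itertools import cycle, islice


def supply(source, output_length):
    """Read items from source in order, returning to the head at the tail,
    until output_length items have been supplied."""
    return list(islice(cycle(source), output_length))


def repetitions(source_length, output_length):
    """Number of full passes over the source; assumes an exact multiple."""
    count, remainder = divmod(output_length, source_length)
    if remainder:
        raise ValueError("sketch assumes output length is a multiple of source length")
    return count


# Second source data src1: 20 matrices, destination length 1000 matrices.
src1 = list(range(20))
supplied = supply(src1, 1000)
```

With these definitions, `repetitions(20, 1000)` gives the 50 passes of the first example, while `repetitions(1000, 2000)` and `repetitions(20, 2000)` give the 2 and 100 passes of the second example.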
In
The selection control unit 42 includes a data selecting unit 45 and a control circuit 46. The selection control unit 42 successively repeats the operation of reading data having a width P by selecting P (≦M) (short) consecutive unit data items from the buffer queue 41, thereby reading data items each having the width P contiguously and sequentially from the buffer queue 41. Specifically, the selection control unit 42 first selects P (≦M) (short) consecutive unit data items sequentially from the top of the M unit data items having the width M that were stored earliest in the buffer queue 41. The selection control unit 42 may supply the P selected unit data items to the arithmetic data path 32. In the case of the data transfer width being fixed (e.g., width M) between the selection control unit 42 and the arithmetic data path 32, the selection control unit 42 may supply data having the width M inclusive of the P selected unit data items to the arithmetic data path 32. The M-P unit data items other than the P selected unit data items may be any data whose value does not matter.
After selecting the P consecutive unit data items, the selection control unit 42 newly selects P consecutive unit data items sequentially from the unit data item next following the last unit data item that was already selected, and supplies the P newly selected unit data items to the arithmetic data path 32. Repeating the above-noted operation, the selection control unit 42 successively reads a plurality of data items each having the width P contiguously from the buffer queue 41. At some point, a unit data item selected by the selection control unit 42 may be the last unit data item of the data having width M. In such a case, the next following data having the width M is retrieved from the buffer queue 41, followed by continuing to select the first unit data item and subsequent unit data items of this newly retrieved data having the width M.
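The selection behavior described above may be sketched in Python as follows. This is a software model under stated assumptions, not the circuit: the buffer queue 41 is modeled as a `deque` of lists each holding M unit data items, and the function name `read_p_wide` is hypothetical.

```python
from collections import deque


def read_p_wide(queue_entries, p, count):
    """Read `count` groups of p consecutive unit data items from M-wide entries.

    When the last unit data item of the current entry has been selected,
    the next entry is retrieved and selection continues from its first item.
    Raises IndexError if the queue runs out of entries.
    """
    fifo = deque(queue_entries)
    current = list(fifo.popleft())  # entry stored earliest
    pos = 0
    groups = []
    for _ in range(count):
        group = []
        while len(group) < p:
            if pos == len(current):
                current = list(fifo.popleft())
                pos = 0
            group.append(current[pos])
            pos += 1
        groups.append(group)
    return groups


# Two entries of width M = 32; read five groups of width P = 12.
entries = [list(range(0, 32)), list(range(32, 64))]
groups = read_p_wide(entries, 12, 5)
```

In this example the third group spans the boundary between the two entries, holding items 24 through 35 without any gap.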
In step S1 of
As long as the loaded data is not the last one of the source data having the stream length SLS, the loaded data having the width M are successively stored in the buffer queue 41. When the loaded data is the last one of the source data having the stream length SLS, the source data may be present only in part of the data having the width M retrieved through the bus. In such a case, the invalid field (i.e., the bit field where no source data is present) is removed. To be more specific, when there is an invalid field in data having the width M that include the last one of the source data having the stream length SLS, the head part of the source data that is read in the next one of the repetitive cycles is used to fill the invalid field.
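The effect of filling the invalid field with the head part of the source data may be illustrated by the following Python sketch, which lays out a repeatedly read stream as gapless words of width M. The function name `pack_words` and the use of integer indices to stand for unit data items are assumptions made for illustration.

```python
def pack_words(stream, m, num_words):
    """Return num_words words of width m filled without gaps by reading
    `stream` cyclically, so that the tail of one pass is followed directly
    by the head of the next pass."""
    n = len(stream)
    if n == 0:
        raise ValueError("stream must not be empty")
    return [[stream[(w * m + i) % n] for i in range(m)] for w in range(num_words)]


# A stream of SLS = 34 unit data items stored as words of width M = 32.
stream = list(range(34))
words = pack_words(stream, 32, 3)
```

Here the second word holds the last 2 items of the stream followed by its first 30 items, and the third word holds the next 4 items followed by the first 28 items of the following pass.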
In step S4, the selection control unit 42 supplies data to the arithmetic data path 32 by adjusting the speed of data consumption to the unit of P. Namely, the selection control unit 42 retrieves data of the width P from the buffer queue 41 in each arithmetic operation cycle to supply the retrieved data to the arithmetic data path 32. With this arrangement, data having the data processing width P subjected to an arithmetic operation is supplied in each arithmetic operation cycle from the data supply circuit 31 to the arithmetic data path 32.
In step S5, the arithmetic data path 32 performs an indicated arithmetic operation in accordance with the decode result obtained in step S1. Further, the data store circuit 33 stores the resultant data of the arithmetic operation in the data memory 30. In step S6, the memory access unit 40, for example, checks whether the processing of all the data of the stream length SLS is completed. In the case of the processing of all the data being not completed, the procedure goes back to step S3 for further execution of the subsequent steps.
The check as to whether the processing of all the stream data is completed may be dependent on the number of output data items of arithmetic operation results. As was previously described, when the data length of the first source data src0 is 1000 matrices, and the data length of the destination data dst is 2000 matrices, the first source data src0 is read twice. In such a case, all the data of the stream length SLS are read the first time, and are then read the second time in the case of SLS being longer than M. In this manner, in the operation of contiguously reading a plurality of data items each having the width P sequentially from a plurality of data items each having the width M stored in the buffer queue 41, the event that data reading reaches the end of the data of the data length SLS can trigger an action of continuing to read data from the head of the data of the data length SLS.
In the case of the check in step S6 indicating that the processing of all the data is completed, the procedure for the instruction decoded in step S1 comes to an end.
In the case of the check in step S2 indicating that SLS is shorter than or equal to M, in step S7, the memory access unit 40 loads data of the width M only once, and pushes the loaded data into the FIFO of the buffer queue 41. Namely, the memory access unit 40 stores the data having the width M inclusive of the data of the stream length SLS only once in the buffer. Since SLS is shorter than or equal to M, only one load and push operation serves to store all the source data in the buffer queue 41.
In step S8, the selection control unit 42 supplies data to the arithmetic data path 32 by copying the data and adjusting the speed of data consumption to the unit of P. Namely, the selection control unit 42 retrieves data of the width P from the buffer queue 41 in each arithmetic operation cycle to supply the retrieved data to the arithmetic data path 32. To be more specific, the selection control unit 42 successively reads a plurality of data items each having the width P contiguously (i.e., without any gap) from a data portion of the one data item of the width M stored in the buffer queue 41 wherein the noted data portion corresponds to the data of the stream length SLS. When reading reaches the end of the data portion, the selection control unit 42 continues to read data from the head (i.e., start point) of the data portion. For example, Q (<P) unit data items may be selected at the end of the data portion that corresponds to the data of the stream length SLS. In such a case, further P-Q unit data items are selected sequentially from the head of such a data portion, and these P-Q unit data items are placed to follow the Q unit data items to create data of P unit data items. With this arrangement, data having the data processing width P subjected to an arithmetic operation is supplied in each arithmetic operation cycle from the data supply circuit 31 to the arithmetic data path 32.
In step S9, the arithmetic data path 32 performs an indicated arithmetic operation in accordance with the decode result obtained in step S1. Further, the data store circuit 33 stores the resultant data of the arithmetic operation in the data memory 30. In step S10, the memory access unit 40, for example, checks whether the processing of all the data of the stream length SLS is completed. In the case of the processing of all the data being not completed, the procedure goes back to step S8 for further execution of the subsequent steps. In the case of the check in step S10 indicating that the processing of all the data is completed, the procedure for the instruction decoded in step S1 comes to an end.
It may be noted that in the case of SLS being shorter than or equal to M, the memory access unit 40 loads data of the width M only once. The fact that it suffices to load data only once results in reduced power consumption.
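The wrap-around read for the case of SLS being shorter than or equal to M, in which Q unit data items at the end of the data portion are followed by P-Q unit data items from its head, may be sketched in Python as follows. The function name `read_with_wrap` is hypothetical, and the sketch additionally assumes that P does not exceed SLS so that at most one wrap occurs per read; the text does not state this restriction.

```python
def read_with_wrap(portion, start, p):
    """Read p unit data items from `portion`, wrapping to its head at the tail.

    Assumes 0 <= start < len(portion) and 0 < p <= len(portion).
    Returns the p items and the start position for the next cycle.
    """
    sls = len(portion)
    if not (0 <= start < sls) or not (0 < p <= sls):
        raise ValueError("sketch assumes 0 <= start < SLS and 0 < P <= SLS")
    tail = portion[start:start + p]   # Q items up to the end (Q == p if no wrap)
    head = portion[:p - len(tail)]    # P - Q items taken from the head
    return tail + head, (start + p) % sls


# One loaded word of width M = 32 whose first 12 items are the source data.
word = list(range(12)) + [None] * 20
portion = word[:12]
first, next_start = read_with_wrap(portion, 0, 8)
second, next_start = read_with_wrap(portion, next_start, 8)
```

Because the word is loaded only once, every subsequent read in this sketch is served from the same `portion` without any further memory access.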
As illustrated in FIG. 7-(a), data of the stream length SLS is stored in the data memory 30. The stream length SLS is longer than the width M. The data of the stream length SLS are read by the memory access unit 40 such that data of the width M is read at a time for storage in the buffer queue 41. FIG. 7-(b) illustrates data 51 stored in the buffer queue 41. The operation of reading data having the width P by selecting P (≦M) consecutive unit data items from the data stored in the buffer queue 41 is repeated multiple times, thereby reading data items 61 through 64 each having the width P contiguously and sequentially from the buffer queue 41. The data item 65 reaches the end of the data 51. Before retrieving the data item 65 having the width P, the memory access unit 40 reads data of the stream length SLS from the data memory 30 to store this read data as data 52 in the buffer queue 41. With this arrangement, a plurality of data items 61 through 69 each having the width P can be read contiguously and sequentially from the buffer queue 41. Each of the data items 61 through 69 having the width P is read in a different arithmetic operation cycle. That is, one data item is read in one arithmetic operation cycle.
In the example of an operation illustrated in
As illustrated in FIG. 8-(a), data of the stream length SLS is stored in the data memory 30. The stream length SLS is shorter than the width M. The data of the stream length SLS are loaded by the memory access unit 40 as data of the width M for storage in the buffer queue 41. FIG. 8-(b) illustrates data 70 stored in the buffer queue 41. The operation of reading data having the width P by selecting P (≦M) consecutive unit data items from the data stored in the buffer queue 41 is repeated multiple times, thereby reading data items 71 through 75 each having the width P contiguously and sequentially from the buffer queue 41. Since the data item 73 having the width P reaches the end of the data 70, the reading operation returns to the head of the data 70 to continue to select and read data from the head of the data 70. The same applies in the case of the data 75 having the width P. With this arrangement, a plurality of data items 71 through 75 each having the width P can be read contiguously and sequentially from the buffer queue 41. Each of the data items 71 through 75 having the width P is read in a different arithmetic operation cycle. That is, one data item is read in one arithmetic operation cycle.
The data of the width M (32 shorts in this example) that was stored earliest in the buffer queue 41 is retrieved from the buffer queue 41, in response to the “1” state of a POP signal, to be stored in the buffer circuit 82 through the selector circuit 81. At this time, the selector circuit 81 is set in the state to select the input on the right-hand side in response to the “1” state of the POP signal. With the data having a width of 32 being stored in the buffer circuit 82, the 32-short-wide data being output from the buffer queue 41 (i.e., the 32-short-wide data that is the earliest stored as of this moment) is the next data following the data stored in the buffer circuit 82.
In response to the “1” state of the POP signal, the memory access unit 40 may read from the data memory 30 a remaining portion of the data of the stream length SLS that is not yet stored in the buffer queue 41, thereby storing the read data in the buffer queue 41 as succeeding data. In so doing, the data read from the data memory 30 may reach the end of the data of the stream length SLS. In such a case, reading may resume from the head portion of the data of the stream length SLS in response to the next “1” state of the POP signal. In this case, as illustrated in FIG. 7-(b), data may be stored in the buffer queue 41 such that the head portion of the data of the stream length SLS follows, without a gap, the end of the data of the stream length SLS that was previously stored.
The combining circuit 83 outputs 64-short-wide data BUFOUT obtained by placing, side by side, 32-short-wide data stored in the buffer circuit 82 and next 32-short-wide data output from the buffer queue 41. The length of the data BUFOUT is 64 shorts×16 bits, which is equal to 1024 bits.
The selector circuit 84 selects P consecutive unit data items from the 64-short-wide data BUFOUT output from the combining circuit 83 as specified by selection control signals SEL00 through SEL31 that are supplied from the control circuit 46. In actuality, the output of the data selecting unit 45 is 32 shorts in width. The P selected consecutive unit data items may be situated in a contiguous part (typically in the leftmost contiguous part) of the 32-short-wide output data. The arithmetic data path 32 performs an arithmetic operation only with respect to data having the data processing width P. Accordingly, the P consecutive unit data items situated in the leftmost part, for example, of the 32-short-wide data output from the data selecting unit 45 are subjected to such an operation.
Specifically, the selector 84-1 selects and outputs, from the 64-short-wide data BUFOUT, the 1-short-wide unit data item situated at the position that is specified by the selection control signal SEL00. Further, the selector 84-2 selects and outputs, from the 64-short-wide data BUFOUT, the 1-short-wide unit data item situated at the position that is specified by the selection control signal SEL01. Similarly, the selector 84-32 selects and outputs, from the 64-short-wide data BUFOUT, the 1-short-wide unit data item situated at the position that is specified by the selection control signal SEL31.
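The roles of the combining circuit 83 and the selector circuit 84 may be modeled by the following Python sketch. It is a behavioral illustration only: the function names `combine` and `select` are assumptions, unit data items are represented by integers, and each selection control signal is represented by a position within BUFOUT.

```python
BITS_PER_SHORT = 16


def combine(buffer_word, next_word):
    """Place the word held in the buffer circuit and the next word output
    from the buffer queue side by side, as BUFOUT."""
    return list(buffer_word) + list(next_word)


def select(bufout, sel):
    """Output, for each selection control signal, the unit data item at the
    position it specifies within BUFOUT."""
    return [bufout[position] for position in sel]


# Two 32-short-wide words form the 64-short-wide BUFOUT.
bufout = combine(range(0, 32), range(32, 64))
bufout_bits = len(bufout) * BITS_PER_SHORT  # 64 shorts x 16 bits

# Eight selectors picking positions 8 through 15.
picked = select(bufout, list(range(8, 16)))
```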
The 32 unit data items situated at the head of the data having a stream length SLS of 34 are stored in the buffer circuit 82 illustrated in
In the first cycle (cycle=0), the selection control signals SEL00 through SEL07 are 0 through 7, respectively, so that the 0-th unit data item (i.e., leftmost item) through the 7-th unit data item (i.e., eighth item from the left) are selected from the 64-short-wide data BUFOUT. In the next cycle (cycle=1), the selection control signals SEL00 through SEL07 are 8 through 15, respectively, so that the 8-th unit data item (i.e., ninth item from the left) through the 15-th unit data item (i.e., sixteenth item from the left) are selected from the 64-short-wide data BUFOUT. Thereafter, cycles proceed similarly, such that data items each having the width P are selected and read contiguously and sequentially by utilizing the buffer circuit 82.
In the fifth cycle (cycle=4), the selection control signals SEL00 through SEL07 are 32 through 39, respectively, so that the 32-nd unit data item through the 39-th unit data item are selected from the 64-short-wide data BUFOUT. At this time, the POP signal is set to “1”. Accordingly, in the next following cycle, the 2 unit data items at the end of the data having a stream length SLS of 34 and the first 30 unit data items subsequent thereto are stored in the buffer circuit 82 illustrated in
In the sixth cycle, the selection control signals SEL00 through SEL07 are 8 through 15, respectively, so that the 8-th unit data item (i.e., ninth item from the left) through the 15-th unit data item (i.e., sixteenth item from the left) are selected from the 64-short-wide data BUFOUT. Thereafter, cycles proceed similarly, such that data items each having the width P are selected and read contiguously and sequentially.
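The selection values quoted above for a stream length SLS of 34, a width M of 32, and a data processing width P of 8 may be replayed with the following Python sketch to confirm that they yield contiguous stream data. The selection values and the cycle in which the POP signal is “1” are taken from the description; the model of the buffer circuit and buffer queue as a list of gapless words, and the names `pack_indices` and `replay`, are assumptions.

```python
def pack_indices(sls, m, num_words):
    """Stream indices as stored in gapless words of width m."""
    return [[(w * m + i) % sls for i in range(m)] for w in range(num_words)]


def replay(sls, m, schedule):
    """Apply (sel, pop) pairs cycle by cycle and return the selected stream
    indices per cycle. `head` is the word held in the buffer circuit."""
    words = pack_indices(sls, m, len(schedule) + 2)
    head = 0
    outputs = []
    for sel, pop in schedule:
        bufout = words[head] + words[head + 1]
        outputs.append([bufout[s] for s in sel])
        if pop:
            head += 1  # the next word moves into the buffer circuit
    return outputs


# Cycles 0 to 3 select 0..31 in steps of 8; cycle 4 selects 32..39 with
# POP = 1; the following cycle selects 8..15.
schedule = [(list(range(8 * c, 8 * c + 8)), False) for c in range(4)]
schedule.append((list(range(32, 40)), True))
schedule.append((list(range(8, 16)), False))
outputs = replay(34, 32, schedule)
flat = [index for group in outputs for index in group]
```

Under this model the six cycles deliver stream indices 0 through 33 followed by 0 through 13, that is, 48 unit data items read without a gap across the wrap from the tail to the head of the stream.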
The 32 unit data items situated at the head of the data having a stream length SLS of 34 are stored in the buffer circuit 82 illustrated in
In the first cycle (cycle=0), the selection control signals SEL00 through SEL31 are 0 through 31, respectively, so that the 0-th unit data item (i.e., leftmost item) through the 31-st unit data item (i.e., rightmost item) are selected from the 64-short-wide data BUFOUT. At this time, the POP signal is set to “1”. Accordingly, in the next following cycle, the 2 unit data items at the end of the data having a stream length SLS of 34 and the first 30 unit data items subsequent thereto are stored in the buffer circuit 82 illustrated in
In the next cycle (cycle=1) also, the selection control signals SEL00 through SEL31 are 0 through 31, respectively, so that the 0-th unit data item (i.e., leftmost item) through the 31-st unit data item (i.e., rightmost item) are selected from the 64-short-wide data BUFOUT. At this time, the POP signal is set to “1”. Accordingly, in the next following cycle, the 4 unit data items at the end of the data having a stream length SLS of 34 and the first 28 unit data items subsequent thereto are stored in the buffer circuit 82 illustrated in
At the beginning, the 12 unit data items of the data having a stream length SLS of 12 are stored without a gap therebetween in the leftmost side of the buffer circuit 82 illustrated in
In the first cycle (cycle=0), the selection control signals SEL00 through SEL07 are 0 through 7, respectively, so that the 0-th unit data item (i.e., leftmost item) through the 7-th unit data item (i.e., eighth item from the left) are selected from the 64-short-wide data BUFOUT. In the next cycle (cycle=1), the selection control signals SEL00 through SEL07 are 8, 9, 10, 11, 0, 1, 2, and 3, respectively. Accordingly, the 8-th unit data item (i.e., ninth item from the left) through the 11-th unit data item (i.e., twelfth item from the left) and, subsequent thereto, the 0-th unit data item (i.e. leftmost item) through the 3-rd unit data item (i.e., fourth item from the left) of the 64-short-wide data BUFOUT are selected. Thereafter, cycles proceed similarly, such that data items each having the width P are selected and read contiguously and sequentially by utilizing the buffer circuit 82. In this read operation, the stream length SLS is shorter than the width M, so that the POP signal is never set to “1”.
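The selection control signal values given above for a stream length SLS of 12 and a data processing width P of 8 follow a modulo pattern, which may be expressed by the following Python sketch. The function name `sel_signals` is hypothetical, and the closed-form expression is an illustration of the resulting values rather than a description of how the control circuit 46 derives them.

```python
def sel_signals(sls, p, cycle):
    """Return the p selection positions for the given cycle when the whole
    stream of length sls sits at the leftmost side of the buffer circuit."""
    start = (cycle * p) % sls
    return [(start + lane) % sls for lane in range(p)]


cycle0 = sel_signals(12, 8, 0)
cycle1 = sel_signals(12, 8, 1)
cycle2 = sel_signals(12, 8, 2)
```

For cycle 1 this gives 8, 9, 10, 11, 0, 1, 2, and 3, matching the values stated in the description; every position stays below SLS, which is consistent with the POP signal never being needed in this case.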
A description will be given of an example of the operation of the control circuit 46 by referring to
In the example illustrated in
The selector circuit 97 receives SLS_MOD_NEXT output from each of the SEL_WRAP circuits 93-1 through 93-32. The selector circuit 97 further receives the value obtained by subtracting “1” from the data processing width P, i.e., “7” in this example, as a selection control signal. The selector circuit 97 selects the SLS_MOD_NEXT signal having a value of “8” output from the 7-th, as counted when the starting number is “0”, SEL_WRAP circuit 93-8 (i.e., having the eighth ordinal position). The selector circuit 97 supplies the selected value to the SLS_MOD circuit 91. With this configuration, the SLS_MOD signal stored in the SLS_MOD circuit 91 becomes “8” in the next cycle.
In the ADD_OFFSET circuit 95 illustrated in
A description will be given of another example of the operation of the control circuit 46 by referring to
In the example illustrated in
In the example illustrated in
The selector circuit 97 receives SLS_MOD_NEXT output from each of the SEL_WRAP circuits 93-1 through 93-32. The selector circuit 97 further receives the value obtained by subtracting “1” from the data processing width P, i.e., “7” in this example, as a selection control signal. The selector circuit 97 selects the SLS_MOD_NEXT signal having a value of “8” output from the 7-th, as counted when the starting number is “0”, SEL_WRAP circuit 93-8 (i.e., having the eighth ordinal position). The selector circuit 97 supplies the selected value to the SLS_MOD circuit 91. With this configuration, the SLS_MOD signal stored in the SLS_MOD circuit 91 becomes “8” in the next cycle.
In the ADD_OFFSET circuit 95 illustrated in
The shifter circuit 127 illustrated in
In
In the case of SLS being shorter than or equal to M, the output of the SLS check circuit 121 is set to “1”, which causes the selector circuit 122 to select and output the value of the stream length SLS. As a result, in the case of the stream length SLS being “12” as illustrated in
In the control circuit 46 illustrated in
The arithmetic processing circuit illustrated in
Further, the present invention is not limited to these embodiments, but various variations and modifications may be made without departing from the scope of the present invention.
For example, the description given in connection with
According to at least one embodiment, data retrieved from memory can be efficiently supplied to an arithmetic unit in response to the requested computation process.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiment(s) of the present inventions have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2013-191570 | Sep 2013 | JP | national |