FIGS. 5(a) to 5(f) are diagrams schematically showing the procedure of transferring 96-bit data in a cache memory 3 to a length register 4;
a) is a diagram showing data in the length register 4 immediately after the data transfer of 128 bits (before a cyclic shift), and
FIGS. 14(a) to 14(d) are diagrams schematically showing the procedure of transferring 96-bit data in a cache memory 3 to a length register 4;
FIGS. 18(a) to 18(f) are diagrams showing the values of the length register 4 after data transfer;
FIGS. 26(a) to 26(d) are diagrams showing the values of a length register 4 after data transfer;
FIGS. 35(a) to 35(e) are diagrams showing the values of a length register 4 after data transfer;
FIGS. 37(a) and 37(b) are diagrams showing an example of the data transfer for the rectangular area 10 composed of 2 rows in which one row has 2 bytes;
FIGS. 39(a) and 39(b) are diagrams showing an example of the data transfer for the rectangular area 10 composed of 2 rows in which one row has 3 bytes.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
The data transfer apparatus in
The data transfer apparatus according to the present embodiment can transfer data of an arbitrary length equal to or less than the data width of the length register 4 from the cache memory 3 to the length register 4. The length register 4 has a data area n times (n is an integer of 1 or more) the read unit (e.g., 32 bits) of the cache memory 3. An example will be described below in which sequential 96-bit data in the cache memory 3 is transferred to the length register 4. The present embodiment is characterized in that data can be transferred from the cache memory 3 to the length register 4 by one instruction even if the start address of the data to be transferred is not located at the boundary of 32 bits.
The start address register 11 stores a start address indicating the position of the head of the data to be transferred. The transfer count register 12 stores the number of bytes for a data transfer. When 96-bit data is transferred, 12 (bytes) is stored in the transfer count register 12.
The memory address register 16 stores a memory address which is a read address of the cache memory 3. The present transfer count register 17 stores the number of remaining bytes to be transferred. The length register access location register 18 stores an address indicating a location in the length register 4 which is accessed.
The transfer count generator 22 calculates a difference value between a current memory address and a breakpoint address of 32 bits. When the start address of the data read from the cache memory 3 is not located at the breakpoint of 32 bits, the transfer count generator 22 outputs a difference value between the start address and the breakpoint address immediately thereafter. Subsequently, data is read from the cache memory 3 at every breakpoint of 32 bits, so that the transfer count generator 22 outputs “4” corresponding to 4 bytes.
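The difference value described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bytes_to_boundary` is hypothetical, and the 4-byte read unit is taken from the 32-bit example in the text.

```python
def bytes_to_boundary(address, unit=4):
    """Number of bytes from `address` up to the next `unit`-byte boundary.

    Hypothetical model of the value output by the transfer count
    generator 22: an aligned address yields a full unit (4 bytes),
    an unaligned address yields the distance to the next boundary.
    """
    return unit - (address % unit)


# Start address 0x1000_0003 lies 1 byte before the boundary 0x1000_0004.
first_step = bytes_to_boundary(0x1000_0003)
# Once aligned, every later step is a full 4 bytes.
later_step = bytes_to_boundary(0x1000_0004)
```

With the start address of the later worked example, the first step is 1 byte and every following step is 4 bytes.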
The adder 19 generates an address in which the difference value stored in the transfer count generator 22 is added to the memory address stored in the memory address register 16. When the start address of the data to be transferred is not located at the boundary of 32 bits, data transfer is started from this start address. However, when the next data transfer is carried out, the difference value up to the boundary of 32 bits is added to the start address, so that an address corresponding to the position of the boundary is output from the adder 19. Then, the adder 19 sequentially outputs values in which 4 is added to the memory address.
The subtracter 20 subtracts the difference value output from the transfer count generator 22 from the number of untransferred bytes stored in the present transfer count register 17. When the start address of the data to be transferred is not located at the boundary of 32 bits, the first value generated is the number of untransferred bytes minus the number of bytes from the start address to the breakpoint address immediately after the start address. Subsequently, values are sequentially output in which “4” is subtracted from the number of untransferred bytes stored in the present transfer count register 17.
Each of the multiplexers 13 to 15 selects and outputs one of two input signals in accordance with the logic of a control signal from the central controller 23. The logic of this control signal is switched at the start of data transfer.
More specifically, the multiplexer 13 selects the start address stored in the start address register 11 at the start of data transfer, and then selects the output of the adder 19. Thus, the output of the multiplexer 13 generally increases by four bytes every time data is transferred. The output of the multiplexer 13 is stored in the memory address register 16.
The multiplexer 14 selects the total transfer amount stored in the transfer count register 12 at the start of the data transfer, and then selects the output of the subtracter 20. Thus, the output of the multiplexer 14 generally decreases by four bytes every time data is transferred. The output of the multiplexer 14 is stored in the present transfer count register 17.
The central controller 23 has a start address low bit column register 24, an original transfer count register 25 and a cache request enable register 26. The central controller 23 stores data to these registers in accordance with the start address supplied from the outside.
The start address low bit column register 24 stores the value of low 2 bits of the start address. The value of the low 2 bits makes it possible to detect a difference value between the start address and the breakpoint address of 32 bits.
The original transfer count register 25 stores as is the total transfer amount set in the transfer count register 12. The cache request enable register 26 stores an access request enable signal for instructing the cache memory 3 to transfer data.
Here, the instruction to transfer data is, for example, LDQW (R0) V0. This instruction loads 96-bit data from the address indicated by a register R0 into the length register 4 (V0). The processing operation in
On receipt of an instruction to start data transfer from the decoder 2 (step S1), the controller 5 initializes the memory address register 16, the present transfer count register 17 and the length register access location register 18 (step S2). More specifically, the start address stored in the start address register 11 is stored in the memory address register 16, and the total transfer amount (12 in this case) stored in the transfer count register 12 is stored in the present transfer count register 17. The length register access location register 18 is initialized to 0.
The length register 4 is divided into four by 32 bits (128 bits in total), and indices such as 0, 1, 2, 3 are assigned to the bits in descending order. For example, the index 0 indicates 127th bit to 96th bit of the length register 4. The values of these indices 0 to 3 are stored in the length register access location register 18. The values stored in this register are the access locations in the length register 4.
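The index-to-bit mapping above can be written out as a small Python sketch. The function name `slot_bit_range` and its parameters are assumptions introduced for illustration only.

```python
def slot_bit_range(index, slot_bits=32, reg_bits=128):
    """Return (msb, lsb) of the 32-bit slot selected by `index`.

    Index 0 is the most significant slot, matching the text:
    index 0 covers bit 127 down to bit 96 of a 128-bit register.
    """
    slots = reg_bits // slot_bits
    if not 0 <= index < slots:
        raise ValueError("index outside the length register")
    msb = reg_bits - 1 - index * slot_bits
    return msb, msb - slot_bits + 1


ranges = [slot_bit_range(i) for i in range(4)]
```

Under these assumptions, indices 0 to 3 map to bits 127-96, 95-64, 63-32 and 31-0 respectively.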
Next, the controller 5 makes a request to access the cache memory 3 (step S3), and then waits until a cache access is finished (step S4). Then, the data read from the cache memory 3 is written into an address position indicated by the length register access location register 18 in the length register 4 (step S5).
Next, an amount corresponding to the output of the transfer count generator 22 is subtracted from the value stored in the present transfer count register 17 (step S6). This value indicates the number of untransferred bytes.
Next, the controller 5 judges whether transfers have been finished for the number of transfers stored in the transfer count register 12 (step S7). When the controller 5 judges in step S7 that the transfer has not been finished yet, the value of the memory address register 16 is increased to the address of the boundary position of the next 32 bits (step S8).
Next, the controller 5 judges whether the data transfer by a new memory address set in step S8 is the last data transfer and whether or not the amount of remaining data transfer (the amount of remaining transfer) is equal to or less than the number of bytes indicated by low 2 bits of the start address (step S9).
When the judgment in step S9 results in no, that is, when the data transfer is not the last data transfer or the amount of remaining transfer is greater than the number of bytes indicated by the low 2 bits of the start address, the controller 5 increases the length register access location register 18 by one (step S10), and returns to step S3. On the other hand, when the judgment in step S9 results in yes, that is, when the data transfer is the last data transfer and the amount of remaining transfer is equal to or less than the number of bytes indicated by the low 2 bits of the start address, the controller 5 initializes the length register access location register 18 to 0 (step S11), and returns to step S3.
Thus, the processing in step S11 is performed only when the data previously written is not overwritten even if data is rewritten from the head position of the length register access location register 18. This condition of performing no overwrite corresponds to the case where the amount of remaining transfer is equal to or less than the number of bytes indicated by the low 2 bits of the start address.
When the controller 5 judges in step S7 that transfers have been finished for the number of transfers stored in the transfer count register 12, the length register 4 is cyclically shifted in accordance with the value of the low 2 bits of the start address (step S12).
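Steps S1 to S12 can be simulated in software as a rough sketch. The Python below is not the hardware implementation; it is a behavioral model under several assumptions that the text does not state explicitly: memory is modeled as a `bytes` object indexed from 0 and long enough for every 4-byte read, the wrapped last write is masked to the remaining bytes only, the count subtracted in the last step is clamped to the remaining bytes, and the cyclic shift range is taken to be the register slots actually written. The name `load_unaligned` is hypothetical.

```python
def load_unaligned(memory, start, nbytes, reg_bytes=16):
    """Behavioral sketch of steps S1-S12 for one load instruction."""
    low2 = start & 3             # start address low bit column register 24
    addr = start                 # memory address register 16 (S2)
    remaining = nbytes           # present transfer count register 17 (S2)
    slot = 0                     # length register access location register 18 (S2)
    reg = bytearray(reg_bytes)   # length register 4
    slots_used = 0
    wrapped = False
    while True:
        base = addr & ~3
        word = memory[base:base + 4]              # S3-S4: one 32-bit read
        if wrapped:
            # S5 with masking (assumed): only the remaining bytes are
            # written, so the valid tail of slot 0 is preserved.
            reg[0:remaining] = word[:remaining]
        else:
            reg[4 * slot:4 * slot + 4] = word     # S5: full-word write
            slots_used = slot + 1
        step = min(4 - (addr & 3), remaining)     # transfer count generator 22
        remaining -= step                         # S6
        if remaining == 0:                        # S7
            break
        addr += step                              # S8: next 32-bit boundary
        if remaining <= low2:                     # S9: last transfer fits in
            slot, wrapped = 0, True               # S11: the unused head bytes
        else:
            slot += 1                             # S10
    span = 4 * slots_used                         # assumed cyclic shift range
    reg[:span] = reg[low2:span] + reg[:low2]      # S12: rotate left by low2 bytes
    return bytes(reg)


# 96-bit transfer whose start address has low 2 bits equal to 3.
memory = bytes(range(32))
result = load_unaligned(memory, 3, 12)
```

In this model the condition `remaining <= low2` can only hold when at most 3 bytes are left, so it already implies that the next transfer is the last one.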
The start address of the transfer data in
After the first data transfer has been finished, the value of the present transfer count register 17 is decreased by one to 11. The transfer count generator 22 calculates the value “1” of a difference between the start address and the boundary of the following 32 bits, and adds this difference value “1” in the adder 19, and then updates the memory address register 16 to 0X1000_0004.
Since the updated memory address 0X1000_0004 is not the last data, the value of the length register access location register 18 is increased by one in the adder 21, and a request to access the cache memory 3 is made again. Then, this time, “4, 5, 6, 7” of four bytes are read at a time starting from the one-byte data “4” in the cache memory 3 and stored in the length register 4 (
After the data transfer up to “7” has been finished, the value of the present transfer count register 17 is decreased by “4” to “7”. Since the initial address of the preceding data transfer is at the breakpoint of 32 bits, the transfer count generator 22 outputs “4” up to the next breakpoint. Then, “4” is added to the value of the memory address register 16, and the memory address is updated to 0X1000_0008. Further, “1” is added to the value of the length register access location register 18, and the value of the length register access location register 18 becomes 2.
Subsequently, the next 32-bit data “8, 9, a, b” are transferred (
The next data transfer is the last one, and data to be transferred are remaining 24-bit data “c, d, e”. In this case, the judgment in step S9 in
Therefore, in the case of
This completes the data transfer from the cache memory 3 to the length register 4. Next, the order changing computing unit 7 in
The order changing computing unit 7 performs the cyclic shift in accordance with the cyclic shift amount and the cyclic shift range. When the data before the cyclic shift is as shown in
Although the transfer of the 96-bit data has been described with
The cyclic shift amount is “3” and the cyclic shift range is 16 bytes in the case of
As described above, in the first embodiment, even if the start address of the transfer data deviates from the position of the boundary of 32 bits of the cache memory 3, the data transfer from the cache memory 3 to the length register 4 can be specified by only one instruction, so that the number of instructions can be reduced. Moreover, the transfer processing when the start address deviates from the boundary of 32 bits is performed by hardware, so the software need not consider whether the start address of the transfer data deviates from the boundary, which reduces the overhead required for the operation.
While the example has been described in the first embodiment where the start address of transfer data deviates from the position of the boundary of 32 bits, the internal configuration of the controller 5 can be simplified and the processing operation of the data transfer apparatus becomes simpler if the start address of the transfer data is always located at the boundary of 32 bits. Thus, in a second embodiment below, a data transfer apparatus will be described in the case where the start address of the transfer data is always located at the boundary of 32 bits.
The data transfer apparatus in
In the configuration of the controller 5 in
“4” is added to a memory address register 16 in an adder 19 every time data is transferred. “4” is also subtracted from a present transfer count register 17 in a subtracter 20 every time data is transferred.
Next, the controller 5 makes a request to access a cache memory 3 (step S23), and then waits until data of four bytes is read from the cache memory 3 (step S24).
Next, the read data is written into the position indicated by the value of the length register access location register 18 in a length register 4 (step S25). Then, “4” is subtracted from the value of the present transfer count register 17 (step S26).
Next, the controller 5 judges whether all the data transfers have been finished (step S27). If all the data transfers have not been finished yet, “4” is added to the value of the memory address register 16, and “1” is added to the value of the length register access location register 18 (step S28). Then, the processing after step S23 is carried out.
On the other hand, if the controller 5 judges in step S27 that all the data transfers have been finished, the processing in
b) represents the value of the length register 4 after the first data transfer. First 32 bits are stored at the position of a value 0 in the length register access location register 18. In the same manner,
Thus, in the second embodiment, sequential data having a width larger than the read unit of the cache memory 3 can be transferred to the length register 4 by one instruction, without specifying the data transfer by a plurality of instructions, so that software processing can be simplified. Moreover, as the data transfer processing is performed by hardware, data can be transferred at a high speed.
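The loop of steps S23 to S28 can be sketched as follows. This Python model is illustrative only: the name `load_aligned` is hypothetical, memory is modeled as a `bytes` object indexed from 0, and the transfer length is assumed to be a multiple of 4 bytes that fits in the length register.

```python
def load_aligned(memory, start, nbytes, reg_bytes=16):
    """Behavioral sketch of steps S23-S28 for an aligned start address."""
    if start % 4 or nbytes % 4 or not 0 < nbytes <= reg_bytes:
        raise ValueError("sketch assumes aligned start and a 4-byte multiple")
    reg = bytearray(reg_bytes)   # length register 4
    addr = start                 # memory address register 16
    remaining = nbytes           # present transfer count register 17
    slot = 0                     # length register access location register 18
    while True:
        reg[4 * slot:4 * slot + 4] = memory[addr:addr + 4]  # S23-S25
        remaining -= 4                                      # S26
        if remaining == 0:                                  # S27
            break
        addr += 4                                           # S28
        slot += 1
    return bytes(reg)


loaded = load_aligned(bytes(range(32)), 0, 16)
```

No masking and no cyclic shift are needed in this model, because every read starts at a 32-bit boundary.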
In a third embodiment, data in a rectangular area within a cache memory 3 is transferred to a length register 4.
The controller 5 in
The inter-row memory address amount setting register 31 stores the address of a difference between adjacent rows in the rectangular area to be transferred. The row width register 32 stores the row width in the rectangular area. The row count register 33 stores the number of rows in the rectangular area.
On receipt of such an instruction to start data transfer from a decoder 2 (step S41), the controller 5 stores in a memory address register 16 the start address stored in a start address register 11, stores in the row count register 38 the number of rows stored in the row count register 33, stores in the in-row transfer amount register 37 the row width stored in the row width register 32, and stores in the inter-row memory address position register 34 the difference address stored in the inter-row memory address amount setting register 31. The controller 5 also initializes a length register access location register 18 to 0 (step S42).
Next, the controller 5 sends to the cache memory 3 a request to read from a start address 0X1000_0000 in the memory address register 16 (step S43). In response to this, the cache memory 3 reads data of 32 bits from 0X1000_0000 in the same manner as the normal load instruction. The controller 5 waits until the reading of the data of 32 bits from the cache memory 3 finishes (step S44).
When the reading of the data of 32 bits is finished, the read data is stored in a position in the length register 4 indicated by the value (in this case, 0) stored in the length register access location register 18 (step S45).
Next, the number of transferred valid data bytes is subtracted from the value of the in-row transfer amount register 37 (step S46).
Next, it is judged whether data transfer for one row in the rectangular area has been finished (step S47). If it has not been finished yet, the value of the memory address register 16 is updated to the position of the boundary of the next 32 bits (step S48).
Next, it is judged whether the data transfer corresponding to the updated value of the memory address register 16 is the last data transfer of the row and whether the amount of remaining data transfer (the amount of remaining transfer) is equal to or less than the number of bytes indicated by low 2 bits of the start address (step S49). If it is not the last data transfer or if the amount of remaining transfer is greater than the number of bytes indicated by the low 2 bits of the start address, the length register access location register 18 is increased by one (step S50), and the processing after step S43 is carried out.
On the other hand, when the judgment in step S49 results in yes, that is, when the data transfer is the last data transfer and the amount of remaining transfer is equal to or less than the number of bytes indicated by the low 2 bits of the start address, the length register access location register 18 is initialized to “0” (step S51), and a return is made to step S43.
Thus, the processing in step S51 is performed only when the data previously written is not overwritten even if data is rewritten from the head position of the length register access location register 18. This condition of performing no overwrite corresponds to the case where the amount of remaining transfer is equal to or less than the number of bytes indicated by the low 2 bits of the start address.
When it is judged in step S47 that the data transfer for one row has been finished, “1” is subtracted from the row count register 38 (step S52).
Next, it is judged whether the data transfers for all the rows in the rectangular area have been finished (step S53). If not, the value of the memory address register 16 is updated to a value to which the value of the inter-row memory address position register 34 is added. Then, the in-row transfer amount register 37 is initialized, and the value of the length register access location register 18 is initialized to row width/4 (step S54). Then, the processing after step S43 is repeated.
On the other hand, when it is judged in step S53 that all the data transfers have been finished, the cyclic shift is carried out in accordance with the value of the low 2 bits of the start address of the transfer data (step S55), and all the processing is finished (step S56).
Before the start of data transfer, 0X1000_0003 is stored in the start address register 11, 4 (bytes) is stored in the row width register 32, “4” is stored in the row count register 33, and 0X0000_0100 is stored in the inter-row memory address amount setting register 31.
The setting of these registers may be carried out by issuing an instruction such as a store instruction or control register write instruction by software or may be carried out by using some hardware. When a load instruction targeting the length register 4 as a destination is decoded, the information is sent to the controller 5, and the controller 5 starts operation.
The controller 5 makes a request to read from an address 0X1000_0000 to the cache memory 3. The cache memory 3 reads data by 32 bits (4 bytes), and the read data is stored in a position in the length register 4 indicated by the length register access location register 18 (in this case, 0) (
Valid data in 32 bits of the address 0X1000_0000 is 1 byte of an address 0X1000_0003. Therefore, after the reading of the data of 1 byte, “1” is subtracted from the value of the in-row transfer amount register 37. First 3 bytes in the length register 4 will be overwritten later, so that any data may be stored at this moment.
When the first data transfer is finished, the memory address is updated to 0X1000_0004. Since the data transfer with this address is the last data transfer in the row, the value of the length register access location register 18 is set to the head position 0 of the row.
Furthermore, mask processing is performed by a mask controller 6 during the last data transfer in the row. In the case of the rectangular area 10 in
When such mask processing is performed, 3-byte data of “1, 2, 3” are stored before “0” in the length register 4, as shown in
This completes the data transfer for one row, and the value of the row count register 38 decreases by one to 3. When this register is not 0, untransferred rows remain. Therefore, the memory address register 16 is updated to a value 0X1000_0103 to which the value of the inter-row memory address position register 34 is added. Then, the in-row transfer amount register 37 is initialized to 4, and the length register access location register 18 is updated to a value (in this case, 1) to which 1 is added, and then an access request is made to the cache memory 3.
The data transfer for the second row of the rectangular area 10 in
Subsequently, similar processing is performed for the third and fourth rows of the rectangular area 10. When the data transfers up to the fourth row are finished, the value of the row count register 38 becomes 0, and the data transfer is finished.
Then, the order changing computing unit 7 shown in
The order changing computing unit 7 cyclically shifts the length register 4 to the left by 32 bytes on a 32-bit basis in accordance with the cyclic shift amount selected in
Although the example has been described with
Therefore, when the length register 4 in
Thus, in the third embodiment, the data in the rectangular area 10 located at an arbitrary portion within the cache memory 3 can be transferred to the length register 4 in a simple manner and at a high speed. In particular, in the third embodiment, a simple instruction is issued so that the data can be transferred by hardware at a high speed even if the start address of the rectangular area 10 is not located at the position of the boundary of 32 bits.
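The worked example of the third embodiment (rows of 4 bytes, start address with low 2 bits equal to 3) can be modeled as a rough sketch. The Python below is an illustration only and makes several assumptions: the name `load_rect_unaligned` is hypothetical, memory is a bytes-like object indexed from 0, each row is exactly 4 bytes wide and occupies one 32-bit slot, the second read of each row is masked to the bytes that the first read did not supply, and the cyclic shift is applied to each 32-bit slot separately.

```python
def load_rect_unaligned(memory, start, stride, rows, reg_bytes=16):
    """Behavioral sketch for a rectangular area with 4-byte rows."""
    if 4 * rows > reg_bytes:
        raise ValueError("area does not fit in the length register")
    low2 = start & 3
    reg = bytearray(reg_bytes)            # length register 4
    for row in range(rows):
        addr = start + row * stride       # row start (inter-row amount added)
        slot = row                        # head slot of this row
        base = addr & ~3
        # First read: only the last (4 - low2) bytes are valid row data.
        reg[4 * slot:4 * slot + 4] = memory[base:base + 4]
        if low2:
            # Last read of the row: masked write (assumed) of the remaining
            # low2 bytes over the invalid head bytes of the same slot.
            nxt = memory[base + 4:base + 8]
            reg[4 * slot:4 * slot + low2] = nxt[:low2]
    # Cyclic shift of every 32-bit slot to the left by low2 bytes.
    for slot in range(rows):
        w = reg[4 * slot:4 * slot + 4]
        reg[4 * slot:4 * slot + 4] = w[low2:] + w[:low2]
    return bytes(reg)


# Four rows of four bytes, 0x100 bytes apart, starting at offset 3.
mem = bytearray(0x400)
for r in range(4):
    for i in range(4):
        mem[3 + r * 0x100 + i] = 0x10 * (r + 1) + i
area = load_rect_unaligned(bytes(mem), 3, 0x100, 4)
```

Before the shift, each slot holds its row as bytes 1, 2, 3 followed by byte 0, which corresponds to the state described above where “1, 2, 3” are stored before “0”.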
While the example has been described in the third embodiment in which the start address of the transfer data in the rectangular form deviates from the position of the boundary of 32 bits, the internal configuration of the controller 5 can be simplified and the processing operation of the data transfer apparatus becomes simpler if the start address of transfer data is always located at the boundary of 32 bits. Thus, in a fourth embodiment below, a data transfer apparatus will be described in the case where the start address of the transfer data in the rectangular form is always located at the boundary of 32 bits.
The controller 5 in
Next, the read data is written into the position indicated by the value of a length register access location register 18 in a length register 4 (step S65). Then, “4” is subtracted from the value of an in-row transfer amount register 37 (step S66), and it is judged whether the data transfer for one row in the rectangular area 10 has been finished (step S67).
When the data transfer for one row has not been finished yet, “4” is added to the value of a memory address register 16 (step S68), and “1” is added to the value of the length register access location register 18 (step S69), and then the processing after step S63 is carried out.
When it is judged in step S67 that the data transfer for one row has been finished, “1” is subtracted from the value of a row count register (step S70), and it is judged whether the data transfers for all the rows in the rectangular area 10 have been finished (step S71). If not, the value of the memory address register 16 is set to a value to which the value of an inter-row memory address position register 34 is added, and the in-row transfer amount register 37 is initialized (step S72).
On the other hand, when it is judged in step S71 that the transfers of all the rows in the rectangular area 10 have been finished, the data transfer processing in
Next, 32-bit data in the second row in the rectangular area 10 is read and stored in the length register 4 (
Thus, in the fourth embodiment, the data in the rectangular area 10 in the cache memory 3 is transferred by hardware, so that the speed of the data transfer processing can be increased. Moreover, the transfer of the data in the rectangular area 10 can be specified by only one instruction, so that the burden on a programmer can be reduced.
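The row loop of steps S63 to S72 can be sketched in Python as follows. This is an illustrative model, not the circuit: the name `load_rect_aligned` is hypothetical, memory is a bytes-like object indexed from 0, the row width is assumed to be a multiple of 4 bytes, the inter-row amount is assumed to be measured from one row start to the next, and the register slot is assumed to advance by one at each row change as well.

```python
def load_rect_aligned(memory, start, stride, row_width, rows, reg_bytes=16):
    """Behavioral sketch of steps S63-S72 for an aligned rectangular area."""
    if start % 4 or row_width % 4 or row_width * rows > reg_bytes:
        raise ValueError("sketch assumes aligned rows that fit the register")
    reg = bytearray(reg_bytes)        # length register 4
    slot = 0                          # length register access location register 18
    row_addr = start
    for _ in range(rows):             # row count register 38 counts down (S70-S71)
        addr = row_addr               # memory address register 16
        remaining = row_width         # in-row transfer amount register 37
        while remaining > 0:          # S67
            reg[4 * slot:4 * slot + 4] = memory[addr:addr + 4]  # S63-S65
            remaining -= 4            # S66
            addr += 4                 # S68
            slot += 1                 # S69
        row_addr += stride            # S72: move to the next row
    return bytes(reg)


mem = bytes(i % 251 for i in range(0x200))
packed = load_rect_aligned(mem, 0, 0x100, 8, 2)
```

The two 8-byte rows, which are 0x100 bytes apart in memory, end up packed back to back in the 16-byte register.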
In a fifth embodiment, transposition processing for exchanging a column with a row in a length register 4 is carried out after data in a rectangular area 10 in a cache memory 3 has been transferred to the length register 4.
The internal configuration of a controller 5 shown in
In step S56, data are rearranged in the length register 4 after the cyclic shift in accordance with the row width of the rectangular area 10.
Thus, in the fifth embodiment, the transposition processing is specified by one instruction and carried out by hardware, so that overhead required for matrix operation can be lower than when the transposition processing is carried out by a normal instruction set.
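The effect of the transposition on the register contents can be illustrated with a short Python sketch. It assumes, without confirmation from the text, that each matrix element is one byte and that the rows are packed back to back in the length register; the name `transpose_bytes` is hypothetical.

```python
def transpose_bytes(reg, rows, row_width):
    """Exchange rows and columns of a rows x row_width byte matrix.

    `reg` holds the matrix row by row; the result holds it column by
    column. Bytes beyond rows * row_width are left unchanged.
    """
    if rows * row_width > len(reg):
        raise ValueError("matrix does not fit in the register")
    out = bytearray(reg)
    for r in range(rows):
        for c in range(row_width):
            out[c * rows + r] = reg[r * row_width + c]
    return bytes(out)


transposed = transpose_bytes(bytes(range(16)), 4, 4)
```

For a 4 x 4 area loaded as bytes 0 to 15, the first four bytes of the result are the first column of the original area, that is 0, 4, 8 and 12.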
While the transposition processing has been described in the fifth embodiment in the case where the start address of the rectangular area 10 is not located at the boundary of 32 bits, the transposition processing can also be performed after the cyclic shift in the case where the start address of the rectangular area 10 is located at the boundary of 32 bits (fourth embodiment).
When the fourth data transfer is finished, the transposition processing is performed as shown in
While the example has been described with
Furthermore,
Thus, in the sixth embodiment, the transposition processing can be performed in hardware to transfer the rectangular area 10 in the cache memory 3 to the length register 4 and rearrange the rows and columns of the rectangular area 10, so that the data transfer and the transposition processing can be specified by a simple instruction, and both faster processing and a simpler instruction can be achieved.
While the examples have been described in the above embodiments in which data is transferred from the cache memory 3 to the length register 4, the memory from which data is transferred does not necessarily have to be the cache memory 3, and various memories from which data stored therein can be read are applicable to such a memory.
Number | Date | Country | Kind |
---|---|---|---|
2006-259159 | Sep 2006 | JP | national |