1. Field of the Invention
The present invention relates to a data processing device whose memory is composed of a plurality of blocks, and to a method for processing data in such memory.
2. Description of the Related Art
Both the degree of integration and the speed of large-scale integrated circuits (LSIs), including microprocessors, have improved remarkably. As LSIs have become faster, the speed gap between an LSI and external memory, such as main storage, has widened. In order to bridge this gap, mounting a cache memory with a large capacity (that is, a large area) on an LSI has become popular.
In small devices requiring data processing capability, such as a cellular phone or a personal digital assistant (PDA), a processor and a main storage device are encapsulated in a single LSI. It can be easily predicted that, with the improvement of the degree of integration, the memory capacity of an LSI will continue to increase.
In conventional memory control, all accesses to large-capacity memory mounted on an LSI are made with a single latency (for example, see Patent References 1 and 2).
In this case, latency means the time from when a data request is issued until the requested data returns; as the unit of latency, the number of cycles of the clock used for circuit synchronization is used.
If single latency is used, no difference in latency occurs between an access to memory physically located remotely from a request source and an access to memory close to the request source. Main reasons for such control are as follows.
However, as semiconductor process technology has advanced and the speed (clock frequency) of LSIs has further increased, wiring delay within an LSI has become dominant, and the delay difference caused by the positions at which two segments of memory are disposed in an LSI can no longer be neglected. If control is still performed with a single latency in this situation, the wiring delay incurred when the farthest memory is accessed must inevitably be adopted for all accesses. In that case, the latency of every memory access becomes very long, degrading processing performance.
Each memory block comprises flip-flop circuits (FF) 21 and 22, random-access memory (RAM) 23 for storing data, and a selector 24.
Each of the FFs 21 and 22 functions as a one-stage (one-cycle) buffer circuit. The selector 24 selects either the output path from the RAM 23 in the same block or the output path from another, farther block, and outputs data from the selected path.
In this case, if the distance between the request source 11 and each block is converted into latency, the distance is expressed by the total number of FFs 21 included in both the path that transfers a data request from the request source 11 to the RAM 23 of the issuance destination and the path that transfers the data outputted from the RAM 23 back to the request source 11. In this example, the distances to blocks M1, M2, M3 and M4 are two, four, six and eight cycles, respectively.
If no difference in latency between blocks is to be allowed, the FF count of the farthest block M4 is adopted, and FFs 22 are added to the other blocks so that the number of FFs in each block equals that of M4. Accordingly, the average latency becomes as follows.
Therefore, the processing of a request to any memory block other than M4 is greatly delayed.
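The padding arithmetic above can be sketched numerically. The block names and per-block cycle counts below are taken from this example, not from an actual circuit.

```python
# Latency model of the conventional single-latency control described
# above: each block's distance in FF stages is its natural latency, but
# FFs 22 pad every block up to the farthest block M4 (example values).

block_latency = {"M1": 2, "M2": 4, "M3": 6, "M4": 8}

single_latency = max(block_latency.values())       # M4 decides: 8 cycles
padded = {name: single_latency for name in block_latency}
average = sum(padded.values()) / len(padded)
print(average)  # 8.0: every access pays the farthest block's latency
```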
It is an object of the present invention to provide a data processing device for improving memory access speed when large-capacity memory is mounted on a semiconductor integrated circuit, such as an LSI, and a method thereof.
The first data processing device of the present invention comprises a plurality of memory blocks, a plurality of transfer paths, and a selector.
Each of the plurality of memory blocks has a different latency for a data request issued from a request source. Each memory block receives the data request and outputs the requested data. Each of the plurality of transfer paths transfers data from one of these memory blocks to the request source. Then, the selector selects, from the plurality of transfer paths, the transfer path from the issuance destination memory block of the data request to the request source.
The second data processing device of the present invention comprises a plurality of cache memory blocks, a control circuit, a plurality of tag transfer paths, a plurality of data transfer paths, a first selector and a second selector.
Each of the plurality of cache memory blocks includes a tag memory that receives a data request issued from a request source and outputs the tag of the requested data, and a data memory that receives the data request and outputs the requested data; each cache memory block has a different latency for a data request. The control circuit performs cache control using an outputted tag.
Each of the plurality of tag transfer paths transfers a tag from each cache memory block to the control circuit. Each of the plurality of data transfer paths transfers data from each cache memory block to the request source.
The first selector selects, from these tag transfer paths, the tag transfer path from the issuance destination cache memory block of the data request to the control circuit. The second selector selects, from these data transfer paths, the data transfer path from the issuance destination cache memory block of the data request to the request source.
The preferred embodiments of the present invention are described in detail below with reference to the drawings.
In this preferred embodiment, memory in an LSI is divided into a plurality of blocks according to latency differences, so that a result can be returned quickly for an access to a block with short latency (a block physically close to the request source). Thus, the average latency is shortened by effectively using the latency difference, and accordingly the performance of the LSI can be improved.
The configuration of the data processing device in this preferred embodiment can be largely classified into six configurations as shown in
An application configuration 33 can be obtained by adding variable-length buffers not only to the block with the shortest latency but also to the blocks with longer latency of the basic configuration 31. In this case, a variable-length buffer with plural stages, capable of realizing the same latency as the longest latency, is added to each block.
Then, configurations 34, 35 and 36 indicate preferred embodiments in which configurations 31, 32 and 33, respectively, are extended and applied to a cache memory.
In the cache memory basic configuration 34, data and tags in cache memory are divided into blocks according to a latency difference. A cache memory application configuration 35 can be obtained by adding a variable-length buffer with one stage to the block with the shortest latency of the cache memory basic configuration. A cache memory application configuration 36 can be obtained by adding a variable-length buffer with plural stages to each block.
The specific example of each configuration is described below with reference to
If the basic configuration 31 of the present invention is applied to the LSI shown in
The request source 41 corresponds to, for example, a main pipeline, an arithmetic unit and the like in a central processing unit (CPU). The request source 41 issues a data request to each block of the memory 42, and receives data from the memory 42 via an output bus 51. In this case, since memory control is performed with a latency that differs for each block, there is no need for the FFs 22.
Since the latency of blocks M1, M2, M3 and M4 is two cycles, four cycles, six cycles and eight cycles, respectively, the average latency of memory access becomes as follows.
Therefore, performance is improved by three cycles, compared with the case shown in
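The comparison can be sketched in a few lines, with the cycle counts as given in the text.

```python
# Per-block latency control of the basic configuration 31: each block
# returns data at its own natural latency (example values from the text).

block_latency = {"M1": 2, "M2": 4, "M3": 6, "M4": 8}

avg_variable = sum(block_latency.values()) / len(block_latency)
avg_single = max(block_latency.values())    # conventional single latency
print(avg_variable)                # 5.0 cycles on average
print(avg_single - avg_variable)   # 3.0 cycles of improvement
```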
In this case too, the latency of blocks M1, M2, M3 and M4 is two, four, six and eight cycles, respectively, and the average latency becomes five cycles.
However, when data from blocks with different latencies is returned to the request source, attention must be paid to conflict in the output bus 51 due to the latency difference.
For example, as shown in
The simplest way to suppress this conflict is to delay the issuance of the subsequent request R2 by one cycle, as shown in
In order to realize such memory control, the following mechanism (circuit) is added to an LSI.
(a) Since latency is not fixed, an instruction mechanism for informing the request source that data is transferred asynchronously is needed. This instruction mechanism calculates the latency of each request according to the accessed block and, according to the result, notifies the request source that the data on the output bus 51 is valid.
(b) If there are consecutive requests to a plurality of blocks, each with different latency, conflict of data outputs on the output bus 51 must be avoided. For this purpose, in addition to the instruction mechanism mentioned above in (a) for calculating the latency of each request, a suppression mechanism is needed that stores the requests currently being executed and suppresses (delays) the issuance of a subsequent request if it determines that output conflict would occur.
Specific examples of the instruction mechanism and the suppression mechanism are described later. In
In this case, since the latency of the issuance destination of request R3 is four cycles, data would be outputted from the memory 42 in cycle 07 if request R3 were issued in cycle 04, and there would be no conflict with request R2. Nevertheless, since the issuance of request R2 is delayed, the issuance of the subsequent request R3 is also delayed, and the data is actually outputted in cycle 08. As a result, the substantial latency is also prolonged, and the entire throughput degrades.
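The delay propagation described here can be modeled with a small scheduler. The request stream, cycle numbering and latencies below are illustrative assumptions, not the exact figure.

```python
# Minimal sketch of the suppression mechanism in (b): before a request
# is issued, the cycle in which its data would occupy the shared output
# bus 51 is checked; if that cycle is already reserved, issuance is
# delayed. Note how delaying one request also pushes back the next one.

def schedule(requests, latency):
    """requests: list of (earliest_issue_cycle, block).
    Returns (issue_cycle, data_cycle) per request."""
    reserved = set()       # output-bus cycles already claimed
    next_free = 0          # at most one issuance per cycle
    result = []
    for earliest, block in requests:
        t = max(earliest, next_free)
        while t + latency[block] in reserved:
            t += 1         # suppress (delay) the issuance
        reserved.add(t + latency[block])
        next_free = t + 1
        result.append((t, t + latency[block]))
    return result

latency = {"M1": 2, "M2": 4, "M3": 6, "M4": 8}
# R1 to M2, R2 to M1 (would collide on the bus), then R3 to M2
print(schedule([(0, "M2"), (2, "M1"), (3, "M2")], latency))
# [(0, 4), (3, 5), (4, 8)] — R3's data slips one cycle later than its
# conflict-free cycle because R2's issuance was delayed
```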
Thus, data outputs from a plurality of blocks, each with different latency, can be adjusted by adopting the application configuration 32 instead of the basic configuration 31 shown in
(c) A variable-length buffer with one stage is added to the output of a memory block with the shortest latency.
(d) For an access to the memory block with the shortest latency, the following two kinds of determination are simultaneously performed by extending the function of the suppression mechanism mentioned above in (b).
If there is no output conflict when no buffer is used, the transfer path without the buffer is selected. However, if there is conflict when no buffer is used and no conflict when the buffer is used, the transfer path via the buffer is selected. If there is output conflict regardless of whether the buffer is used, the issuance of the request is delayed.
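The two simultaneous determinations in (d) amount to a three-way choice; a behavioral sketch, in which the function and variable names are assumptions:

```python
# Three-way determination of (d) for the shortest-latency block in
# application configuration 32: direct path, one-stage buffer path, or
# delayed issuance.

def choose_path(reserved, issue_cycle, latency):
    direct = issue_cycle + latency       # no buffer used
    buffered = direct + 1                # via the one-stage buffer (FF)
    if direct not in reserved:
        return ("no_buffer", direct)
    if buffered not in reserved:
        return ("buffer", buffered)
    return ("delay_issue", None)         # conflict either way

print(choose_path({4}, 2, 2))      # ('buffer', 5): direct slot is taken
print(choose_path({4, 5}, 2, 2))   # ('delay_issue', None)
```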
For example, if a variable-length buffer is added to block M1 with the shortest latency (two cycles) in
If the path via the FF 54 is selected, data output from block M1 can be delayed by one cycle. Therefore, the latency of block M1 becomes variable in the range of 2 through 3 cycles.
Thus, the issuance of requests R2 and R3 and the latency of data shown in
The above-mentioned application configuration 32 is a limited countermeasure in which a variable-length buffer is added only to the memory block with the shortest latency, in order to minimize the increase of devices. If the increase of devices is allowed, any situation can be coped with by further extending this configuration and preparing, for every block except the memory block with the longest latency, a variable-length buffer capable of filling in the difference from the longest latency. The application configuration 33 shown in
In the application configuration 33, a variable-length buffer that can prolong the latency of each memory block up to the longest latency is added to each memory block. Thus, the adjustment range of latency is expanded, and performance degradation due to output conflict can be completely prevented.
For example, if such a variable-length buffer is added to each of blocks M1 through M3 in
As shown in
This variable-length buffer can set four buffer lengths of zero stages, two stages, four stages and six stages. These buffer lengths can delay data output by zero cycles, two cycles, four cycles and six cycles, respectively. In the case of zero stages, the selector 61 selects input I2, and in the case of two stages, the selectors 61 and 62 select inputs I1 and I4, respectively. In the case of four stages, the selectors 61, 62 and 63 select inputs I1, I3 and I6, respectively, and in the case of six stages, the selectors 61, 62 and 63 select inputs I1, I3 and I5, respectively.
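Behaviorally, each of these settings is simply a shift register of the selected depth. A sketch follows; the class is illustrative, and only the depths 0, 2, 4 and 6 named above are allowed.

```python
from collections import deque

# Behavioral model of the variable-length buffer: the selected number of
# FF stages (0, 2, 4 or 6) delays the data output by that many cycles.

class VariableLengthBuffer:
    def __init__(self, stages):
        assert stages in (0, 2, 4, 6)
        self.ffs = deque([None] * stages)

    def clock(self, data_in):
        """Advance one cycle and return the (possibly delayed) output."""
        if not self.ffs:      # zero stages: bypass (selector 61 -> I2)
            return data_in
        self.ffs.append(data_in)
        return self.ffs.popleft()

buf = VariableLengthBuffer(2)
print([buf.clock(d) for d in ("d0", "d1", "d2", "d3")])
# [None, None, 'd0', 'd1'] — a two-cycle delay of the data stream
```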
As shown in
As shown in
By providing these variable-length buffers, the latencies of blocks M1, M2 and M3 become variable in the ranges of two through eight cycles, four through eight cycles and six through eight cycles, respectively, and every block can realize eight cycles, which is the latency of block M4. Since the longest latency of the memory 42 is eight cycles, there is no output conflict in any situation if data output is delayed by at most eight cycles.
The access signal A indicates that an access to the memory 42 can be performed in the case of logic “1”, and that the access cannot be performed in the case of logic “0”. The request source 41 delays the issuance of a request until the access signal A becomes logic “1”.
Block output selection signals O1 through O4 are used as the control signals of the selector 52. The selector 52 selects the transfer path from block Mi when signal Oi (i=1, 2, 3 and 4) becomes logic “1”.
A decoder 64 obtains the address of an issuance destination by decoding the request signal R, and outputs block selection signals S1 through S4. A signal Si (i=1, 2, 3 and 4) becomes logic “1” if the issuance destination is block Mi.
Signal S4 is inputted to a circuit in which eight FFs 54 are connected in series, and is outputted as signal O4 after eight cycles. The output of an AND circuit 65 becomes logic “1” if signal S3 is logic “1” and signal O4 is logic “0” after six cycles. The output of the AND circuit 65 is inputted to a circuit in which six FFs 54 are connected in series, and is outputted as signal O3 after six cycles.
The output of an AND circuit 66 becomes logic “1” if signal S2 is logic “1” and signals O3 and O4 both are logic “0” after four cycles. The output of the AND circuit 66 is inputted to a circuit in which four FFs 54 are connected in series, and is outputted as signal O2 after four cycles.
The output of an AND circuit 67 becomes logic “1” if signal S1 is logic “1” and signals O2, O3 and O4 all are logic “0” after two cycles. The output of the AND circuit 67 is inputted to a circuit in which two FFs 54 are connected in series, and is outputted as signal O1 after two cycles. Then, an OR circuit 68 outputs the logical sum of signal S4 and the outputs of the AND circuits 65 through 67 as an access signal A.
According to such an access input control circuit, a request whose issuance destination is block M4 is inputted to the memory without any processing. However, for a request whose issuance destination is a block other than M4, whether there is data output conflict with a preceding request is checked. If there is conflict, the issuance of the request is suppressed.
(1) The block identification information of the access destination is obtained from the address of the request. For the block identification information, for example, a block number is used. Once the block is known, the minimum latency it requires is known; it is assumed that this latency is n cycles. 0 is set as the initial value of the number m of stages in use of the variable-length buffer.
(2) Whether the output bus 51 is vacant after (n+m) cycles is checked from the output buffer reservation information. If the output bus 51 is not vacant, the process described below in (3) is performed. If the output bus 51 is vacant, the process described below in (4) is performed.
(3) 2 is added to m and the process mentioned above in (2) is performed.
(4) The number of stages of the variable-length buffer of the access destination block is set to m, and the data is accessed. The fact that data will be outputted after (n+m) cycles is added to the output buffer reservation information, and a subsequent request is awaited. Simultaneously, the obtained (n+m)-cycle value is notified to the data-valid flag response circuit 71.
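Steps (1) through (4) can be sketched directly. The reservation set and the cap on buffer stages below are illustrative assumptions.

```python
# Steps (1)-(4) of the variable-length buffer stage number selection:
# find the smallest even m such that the output bus is vacant (n + m)
# cycles from now, then reserve that output cycle.

def select_stages(block, latency, reservations, max_stages=6):
    n = latency[block]              # (1) minimum latency of the block
    m = 0                           #     initial number of stages in use
    while n + m in reservations:    # (2) bus vacant after n + m cycles?
        m += 2                      # (3) no: try two more stages
        if m > max_stages:
            raise RuntimeError("suppress issuance: no vacant slot")
    reservations.add(n + m)         # (4) reserve, access, notify
    return m, n + m                 #     m stages, data after n+m cycles

latency = {"M1": 2, "M2": 4, "M3": 6, "M4": 8}
print(select_stages("M1", latency, {2, 4}))  # (4, 6): cycles 2, 4 taken
```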
The data-valid flag response circuit 71 corresponds to an example of the above-mentioned instruction mechanism, and transfers a data-valid flag to the request source 41 after (n+m) cycles. Thus, the request source 41 is notified that the data on the output bus 51 is valid after (n+m) cycles.
A circuit in which eight FFs 54 are connected in series forms a preceding request display bit map and stores the output buffer reservation information. A timing signal OUT outputted from the FF 54 at the final stage becomes logic “1” in a cycle in which data is outputted.
Buffer stage number selection signals C1-0 through C1-6 are used as the control signals of the variable-length buffer 55 of block M1. When signal C1-i (i=0, 2, 4 and 6) is logic “1”, i-stages of buffer length is set in the variable-length buffer 55. However, in
Although, in
The output of an AND circuit 91 becomes logic “1” if the following two conditions are met.
The output of the AND circuit 91 is inputted to the second last FF 54, and is outputted as signal OUT after two cycles.
The output of an AND circuit 92 becomes logic “1” if the following three conditions are met.
The output of the AND circuit 92 is inputted to the third last FF 54, and is outputted as signal OUT after three cycles. An OR circuit 96 outputs the logical sum of the respective outputs of the AND circuits 91 and 92 as a buffer stage number selection signal C1-0.
According to such a circuit, if the output bus 51 is vacant after two cycles, the buffer length of the variable-length buffer 55 is set to zero stages. Even when the output bus 51 is not vacant after two cycles, if it is vacant after three cycles, the buffer length is still set to zero stages; in this case, if the output of the requested data is delayed by one cycle, there is no output conflict.
The output of an AND circuit 93 becomes logic “1” if the following four conditions are met.
An OR circuit 85 outputs the logical sum of the output of the AND circuit 93 and the outputs of the AND circuits, which are not shown, of the other blocks. The output of the OR circuit 85 is inputted to the fourth last FF 54, and is outputted as signal OUT after four cycles.
The output of an AND circuit 94 becomes logic “1” if the following five conditions are met.
An OR circuit 84 outputs the logical sum of the output of the AND circuit 94 and the outputs of the AND circuits, which are not shown, of the other blocks. The output of the OR circuit 84 is inputted to the fifth last FF 54, and is outputted as signal OUT after five cycles.
An OR circuit 97 outputs the logical sum of the respective outputs of the AND circuits 93 and 94 as a buffer stage number selection signal C1-2.
According to such a circuit, if the output bus 51 is vacant after four cycles, the buffer length of the variable-length buffer 55 is set to two stages. Even when the output bus 51 is not vacant after four cycles, if it is vacant after five cycles, the buffer length is still set to two stages; in this case, if the output of the requested data is delayed by one cycle, there is no output conflict.
The output of an AND circuit 95 becomes logic “1” if the following seven conditions are met.
An OR circuit 81 outputs the logical sum of the output of the AND circuit 95 and the outputs of the AND circuits for the other blocks, which are not shown in
According to such a circuit, if the output bus 51 is not vacant at any point from two through seven cycles later, the buffer length of the variable-length buffer 55 is set to six stages. In this case, since the latency becomes the longest, eight cycles, there is no output conflict.
Similarly, OR circuits 82 and 83 output the logical sums of the respective outputs of the AND circuits which are not shown in
According to such a variable-length buffer stage number selection circuit 72, an optimal buffer length can be selected, according to the block number of an issuance destination and the data output timing of a preceding request. Therefore, the conflict of data outputs can be prevented while utilizing a latency difference between blocks.
The selectors 61, 62 and 63 are controlled by a selection signal C (corresponding to signals C1-0 through C1-6) from the variable-length buffer stage number selection circuit 72 in the same way as in the variable-length buffer shown in
The timing signal OUT shown in
In the configuration shown in
The configuration shown in
The above-mentioned basic configuration 31 and application configurations 32 and 33 are used for general memory. In the case of cache memory, not only data but also tags can have the same latency difference. The cache memory basic configuration 34 and the cache memory application configurations 35 and 36 can be obtained by extending and applying the basic configuration 31 and the application configurations 32 and 33, respectively, shown in
When applying the present invention to a cache memory in an LSI, the structure of the tags must be taken into consideration. If the amount of tags is small compared with the data and the tags of all blocks can be disposed near the request source, the tags can be handled by the basic configuration 31 and the application configurations 32 and 33. However, if the amount of tags is not negligibly small, the tags must be distributed. Therefore, the cache memory basic configuration 34 is applied to a large-capacity cache memory by adding the following components/functions.
(e) Data is distributed and disposed for each cache line. Thus, tags can also be distributed and disposed for each block.
(f) The suppression mechanism mentioned above in (b) is extended. If there is conflict of data outputs on the output bus or conflict of tag outputs, the issuance of a request is suppressed.
In cache memory, the validity of data, such as the hit/miss of a cache line, is determined using the output of a tag. If the suppression mechanism mentioned above in (f) is not provided, control logic for determining/processing the tag output of each block is needed. For example, a cache miss may cause a plurality of requests requiring an external access; in such a case, new control and a new circuit for arbitrating those requests are needed. Therefore, control becomes easier if the suppression mechanism mentioned above in (f) is adopted.
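A sketch of the extended check in (f). The assumption that a block's tag latency is one cycle shorter than its data latency follows the later example configuration, and all names are illustrative.

```python
# Extended suppression mechanism (f): a request is issued only if
# neither its tag output cycle nor its data output cycle collides with
# an already-scheduled request.

def can_issue(block, cycle, data_latency, tag_reserved, data_reserved):
    data_cycle = cycle + data_latency[block]
    tag_cycle = data_cycle - 1      # tag assumed to return a cycle early
    return (tag_cycle not in tag_reserved
            and data_cycle not in data_reserved)

data_latency = {"C1": 2, "C2": 4, "C3": 6, "C4": 8}
print(can_issue("C2", 0, data_latency, {3}, set()))  # False: tag clash
print(can_issue("C2", 1, data_latency, {3}, set()))  # True
```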
Each cache memory block comprises an FF 21, tag RAM 111 and data RAM 112, and outputs tags and data, according to a request from the request source 41.
A selector 103 selects one of the tag transfer paths from the four blocks, and outputs the tag of the selected path to a cache control circuit 102. Upon receipt of the tag, the cache control circuit 102 performs the hit/miss determination of the tag, and controls the operation of the cache memory 101 according to the result of the determination. A selector 52 selects one of the data transfer paths from the four blocks, and outputs the data of the selected path to the output bus 51.
Such a configuration, in which the tag section and data section of the cache are integrated, has the following implementation advantages.
(1) Repeatability
Another cache memory block can be easily generated by duplicating one cache memory block.
(2) Localization of Delay Analysis
If delay analysis is applied to one cache memory block, the result of the analysis can be applied to another cache memory block.
In the configuration shown in
Here it is assumed as in
In order to prevent such performance degradation, the cache memory application configuration 35 is used. In this configuration, a variable-length buffer with one stage as in
If a variable-length buffer as in
In the variable-length buffer on the output side of the tag RAM 111, the selector 53 selects either a path that transfers the tag directly from the tag RAM 111 or a path that transfers it via the FF 54. In the variable-length buffer on the output side of the data RAM 112, the selector 53 selects either a path that transfers data directly from the data RAM 112 or a path that transfers it via the FF 54.
According to such a configuration, scheduling shown in
In the cache memory application configuration 36, a variable-length buffer that can prolong the latency of each cache memory block up to the longest latency is added to both the tag output and the data output of each cache memory block. Thus, any situation can be coped with, and the best average latency can be obtained.
For example, if such a variable-length buffer is added to each tag RAM 111 and data RAM 112 of blocks C1 through C3 in
On each output side of the tag RAM 111 and data RAM 112 of block C1, a variable-length buffer 55 is provided, and on each output side of the tag RAM 111 and data RAM 112 of block C2, a variable-length buffer 56 is provided. On each output side of the tag RAM 111 and data RAM 112 of block C3, a variable-length buffer 57 is provided.
The respective configurations and operations of the variable-length buffers 55, 56 and 57, the data-valid flag response circuit 71 and the variable-length buffer stage number selection circuit 72 are already described above. In this case, two variable-length buffers in each block are controlled by the same selection signal from the variable-length buffer stage number selection circuit 72, and the selectors 103 and 52 are also controlled by the same selection signal.
By providing these variable-length buffers, the tag latencies of blocks C1, C2 and C3 become variable in the ranges of one to seven cycles, two to seven cycles and five to seven cycles, respectively, and every block can realize seven cycles, which is the tag latency of block C4. Since the longest tag latency of the cache memory 101 is seven cycles, there is no conflict of tag outputs in any situation if tag output is delayed by at most seven cycles. The adjustment range of data latency is the same as in
In the configuration shown in
In this example, only a path for transferring a request from the CPU CORE 121 to the data RAM 112 of each block and a path for transferring data from each data RAM 112 to the CPU CORE 121 are shown, and the tag RAM and a transfer path accompanying it are omitted. However, each block is also provided with these circuits as in the configuration shown in
However, as is clear from the physical disposition, block C1 is the closest to the CPU CORE 121 and block C4 is the farthest. Therefore, for the CPU CORE 121, the shortest data latencies of blocks C1, C2, C3 and C4 are two, four, six and eight cycles, respectively.
Conversely, block C1 is the farthest from a CPU CORE 124 and block C4 is the nearest. Therefore, for the CPU CORE 124, the shortest data latencies of blocks C1, C2, C3 and C4 are eight, six, four and two cycles, respectively.
From a CPU CORE 122, block C2 is the nearest, blocks C1 and C3 are the second nearest, and block C4 is the farthest. Therefore, the shortest data latencies of blocks C1, C2, C3 and C4 are four, two, four and six cycles, respectively.
From a CPU CORE 123, block C3 is the nearest, blocks C2 and C4 are the second nearest, and block C1 is the farthest. Therefore, the shortest data latencies of blocks C1, C2, C3 and C4 are six, four, two and four cycles, respectively.
According to such a CMP configuration, as to each of a plurality of processors that share memory on a chip, the average latency of memory access can be optimized.
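The four per-core latency lists above follow a simple distance rule: two cycles per block of physical distance, plus the two-cycle minimum. A sketch, in which the position indices for the cores and blocks are assumptions consistent with the disposition described:

```python
# Shortest data latency in the CMP example: grows by two cycles per
# block of physical distance between a CPU CORE and a cache block.

core_pos = {"CPU CORE 121": 0, "CPU CORE 122": 1,
            "CPU CORE 123": 2, "CPU CORE 124": 3}
block_pos = {"C1": 0, "C2": 1, "C3": 2, "C4": 3}

def shortest_latency(core, block):
    return 2 + 2 * abs(core_pos[core] - block_pos[block])

for core in core_pos:
    print(core, [shortest_latency(core, b) for b in ("C1", "C2", "C3", "C4")])
# CPU CORE 121 sees [2, 4, 6, 8] and CPU CORE 124 sees [8, 6, 4, 2],
# matching the per-core values in the text
```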
According to the present invention, when large-capacity memory is mounted on a semiconductor integrated circuit, the speed of memory access can be improved by utilizing the latency difference according to the storage position of data.
This is a continuation of International Application No. PCT/JP02/09290, which was filed on Sep. 11, 2002.
 | Number | Date | Country
---|---|---|---
Parent | PCT/JP02/09290 | Sep 2002 | US
Child | 11059472 | Feb 2005 | US