The present invention relates to a method for inter-cluster communications, more particularly, the present invention relates to lessen the interconnection complexity of register files and to reduce the silicon area or power consumption of high-performance digital signal processors.
Modern multimedia and communication systems are apt to require capability of giga-operations per second. IC techniques today are able to easily integrate tens to hundreds of arithmetic units (AUs) into one processor, and when the processor is working on the clock frequency of hundreds of MEGA-Hz to some GIGA-Hz, the above requirement can be easily achieved. But the major design problem is on how to organize the data to flow smoothly among the parallel functional units (FUs) in limited data bandwidth.
Traditional RISC processors separate memory accesses from computations to lessen the complexity of this problem. But the extensibility of the centralized register file in its structure, which is in charge of the data exchange and buffering, is very bad, and has become the bottleneck of high-performance processor designs. Suppose that P ports are needed for N FUs. Then the silicon area, the access time, and the power consumption of a centralized register file containing n registers is to grow in direct ratio of about nP2 and n1/2P and nP2. n and N are approximately in direct ratio and P is about 3˜4 N, which means the growth rates of area, access time, and power consumption are N3 and N3/2 and N3 respectively. So, nowadays, centralized register file designs of a processor that contains 4 to 8 parallel FUs have covered almost a half of the processor core and its access time may be accomplished through more than one pipeline stage. The major key to a successful processor design is on how to design a register file of high efficiency and low power consumption.
Today, most efficient register file designs are by ways of partitioning, which means to partition the said centralized register file into several blocks to reduce the overall complexity. There are two ways for partitioning a register file:
1. Clustering
FUs are partitioned into several clusters, where the FUs in each cluster are to access the registers in the belonging cluster and the data exchanges between clusters are accomplished by extra interconnection network. Each cluster of symmetric partitioning usually has complete FUs, which is able to accomplish a given task independently, so that the data exchange is not frequent. Therefore, the inter-cluster communication is minimal. On the contrary, non-symmetric clusters need extensive data exchanges. For instance, the distributed register file (as shown in
Hierarchical register file is a very special case from non-symmetric partitioning (as shown in
Data Exchange Mechanisms Between Clusters:
Different ways of clustering require different data exchange mechanisms, which can be classified as the following three methods:
A. Copy Instructions (as Shown in
The inter-cluster communication is done by explicit “copy” instructions. It requires some extra ports of the register files in each cluster. One implementation is to use the existing slots for the copy instructions and thus to reuse the existing input (or output) ports of the register files. The drawback is that some FUs lie idle while executing the copy instructions. The other implementation is to use dedicated instruction slots at the cost of additional input and output ports. By the way, the extra slots might significantly increase the program size.
B. Extended Accesses (as Shown in
The FUs have limited read or write accesses to the register files of other clusters. The register file of each cluster needs to support the corresponding read or, write ports with extra external interconnection network and control.
C. Shared Storage (as Shown in
Each cluster has access ports connected to a common storage and data are exchanged through this shared storage.
2. Banking
The above techniques with FU clustering offer respective temporary registers for different computing clusters and use extra interconnection network for data exchange between the clusters. Yet this technique is by using the way how physical ports and logical ports are mapped to reduce the complexity of the register file, where each FU is able to access every register directly. For example, a centralized register file (i.e. requires P=3N) can be divided into N banks, and each bank has only 3 ports. It needs hardware stalls or software techniques to resolve the access conflicts.
The above methods all need extra ports and interconnection network to exchange data between clusters and they consume large silicon area and significant power. In addition, most of the above methods require redundant data movements, which waste more time and power.
The present invention divides a centralized register file into local and global registers. Global registers are to act as the communication mechanism between each cluster by way of permutation to eliminate the extra ports for inter-cluster communications. It is able to move data by permutation of the registers.
Another purpose of the present invention is to use it in a structure like high-performance DSP, which needs high data bandwidth so that the data moving between registers are greatly reduced to diminish power consumption. Moreover, the present invention is able to properly partition the register file, so as to reduce the silicon area and the access time.
To achieve the above goals, the present invention describes a method for the inter-cluster communication that employs register permutation, where the clusters exchange data by mapping the interconnection ports of the said global registers dynamically to the clusters via permutation. Each register block can be assigned only exclusively to a cluster, and thus it requires access ports for a single cluster. Because the data exchange is done by changing the port mapping only and it has nothing to do with the actual data movements, an inter-cluster communication mechanism with high bandwidth and low power consumption is achieved.
The present invention will be better understood from the following detailed descriptions of the preferred embodiments of the invention, taken in conjunction with the accompanying drawings, in which
The following descriptions of the preferred embodiments are provided to understand the features and the structures of the present invention.
Please refer to
The followings are two examples of the hardware embodiments:
( ) 2-Way VLIW Digital Signal Processor (DSP):
As shown in
( ) 4-Way VLIW DSP
As shown in
Remarks:
35 instruction cycles for 2 output; i.e. 17.5 cycle/output 66 taps/cycle SIMD MAC: MAC_V r0, r8, r9; r0=r0+r8.Hi*r9.Hi & r1=r1+r8.Lo*r9.Lo
This is an example of a 64-tap FIR filter, which generates 1024 results. The memory is half-word addressing, where the inputs and the outputs are stored as 16-bit fractional and 32-bit fixed-point numbers respectively. The inner loop (i7,i8) loads 4 16-bit inputs and 4 16-bit constants to 2 32-bit r8 registers and 2 32-bit r9 registers. The L/S units update the address registers r0, r1, and the AUs execute SIMD MAC operations simultaneously. After multiplying and accumulating 32 16-bit items with 40-bit accumulators, r0 and r1 are summed up and stored to the ring (global) register r8. In the end, r8 is stored to the memory through LS.
The preferred embodiment herein disclosed is not intended to unnecessarily limit the scope of the invention. Therefore, simple modifications or variations belonging to the equivalent of the scope of the claims and the instructions disclosed herein for a patent are all within the scope of the present invention.