The present disclosure relates generally to computer architecture and, more particularly, to computer processors.
Very long instruction word (VLIW) processors are known in the art, an example of which is shown in
Given this architecture, instructions enter the instruction decoder 105 from an external source. The instruction decoder 105 converts the received instructions into a decoded internal format that is wider but easier to process. The decoded instructions are subsequently used to control the operation of the data path components, which include the input/output buffer 130, the register file 110, and the functional units 120. Since the operation of conventional processors is known in the art, only a truncated discussion of such processors is provided herein.
The register file 110, which holds temporary working data, can be accessed more quickly than external memory. The functional units (or issue slots) 120 perform the actual computational work associated with the processor.
The control sequencing hardware 115, the register file 110, and the functional units 120 are shown in greater detail in
Specifically,
Each functional unit 322, 324, 326, 328 has two read ports, through which the functional unit receives data, and a single write port, through which the functional unit outputs data. In other words, for the example in
If the register file 310 is a sixty-four (64) entry, thirty-two (32) bit register file, then six (6) bits are required to address any one of its 64 entries. Each of the four operations specifies two source registers and one destination register (three 6-bit address fields) together with a two (2) bit operation field, for twenty (20) bits per operation. The processor would therefore operate on 80-bit instruction words (designated herein as INST[79:0]). For example, the values of R1 through R8 (values that appear on each of the read ports of the register file 310), W1 through W4 (values that appear on each of the write ports of the register file 310), and the control bits for each of the functional units can be represented as:
R1=INST[79:74]
R2=INST[73:68]
W1=INST[67:62]
A1C=INST[61:60]
R3=INST[59:54]
R4=INST[53:48]
W2=INST[47:42]
A2C=INST[41:40]
R5=INST[39:34]
R6=INST[33:28]
W3=INST[27:22]
M1C=INST[21:20]
R7=INST[19:14]
R8=INST[13:8]
W4=INST[7:2]
M2C=INST[1:0]
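The field layout listed above can be illustrated with a short Python sketch that packs and unpacks an 80-bit word held in an ordinary integer. The bit positions are copied from the listing; the names FIELDS, extract, decode, and encode are hypothetical helpers introduced only for this illustration and are not part of the disclosure.

```python
# Field layout copied from the listing above: name -> (high bit, low bit) of INST[79:0].
FIELDS = {
    "R1": (79, 74), "R2": (73, 68), "W1": (67, 62), "A1C": (61, 60),
    "R3": (59, 54), "R4": (53, 48), "W2": (47, 42), "A2C": (41, 40),
    "R5": (39, 34), "R6": (33, 28), "W3": (27, 22), "M1C": (21, 20),
    "R7": (19, 14), "R8": (13, 8), "W4": (7, 2), "M2C": (1, 0),
}


def extract(inst: int, high: int, low: int) -> int:
    """Return bits inst[high:low] (inclusive) as an unsigned integer."""
    width = high - low + 1
    return (inst >> low) & ((1 << width) - 1)


def decode(inst: int) -> dict:
    """Split an 80-bit instruction word into its sixteen named fields."""
    if not 0 <= inst < (1 << 80):
        raise ValueError("instruction word must fit in 80 bits")
    return {name: extract(inst, hi, lo) for name, (hi, lo) in FIELDS.items()}


def encode(fields: dict) -> int:
    """Inverse of decode: pack named field values into an 80-bit word.

    Fields that are not supplied default to zero (an assumption of this sketch).
    """
    inst = 0
    for name, (hi, lo) in FIELDS.items():
        value = fields.get(name, 0)
        if not 0 <= value < (1 << (hi - lo + 1)):
            raise ValueError(f"{name} does not fit in its field")
        inst |= value << lo
    return inst
```

Encoding a set of field values and decoding the result returns the same values, and the sixteen field widths sum to 80 bits, which matches the instruction width stated above.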
Given the 64-entry, 32-bit register file 310 of
(64 entries)×(32 bits)×(8 read ports+4 write ports)=24576 bits
As is known, each instruction in a VLIW processor usually contains several operand address fields per operation. Given the large instruction width of such processors, the cost of on-chip instruction storage increases while the efficiency of off-chip instruction fetch decreases. This is often the primary limiting factor in system performance. For at least this reason, there is a heretofore-unaddressed need in the industry.
Some embodiments, among others, provide schemes in which the register space of a processor is modified to permit greater access to registers by instructions. Such modifications permit shorter instruction words for multiple-issue devices, thereby reducing instruction fetch bandwidth and, correspondingly, on-chip costs associated with the storage of the instruction words.
Other systems, devices, methods, features, and advantages will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present invention, and be protected by the accompanying claims.
Many aspects of the disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.
Reference is now made in detail to the description of the embodiments as illustrated in the drawings. While several embodiments are described in connection with these drawings, there is no intent to limit the invention to the embodiment or embodiments disclosed herein. On the contrary, the intent is to cover all alternatives, modifications, and equivalents.
Instructions in very long instruction word (VLIW) processors usually contain several operand address fields per operation. These high instruction widths result in increased costs for on-chip storage of such instructions. Correspondingly, the high instruction widths decrease system efficiency. This is often the primary limiting factor in system performance.
Techniques are disclosed in which components are arranged in a specific configuration within the processor space, thereby permitting multiple operations within a single clock cycle with shorter instruction words. The shortened instruction words are accommodated by adding hardware components to the register space of the processor, together with an additional preprocessing step that the compiler can execute during code generation.
As shown in
To make a fairly even comparison, the global register file 210 of
{(4 local)×(16 entries)×[(32 bits)×(2 read ports+1 write port)]}+{(1 global)×(16 entries)×[(32 bits)×(1 read port+1 write port)]}=7168 bits
compared to the conventional processor shown in
(64 entries)×(32 bits)×(8 read ports+4 write ports)=24576 bits.
Thus, the processor of
Furthermore, as shown in
Thus, unlike the conventional structure of
the processor, in the embodiment of
Correspondingly, the size of each instruction (i.e., number of bits in each instruction) decreases by approximately fifteen percent (15%).
More generally, for embodiments that execute k instructions in a single clock cycle, if the operation field is j bits and there are n registers in each register file, then the size of the instruction word would be:
and the corresponding on-chip cost for the register files in the k-instruction-word processor would be:
(k local register files)×(n registers/file×m bits×[2 read+1 write])+(1 global register file)×(n registers/file×m bits×[1 read+1 write])
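The cost expression above can be checked numerically. The following Python sketch is an illustration only: the function names conventional_cost_bits and split_cost_bits are hypothetical, and the cost metric (entries × bits × ports) is the one used in this description rather than a physical area model.

```python
def conventional_cost_bits(entries: int, bits: int,
                           read_ports: int, write_ports: int) -> int:
    """Cost of a single multi-ported register file: entries x bits x total ports."""
    return entries * bits * (read_ports + write_ports)


def split_cost_bits(k: int, n: int, m: int) -> int:
    """Cost of k local files (2 read + 1 write) plus one global file (1 read + 1 write).

    k: number of functional units, n: registers per file, m: bits per register.
    """
    local = k * (n * m * (2 + 1))
    global_file = 1 * (n * m * (1 + 1))
    return local + global_file


if __name__ == "__main__":
    conventional = conventional_cost_bits(64, 32, 8, 4)
    split = split_cost_bits(k=4, n=16, m=32)
    print(conventional, split, round(split / conventional, 3))
```

With k=4, n=16, and m=32 the split arrangement evaluates to 7168 bits, against 24576 bits for the 64-entry file with eight read ports and four write ports, or roughly 29% of the conventional figure under this metric.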
To achieve such a reduction in on-chip silicon area, and the corresponding increase in processor efficiency, the layout of the various register files is modified, thereby providing greater access to register space than conventional processors allow. One embodiment of the modified layout, as shown in
Part of a general k-instruction-word processor is shown in
Corresponding to the k functional units, the k-instruction-word processor includes k 2×1 MUXes. Each 2×1 MUX corresponds to one of the functional units, and each 2×1 MUX has a first input, a second input, and an output that is electrically coupled to the second read input of its corresponding functional unit. Additionally, each 2×1 MUX includes a control line CLa . . . CLd for selecting one of the two inputs. The control lines CLa . . . CLd are connected to an instruction decoder 260. Again, for
Along with the k 2×1 MUXes, the k-instruction-word processor includes k local register files, each of which corresponds to a respective 2×1 MUX and a respective functional unit. Each of the k local register files comprises n registers, and each register is capable of storing m bits. It should be appreciated that n and m are non-negative integers. Preferably, n is a power of 2. Each local register file has at least two read ports. The first local read port is electrically coupled to the first read input of its corresponding functional unit, while the second local read port is electrically coupled to the first input of its corresponding 2×1 MUX. In addition to the read ports, each register file has a local write port, which is electrically coupled to the write output of its corresponding functional unit. Shown in
The k-instruction-word processor further includes a k-input-one-output (k×1) MUX. Each of the k inputs is electrically coupled to one of the k functional units, such that the write outputs of the k functional units appear on the corresponding k inputs of the k×1 MUX. The k×1 MUX has a control line CLk connected to the instruction decoder 260 for selecting one of the k inputs. The selected input is placed on the output of the k×1 MUX. Thus, given k=4, the output of the first functional unit 240a is placed on the first input of the 4×1 MUX 250; the output of the second functional unit 240b is placed on the second input of the 4×1 MUX 250; the output of the third functional unit 240c is placed on the third input of the 4×1 MUX 250, and so on.
The k-instruction-word processor further includes a global register file of n registers. Each of the n registers is an m-bit register. The global register file has a global read port and a global write port. The global write port is coupled to the output of the k×1 MUX, and, hence, the output of the k×1 MUX appears at the write port of the global register file. The read port of the global register file is electrically coupled to the second input of each 2×1 MUX.
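The connections described above (k local register files, k 2×1 MUXes, a k×1 MUX, and a single-read, single-write global register file) can be sketched as a small behavioral model in Python. This is a minimal illustration under stated assumptions, not the disclosed hardware: the SlotOp control format, the Processor class, the unconditional local write-back, the optional global write, and the read-before-write ordering within a cycle are all hypothetical choices made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SlotOp:
    """Decoded control for one functional unit (hypothetical format)."""
    func: Callable[[int, int], int]  # operation performed by the functional unit
    src_a: int        # local register feeding the first read input
    src_b: int        # local register feeding the first input of the 2x1 MUX
    use_global: bool  # 2x1 MUX select: True routes the global read port instead
    dst: int          # local register that receives the write output


class Processor:
    def __init__(self, k: int = 4, n: int = 16, m: int = 32):
        self.k = k
        self.mask = (1 << m) - 1
        self.local = [[0] * n for _ in range(k)]  # k local register files
        self.glob = [0] * n                       # one global register file

    def step(self, ops: List[SlotOp], global_src: int = 0,
             global_dst: Optional[int] = None, select: int = 0) -> List[int]:
        """Execute one cycle: k operations, at most one global write."""
        if len(ops) != self.k:
            raise ValueError("one operation per functional unit is required")
        # Single global read port, shared by the second input of every 2x1 MUX.
        g = self.glob[global_src]
        results = []
        for rf, op in zip(self.local, ops):
            a = rf[op.src_a]
            b = g if op.use_global else rf[op.src_b]
            results.append(op.func(a, b) & self.mask)
        # Write back after all reads, so every unit sees start-of-cycle state.
        for rf, op, value in zip(self.local, ops, results):
            rf[op.dst] = value
        # The k x 1 MUX selects one write output for the global write port.
        if global_dst is not None:
            self.glob[global_dst] = results[select]
        return results
```

In this model a value produced by one functional unit reaches another unit only through the global register file: it is selected by the k×1 MUX in one cycle and read back through a 2×1 MUX in a later cycle.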
While the configuration of
The performance of the k-instruction-word processor of
As shown in
Thus, for k-instruction-word processors, a k-read-k-write global register file would be employed. Correspondingly the on-chip cost would be calculated as:
(k local register files)×(n registers/file×m bits×[2 read+1 write])+(1 global register file)×(n registers/file×m bits×[k read+k write])
and, more specifically, for a 4-instruction-word processor, the on-chip cost would be:
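As a rough numerical illustration of the k-read-k-write cost expression, the following Python sketch evaluates it for k=4. The values n=16 and m=32 are assumptions carried over from the earlier 16-entry, 32-bit example and are not stated for this embodiment; the function name multiport_global_cost_bits is hypothetical.

```python
def multiport_global_cost_bits(k: int, n: int, m: int) -> int:
    """Cost of k local files (2 read + 1 write) plus one k-read, k-write global file."""
    local = k * (n * m * (2 + 1))
    global_file = 1 * (n * m * (k + k))
    return local + global_file


if __name__ == "__main__":
    # Assumed parameters: 4 functional units, 16 registers per file, 32 bits each.
    print(multiport_global_cost_bits(k=4, n=16, m=32))
```

Under these assumed values the expression evaluates to 6144 + 4096 = 10240 bits, which is still well below the 24576 bits computed for the conventional 64-entry register file.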
As shown in the embodiments of
Although exemplary embodiments have been shown and described, it will be clear to those of ordinary skill in the art that a number of changes, modifications, or alterations to the invention as described may be made. For example, while embodiments having 32-bit registers are disclosed, those having skill in the art will appreciate that registers of various sizes can be used without adverse effect to the scope of the invention. For instance, 8-bit registers, 16-bit registers, 24-bit registers, 64-bit registers, 128-bit registers, or any arbitrary m-bit register, where m is an integer, can easily be substituted for the 32-bit registers.
Likewise, while 64 registers are shown in a preferred embodiment, it should be appreciated that the number of registers can be varied to accommodate various design needs. In that regard, it should be appreciated that the number of register banks is not intended as a limitation, but merely provided for illustrative purposes, and that n register banks, where n is an integer, can be implemented.
Similarly, while a machine that issues four (4) instructions per clock cycle is shown, those having skill in the art will appreciate that machines that issue multiple instructions, regardless of the number of concurrently-issued instructions, can be designed using the disclosed structures. In that regard, it should be appreciated that any k-instruction machine, where k is an integer, can be designed in accordance with the disclosed embodiments. Moreover, while instructions having 2-bit operation fields are shown in the disclosed embodiments, those having skill in the art will appreciate that any j-bit operation field, where j is an integer, can easily be implemented. It should also be appreciated that j, k, m, and n can be any integer value. In that regard, j, k, m, and n can be different integer values or, in some cases, can be the same integer value.
Also, given the disclosed architecture, one having ordinary skill in the art will be able to determine the corresponding preprocessing steps required by the compiler during code generation. Hence, those details are not discussed herein.
All such changes, modifications, and alterations should therefore be seen as within the scope of the disclosure.
Number | Date | Country |
---|---|---|
20060095735 A1 | May 2006 | US |