Certain VLSI processor architectures now group execution units into clusters to process bundled instructions. A “bundle” of instructions contains three instructions; a cluster processes one or more bundles of instructions.
Certain VLSI processor architectures also use “multi-threading” techniques to process instructions through pipeline stages.
The invention advances the state of the art in processing architectures such as those described above.
In one aspect, the invention processes bundles of instructions preferentially through clusters such that bypassing is substantially maintained within a single cluster. In another aspect, the invention processes bundles of instructions preferentially through multiple clusters, with bypassing therebetween, to increase “per thread” performance. The cluster architectures of the invention thus preferably include the capability to process “multi-threaded” instructions.
In one preferred aspect, the invention provides a “configurable” processor architecture that operates in one of two modes: in a “wide” mode of operation, the processor's internal clusters collectively process bundled instructions of one thread of a program at the same time; in a “throughput” mode of operation, those clusters independently process instruction bundles of separate program threads. The invention of this aspect thus provides the advantage of flexibly operating with either (a) a high degree of parallelism (i.e., in the “throughput” mode) or (b) a high degree of single-threaded performance (i.e., in the “wide” mode). A user desiring maximum single-thread performance can therefore preferentially select the wide mode; another user desiring to process many orders simultaneously and in real time (e.g., in a business such as an airline company) can preferentially select the throughput mode.
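By way of a minimal sketch only, and assuming a two-cluster processor, the trade-off between the two modes may be modeled as follows; the enum, function names, and printed values below are illustrative assumptions and are not part of the specification.

```python
from enum import Enum

class Mode(Enum):
    WIDE = 0         # clusters collectively process one thread's bundles
    THROUGHPUT = 1   # each cluster processes bundles from its own thread

def clusters_per_thread(mode: Mode, num_clusters: int) -> int:
    """Clusters cooperating on a single thread's bundles."""
    return num_clusters if mode is Mode.WIDE else 1

def concurrent_threads(mode: Mode, num_clusters: int) -> int:
    """Threads whose bundles can be in flight at the same time."""
    return 1 if mode is Mode.WIDE else num_clusters

# Wide mode: maximum single-thread performance (both clusters on one thread).
# Throughput mode: maximum parallelism (one independent thread per cluster).
for mode in (Mode.WIDE, Mode.THROUGHPUT):
    print(mode.name, clusters_per_thread(mode, 2), concurrent_threads(mode, 2))
```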
The invention is next described further in connection with preferred embodiments, and it will become apparent that various additions, subtractions, and modifications can be made by those skilled in the art without departing from the scope of the invention.
A more complete understanding of the invention may be obtained by reference to the accompanying drawings.
Program instructions are decoded in the thread decode unit 130. Depending on the configuration bit, decode unit 130 detects and then distributes bundled instructions to program counters 104 according to the threads associated with the instructions. If the configuration bit is set to wide mode, then bundled instructions from the same thread are processed through multiple clusters 102 at the same time. If the configuration bit is set to throughput mode, then bundled instructions from one thread are processed through one program counter 104, and through a corresponding cluster 102; bundled instructions from other threads are likewise processed through another program counter and cluster pair 104, 102. An instruction memory 132 may optionally function to store bundled instructions, or to multiplex bundled instructions by and between different program counters 104 and different clusters 102, as a matter of design choice.
By way of example, in the throughput mode, three instructions from a single thread are bundled, by thread decode unit 130, and then processed through program counter and cluster 104(1), 102(1); three instructions from another thread are bundled, by thread decode unit 130, and processed through program counter and cluster 104(2), 102(2).
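A hedged sketch of this dispatch behavior follows. The reference numerals track the description above, but the function names, data structures, and the particular thread-to-cluster assignment are assumptions made for illustration only.

```python
# Illustrative model of thread decode unit 130 routing bundles to program
# counter/cluster pairs 104/102; all identifiers here are hypothetical.
from typing import Dict, List, Tuple

BUNDLE_SIZE = 3  # three instructions per bundle

def make_bundles(instructions: List[str]) -> List[Tuple[str, ...]]:
    """Group one thread's instruction stream into three-instruction bundles."""
    return [tuple(instructions[i:i + BUNDLE_SIZE])
            for i in range(0, len(instructions), BUNDLE_SIZE)]

def dispatch(mode: str, threads: Dict[int, List[str]], num_clusters: int = 2):
    """Route bundles to clusters according to the configuration bit (mode)."""
    routing: Dict[int, List[Tuple[str, ...]]] = {c: [] for c in range(num_clusters)}
    if mode == "throughput":
        # Each thread's bundles go to one dedicated program counter/cluster pair.
        for tid, insts in threads.items():
            routing[tid % num_clusters].extend(make_bundles(insts))
    else:  # "wide": one thread's bundles are spread across all clusters at once
        for tid, insts in threads.items():
            for i, bundle in enumerate(make_bundles(insts)):
                routing[i % num_clusters].append(bundle)
    return routing

threads = {0: [f"t0_i{n}" for n in range(6)], 1: [f"t1_i{n}" for n in range(6)]}
print(dispatch("throughput", threads))    # thread 0 -> cluster 0, thread 1 -> cluster 1
print(dispatch("wide", {0: threads[0]}))  # thread 0's bundles spread across both clusters
```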
Each cluster 102 includes several pipelines and pipeline-stage execution units so as to simultaneously perform, for example, the fetch (F), decode (D), execute (E), and write-back (W) stages on multiple instructions within a bundle.
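Assuming, for illustration, that F, D, E, and W denote fetch, decode, execute, and write-back, and that a cluster dedicates one pipeline to each instruction slot of a bundle, a minimal sketch of this lock-step staging is:

```python
# Hypothetical sketch: a cluster provides one pipeline per instruction slot of
# a bundle, so the instructions of one bundle move through the F, D, E, and W
# stages together.
STAGES = ["F", "D", "E", "W"]

def bundle_schedule(bundle):
    """Per cycle, the stage occupied by every instruction of the bundle."""
    return [{inst: stage for inst in bundle} for stage in STAGES]

for cycle, occupancy in enumerate(bundle_schedule(("add", "load", "branch"))):
    print(f"cycle {cycle}: {occupancy}")
# cycle 0: all three instructions in F; cycle 3: all three in W.
```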
Each core 302 functions as a cluster, in accordance with the invention. In the wide mode, one thread may, for example, execute four bundles through both cores 302; inter-cluster communication occurs, with cycle delays, through multiplexers 312. In the wide mode, for example, core 302A may execute instructions corresponding to even program counter steps 0, 2, 4, etc., and core 302B may execute instructions corresponding to odd program counter steps 1, 3, 5, etc. These cycle delays through multiplexers 312 are avoided when architecture 300 operates in the throughput mode, since instruction bundles of a common thread are executed only on a single core 302. The following sketch illustrates how four bundles may be processed through architecture 300:
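(This sketch is a hypothetical model consistent with the even/odd program-counter assignment described above; the core labels, bundle names, and delay annotations are assumptions for illustration only.)

```python
# Hypothetical model of architecture 300: two cores (302A, 302B) act as
# clusters; multiplexers 312 add a cycle of delay only to results forwarded
# between the cores. Core labels and bundle names below are illustrative.
def wide_mode(bundles):
    """One thread: even program-counter steps run on core 302A, odd on 302B.
    Results needed by the other core incur the mux 312 cycle delay."""
    return [(pc, bundle, "302A" if pc % 2 == 0 else "302B")
            for pc, bundle in enumerate(bundles)]

def throughput_mode(threads):
    """Each thread's bundles stay on a single core; no mux 312 delay occurs."""
    plan = []
    for tid, bundles in enumerate(threads):
        core = "302A" if tid % 2 == 0 else "302B"
        plan.extend((tid, pc, bundle, core) for pc, bundle in enumerate(bundles))
    return plan

four_bundles = ["B0", "B1", "B2", "B3"]
print(wide_mode(four_bundles))                        # B0,B2 -> 302A; B1,B3 -> 302B
print(throughput_mode([["B0", "B1"], ["B2", "B3"]]))  # thread 0 -> 302A; thread 1 -> 302B
```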
The invention thus attains the features set forth above, among those apparent from the preceding description. Since certain changes may be made in the above methods and systems without departing from the scope of the invention, it is intended that all matter contained in the above description or shown in the accompanying drawing be interpreted as illustrative and not in a limiting sense. It is also to be understood that the following claims are to cover all generic and specific features of the invention described herein, and all statements of the scope of the invention which, as a matter of language, might be said to fall there between.