1. Field of the Invention
The present invention is related to processing systems and processors, and more specifically to a method of operating a pipelined processor core with reconfigurable architecture.
2. Description of Related Art
In present-day processor cores, pipelines are used to execute multiple hardware threads corresponding to multiple instruction streams, so that more efficient use of processor resources can be provided through resource sharing and by allowing execution to proceed even while one or more hardware threads are waiting on an event.
In existing systems, specific resources and pipelines are typically provided in a given processor design and the execution resource types are fixed. In many instances, particular types of execution resources may be absent from certain processor cores, while other processor core types may have different execution resources. In some instances, resources within a processor core will remain unused except when needed on rare occasions, consuming die area that might otherwise be used to increase processor core performance.
It would therefore be desirable to provide methods for processing program instructions that provide improved use of the processor core resources.
The invention is embodied in a method of operation of a processor core.
The processor core includes multiple parallel instruction execution slices for executing multiple instruction streams in parallel and multiple dispatch queues coupled by a dispatch routing network to the execution slices according to a dispatch control logic that dispatches the instructions of the plurality of instruction streams via the dispatch routing network to issue queues of the plurality of parallel instruction execution slices. The processor core also includes a mode control logic controlled by a mode control signal that reconfigures a relationship between the parallel instruction execution slices such that in a first configuration, when the mode control signal is in a first state, at least two of the execution slices are independently operable for executing one or more hardware threads on each slice. In a second configuration, when the mode control signal is in a second state, the at least two parallel instruction execution slices are linked for executing instructions of a single thread.
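The two configurations described above can be illustrated with a short behavioral sketch. It is a software model only and not the claimed hardware: the names `Mode` and `assign_threads` are hypothetical, and the round-robin placement of threads in the first configuration is an assumption made for illustration, since the text does not specify a placement policy.

```python
from enum import Enum


class Mode(Enum):
    INDEPENDENT = 0  # first state: slices execute separate hardware threads
    LINKED = 1       # second state: slices are linked for a single thread


def assign_threads(mode, slices, threads):
    """Return a mapping from slice name to the thread ids it may execute."""
    if mode is Mode.LINKED:
        # All linked slices execute instructions of the same single thread.
        return {s: [threads[0]] for s in slices}
    # Independent operation: one or more threads per slice.
    # Round-robin placement is an assumption, not taken from the text.
    mapping = {s: [] for s in slices}
    for i, thread in enumerate(threads):
        mapping[slices[i % len(slices)]].append(thread)
    return mapping
```

In this model, changing the mode value alone changes which threads each slice serves, which mirrors the role of the mode control signal.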
The foregoing and other objectives, features, and advantages of the invention will be apparent from the following, more particular, description of the preferred embodiment of the invention, as illustrated in the accompanying drawings.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives, and advantages thereof, will best be understood by reference to the following detailed description of the invention when read in conjunction with the accompanying Figures, wherein like reference numerals indicate like components, and:
The present invention relates to processors and processing systems in which conventional pipelines are replaced with execution slices that can be reconfigured to efficiently allocate subsets of resources based on one or more thread mode control signals that may select between single-threaded mode, multi-threaded mode and different numbers of simultaneously executing hardware threads. The mode control signal may also select between configurations that combine two or more execution slices to form larger super-slices for handling wider operand operations, wider operators or vector operations.
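As one illustration of how combined slices might handle a wider operand, the following sketch models a 128-bit addition performed by two linked 64-bit slices with a carry passed between them. The 64-bit slice width, the carry hand-off and the names `slice_add` and `super_slice_add` are assumptions chosen for the example; the text does not state how a super-slice divides the work.

```python
SLICE_BITS = 64                 # assumed operand width of one slice
MASK = (1 << SLICE_BITS) - 1


def slice_add(a, b, carry_in=0):
    """Add two slice-width values; return (sum, carry_out)."""
    total = a + b + carry_in
    return total & MASK, total >> SLICE_BITS


def super_slice_add(a, b):
    """Add two 128-bit values using two linked slices.

    The low slice handles bits 0-63 and forwards its carry to the
    high slice, which handles bits 64-127.
    """
    lo, carry = slice_add(a & MASK, b & MASK)
    hi, carry_out = slice_add(a >> SLICE_BITS, b >> SLICE_BITS, carry)
    return (hi << SLICE_BITS) | lo, carry_out
```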
Referring now to
Referring now to
The load-store portion of the instruction execution cycle (i.e., the operations performed to maintain cache consistency as opposed to internal register reads/writes) is performed by a plurality of load-store (LS) slices LS0-LS7, which manage load and store operations as between instruction execution slices ES0-ES7 and a cache memory formed by a plurality of cache slices CS0-CS7 which are partitions of a lowest-order cache memory. Cache slices CS0-CS3 are assigned to partition CLA and cache slices CS4-CS7 are assigned to partition CLB in the depicted embodiment and each of load-store slices LS0-LS7 manages access to a corresponding one of the cache slices CS0-CS7 via a corresponding one of dedicated memory buses 40. In other embodiments, there may not be a fixed partitioning of the cache, and individual cache slices CS0-CS7 or sub-groups of the entire set of cache slices may be coupled to more than one of load-store slices LS0-LS7 by implementing memory buses 40 as a shared memory bus or buses. Load-store slices LS0-LS7 are coupled to instruction execution slices ES0-ES7 by a write-back (result) routing network 37 for returning result data from corresponding cache slices CS0-CS7, such as in response to load operations. Write-back routing network 37 also provides communications of write-back results between instruction execution slices ES0-ES7. An address generating (AGEN) bus 38 and a store data bus 39 provide communications for load and store operations to be communicated to load-store slices LS0-LS7. For example, AGEN bus 38 and store data bus 39 convey store operations that are eventually written to one of cache slices CS0-CS7 via one of memory buses 40 or to a location in a higher-ordered level of the memory hierarchy to which cache slices CS0-CS7 are coupled via an I/O bus 41, unless the store operation is flushed or invalidated. AGEN bus 38 and store data bus 39 are shown as a single bus line in the Figures for clarity.
Load operations that miss one of cache slices CS0-CS7 after being issued to the particular cache slice CS0-CS7 by one of load-store slices LS0-LS7 are satisfied over I/O bus 41 by loading the requested value into the particular cache slice CS0-CS7 or directly through cache slice CS0-CS7 and memory bus 40 to the load-store slice LS0-LS7 that issued the request. In the depicted embodiment, any of load-store slices LS0-LS7 can be used to perform a load-store operation portion of an instruction for any of instruction execution slices ES0-ES7, but that is not a requirement of the invention. Further, in some embodiments, the determination of which of cache slices CS0-CS7 will perform a given load-store operation may be made based upon the operand address of the load-store operation together with the operand width and the assignment of the addressable byte of the cache to each of cache slices CS0-CS7.
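The selection of cache slices from the operand address and operand width can be sketched as follows. The assignment of eight consecutive bytes to each slice, the rotation through eight slices and the name `slices_for_access` are assumptions made for the example; the text leaves the byte assignment open.

```python
def slices_for_access(address, width, num_slices=8, bytes_per_slice=8):
    """Return the indices of the cache slices touched by an access.

    Bytes are assumed to be assigned to slices in consecutive units of
    bytes_per_slice, rotating through num_slices slices.
    """
    first_unit = address // bytes_per_slice
    last_unit = (address + width - 1) // bytes_per_slice
    return sorted({unit % num_slices for unit in range(first_unit, last_unit + 1)})
```

Under these assumptions an aligned 8-byte access touches one slice, while an unaligned access of the same width spans two slices, which is why the operand width enters the determination along with the address.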
Instruction execution slices ES0-ES7 may issue internal instructions concurrently to multiple pipelines, e.g., an instruction execution slice may simultaneously perform an execution operation and a load/store operation and/or may execute multiple arithmetic or logical operations using multiple internal pipelines. The internal pipelines may be identical, or may be of discrete types, such as floating-point, scalar, load/store, etc. Further, a given execution slice may have more than one port connection to write-back routing network 37, for example, a port connection may be dedicated to load-store connections to load-store slices LS0-LS7, or may provide the function of AGEN bus 38 and/or data bus 39, while another port may be used to communicate values to and from other slices, such as special-purpose slices, or other instruction execution slices. Write-back results are scheduled from the various internal pipelines of instruction execution slices ES0-ES7 to write-back port(s) that connect instruction execution slices ES0-ES7 to write-back routing network 37. Cache slices CS0-CS7 are coupled to a next higher-order level of cache or system memory via I/O bus 41 that may be integrated within, or external to, processor core 20. While the illustrated example shows a matching number of load-store slices LS0-LS7 and execution slices ES0-ES7, in practice, a different number of each type of slice can be provided according to resource needs for a particular implementation.
Within processor core 20, an instruction sequencer unit (ISU) 30 includes an instruction flow and network control block 57 that controls dispatch routing network 36, write-back routing network 37, AGEN bus 38 and store data bus 39. Network control block 57 also coordinates the operation of execution slices ES0-ES7 and load-store slices LS0-LS7 with the dispatch of instructions from dispatch queues Disp0-Disp7. In particular, instruction flow and network control block 57 selects between configurations of execution slices ES0-ES7 and load-store slices LS0-LS7 within processor core 20 according to one or more mode control signals that allocate the use of execution slices ES0-ES7 and load-store slices LS0-LS7 by a single thread in one or more single-threaded (ST) modes, and multiple threads in one or more multi-threaded (MT) modes, which may be simultaneous multi-threaded (SMT) modes. For example, in the configuration shown in
In another configuration, according to another state of the mode control signal(s), clusters CLA and CLB are configured to execute instructions for a common pool of threads, or for a single thread in an ST mode. In such a configuration, cache slices CS0-CS7 may be joined to form a larger cache that is accessible by instructions dispatched to any of execution slices ES0-ES7 via any of load-store slices LS0-LS7. Cache slices CS0-CS7 may be organized into a partitioned cache, for example by using the operand address of each cache operation to determine which of cache slices CS0-CS7 or sub-groups of cache slices CS0-CS7 should support an operation. For example, cache lines may be split across sub-groups of cache slices CS0-CS3 and CS4-CS7, such that a particular bit of the operand address selects which of the two groups of cache slices CS0-CS3 and CS4-CS7 will contain the specified value, forming an interleave of cache lines. For example, cache slices CS0-CS3 may store data values having odd cache line addresses and cache slices CS4-CS7 may store data values having even cache line addresses. In such a configuration, the number of unique cache line addresses indexed within the cache may be held constant when selecting between modes in which the cache slices CS0-CS7 are partitioned among sets of threads and modes in which cache slices CS0-CS7 are joined. In another example, data may be “striped” across cache slices CS0-CS7 using three bits of the operand address to determine a target one of cache slices CS0-CS7, forming an interleave mapping with a factor of 8. The above-illustrated examples are not exhaustive, and there are many different ways to assign data values to particular ones of cache slices CS0-CS7. For example, certain block or vector operations may deterministically span cache slices CS0-CS7 or sub-groups thereof, permitting early-decode-based assignment to one of execution slices ES0-ES7 or as among clusters CLA or CLB.
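The two interleave examples above can be expressed as simple address mappings. The sketch assumes a 128-byte cache line and assumes that the selecting bits are the lowest bits of the cache line address; neither value is given in the text, and the names `line_address`, `slice_group` and `striped_slice` are hypothetical.

```python
LINE_BYTES = 128  # assumed cache line size; not specified in the text


def line_address(operand_address):
    """Cache line address: the operand address with the line offset removed."""
    return operand_address // LINE_BYTES


def slice_group(operand_address):
    """Two-way interleave: one address bit selects the group of cache slices.

    Odd cache line addresses map to CS0-CS3 and even ones to CS4-CS7.
    """
    return "CS0-CS3" if line_address(operand_address) & 1 else "CS4-CS7"


def striped_slice(operand_address):
    """Eight-way interleave: three address bits select one of CS0-CS7."""
    return line_address(operand_address) & 0b111
```

With either mapping, consecutive cache lines land on different slices or groups, so sequential accesses are spread across the joined cache.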
Dispatch queues Disp0-Disp7 and/or execution slices ES0-ES7 may determine the appropriate target one (or more) of cache slices CS0-CS7 for an operation based on the operation type, address generation, a prediction structure, or other mechanisms. In one such exemplary embodiment of an operating mode, operations having odd operand addresses will be identified for processing on load-store slices LS0-LS3 only and cache slices CS0-CS3 are joined to only contain values representing odd addresses. Similarly, in such an exemplary embodiment of an operating mode, operations having even operand addresses are identified for processing by load-store slices LS4-LS7 only and cache slices CS4-CS7 only contain values representing even addresses. In the above-described configuration, cache slices CS0-CS7 may be conceptually joined; however, certain implementations such as vector or cache block operations do not require a full cross-bar routing between all load-store slices LS0-LS7, execution slices ES0-ES7 and cache slices CS0-CS7. In other configurations according to other modes, and/or in other embodiments of the invention, cache slices CS0-CS7 may be further partitioned to support SMT operations with four, eight, etc., independent partitions available to pools of hardware threads, as the illustrated embodiment having eight execution slices, eight load-store slices and eight cache slices is only illustrative and larger numbers of slices or clusters may be present in other embodiments of the invention.
Referring now to
Referring now to
Referring now to
Referring now to
Execution slice 42AA includes multiple internal execution pipelines 74A-74C and 72 that support out-of-order and simultaneous execution of instructions for the instruction stream corresponding to execution slice 42AA. The instructions executed by execution pipelines 74A-74C and 72 may be internal instructions implementing portions of instructions received over dispatch routing network 32, or may be instructions received directly over dispatch routing network 32, i.e., the pipelining of the instructions may be supported by the instruction stream itself, or the decoding of instructions may be performed upstream of execution slice 42AA. Execution pipeline 72 is illustrated separately multiplexed to show that single-pipeline, multiple-pipeline or both types of execution units may be provided within execution slice 42AA. The pipelines may differ in design and function, or some or all pipelines may be identical, depending on the types of instructions that will be executed by execution slice 42AA. For example, specific pipelines may be provided for address computation, scalar or vector operations, floating-point operations, etc. Multiplexers 77A-77C provide for routing of execution results to/from history buffer 76 and routing of write-back results to write-back routing network 37, I/O routing network 39 and AGEN routing network(s) 38 that may be provided for routing specific data for sharing between slices or operations, or for load and store address and/or data sent to one or more of load-store slices LS0-LS7. Data, address and recirculation queue (DARQ) 78 holds execution results or partial results such as load/store addresses or store data that are not guaranteed to be accepted immediately by the next consuming load-store slice LS0-LS7 or execution slice ES0-ES7. 
The results or partial results stored in DARQ 78 may need to be sent in a future cycle, such as to one of load-store slices LS0-LS7, or to special execution units such as one of cryptographic processors 34A, 34B. Data stored in DARQ 78 may then be multiplexed onto AGEN bus 38 or store data bus 39 by multiplexers 77B or 77C, respectively.
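The buffering role of DARQ 78 can be modeled as a queue that holds entries until their consumer is able to accept them. The sketch below is a behavioral model under stated assumptions: the class name `Darq`, the strict first-in first-out ordering and the `can_accept` callback are hypothetical, and the text does not state the order in which queued entries are sent.

```python
from collections import deque


class Darq:
    """Holds addresses, store data or partial results until they can be sent."""

    def __init__(self):
        self._entries = deque()

    def push(self, kind, payload, target):
        """Queue an entry, e.g. kind="agen" or kind="store_data"."""
        self._entries.append((kind, payload, target))

    def drain(self, can_accept):
        """Send the oldest entries in order for one cycle.

        Stops at the first entry whose target is not ready, so later
        entries wait for a future cycle (FIFO order is an assumption).
        """
        sent = []
        while self._entries and can_accept(self._entries[0][2]):
            sent.append(self._entries.popleft())
        return sent

    def __len__(self):
        return len(self._entries)
```

An entry whose target load-store slice is busy in a given cycle simply remains queued and is offered again on the next call to `drain`.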
Referring now to
While the invention has been particularly shown and described with reference to the preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form and details may be made therein without departing from the spirit and scope of the invention.
The present application is a Continuation of U.S. patent application Ser. No. 14/594,716, filed on Jan. 12, 2015 and claims priority thereto under 35 U.S.C. § 120. The disclosure of the above-referenced parent U.S. patent application is incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
20160202991 A1 | Jul 2016 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 14594716 | Jan 2015 | US |
Child | 14723940 | US |