Instructions in microprocessors are often re-dispatched for execution one or more times due to pipeline stalls or data hazards. For example, an instruction may need to be re-dispatched when it refers to a result that has not yet been calculated or retrieved. Because it is not known at dispatch whether an unpredicted pipeline stall will arise during execution of an instruction, execution of the instruction may lead to a runahead mode of operation configured to detect other misses while the initial miss is being resolved.
In modern microprocessors, architectural-level instructions are often divided into micro-operations for execution in a pipeline. Such micro-operations may be dispatched individually or as bundles of micro-operations to various execution mechanisms in the microprocessor. When one or more micro-operations are dispatched, it is not known whether execution of a given micro-operation will complete; put another way, it is not known whether a miss or an exception will arise during execution of the micro-operation. In some examples, if a micro-operation does not complete, it may be re-executed after the unexpected stall is resolved. Because other misses may arise during re-execution, a micro-operation may be re-executed several times before it completes.
A common pipeline execution stall that may arise during execution of a bundle is a load operation that results in a cache miss. Such cache misses may trigger entry into a runahead mode of operation (hereafter referred to as “runahead”) configured to detect, for example, other cache misses, instruction translation lookaside buffer misses, or branch mispredicts while the initial load miss is being resolved. As used herein, runahead describes virtually any suitable speculative execution scheme resulting from a long-latency event, such as a cache miss whose resulting load pulls the missing instruction or data from a slower-access memory location. Once the initial load miss is resolved, the microprocessor exits runahead and the instruction is re-executed. However, re-fetching the instruction from the instruction or unified cache may slow processor operation. Accordingly, various embodiments are disclosed herein that are related to re-dispatching an instruction selected for re-execution from a buffer upon the microprocessor resuming execution at a particular location after runahead. In one example, a microprocessor is provided. The example microprocessor includes fetch logic, one or more execution mechanisms for executing a retrieved instruction provided by the fetch logic, and scheduler logic for scheduling the instruction for execution. The example scheduler logic includes a buffer for storing the retrieved instruction and one or more additional instructions, the scheduler logic being configured, upon the microprocessor resuming execution at a particular location after runahead, to re-dispatch, from the buffer, an instruction that has been previously dispatched to one of the execution mechanisms.
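By way of a non-limiting illustration, the following C sketch models the runahead behavior described above: an initial load miss triggers entry into runahead, speculative execution during runahead uncovers additional misses so that their memory accesses overlap with the initial one, and the missed instruction is replayed once the initial miss resolves. The program is purely a software analogy; its names (`cached`, `pc`) and its treatment of every prefetch as instantly successful are assumptions made for brevity, not features of the embodiments described herein.

```c
/* Software analogy of runahead (illustrative only): a miss triggers
 * speculative execution that discovers and prefetches later misses,
 * after which the missed instruction is replayed. */
#include <stdbool.h>
#include <stdio.h>

#define N 6

int main(void) {
    /* true = data already cached; false = the load would miss */
    bool cached[N] = { true, false, true, false, true, true };

    for (int pc = 0; pc < N; pc++) {
        if (!cached[pc]) {
            printf("load miss at %d: enter runahead\n", pc);
            /* Speculatively run ahead to uncover additional misses. */
            for (int s = pc + 1; s < N; s++) {
                if (!cached[s]) {
                    printf("  secondary miss at %d: prefetch issued\n", s);
                    cached[s] = true;   /* latency overlapped, not serialized */
                }
            }
            cached[pc] = true;          /* initial miss resolves */
            printf("exit runahead: replay %d from the buffer\n", pc);
            pc--;                       /* re-dispatch without a re-fetch */
        } else {
            printf("execute %d\n", pc);
        }
    }
    return 0;
}
```

Because the second missing load is prefetched during the first runahead episode, it hits when re-reached, which is the performance benefit runahead aims for.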
A memory controller 110G may be used to handle the protocol and provide the signal interface required by main memory 110D and to schedule memory accesses. The memory controller may be implemented on the processor die or on a separate die. It is to be understood that the memory hierarchy provided above is non-limiting and other memory hierarchies may be used without departing from the scope of this disclosure.
Microprocessor 100 also includes a pipeline, illustrated in simplified form in
As shown in
Scheduler logic 124 includes a buffer 126, comprising checkpointed buffer 126A and non-checkpointed buffer 126B, for storing one or more instructions. As instructions enter scheduler logic 124, they are queued in buffer 126. The instructions are held in buffer 126 even after the instructions are dispatched to execution logic 130. Thus, it may be possible to re-dispatch previously dispatched instructions from the buffer in response to pipeline discontinuities that cause an instruction to fail to complete after dispatch, such as a load that misses in a data cache. Such re-dispatch may thereby be performed without re-fetching the instructions from outside of the buffer. An instruction may thus be re-dispatched or “re-played” one or more times until it is determined that the instruction achieves a completed state, at which time the instruction may be logically and/or physically removed from the buffer.
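A minimal C sketch of this retain-until-complete behavior appears below. The entry layout and function names (`Entry`, `dispatch`, `on_exec_result`) are hypothetical and do not correspond to any particular embodiment; the sketch shows only that a dispatched entry stays resident so a miss can trigger a replay from the buffer rather than a re-fetch. In hardware, the completion signal would come from the execution mechanisms or the memory logic; here it is simulated by a direct call.

```c
/* Minimal sketch (names hypothetical) of a scheduler buffer whose entries
 * survive dispatch so they can be replayed without a re-fetch. */
#include <stdbool.h>
#include <stdio.h>

#define BUF_SIZE 8

typedef struct {
    int  op;          /* stand-in for a decoded instruction or bundle */
    bool dispatched;  /* sent to an execution mechanism at least once */
    bool completed;   /* safe to logically remove from the buffer     */
} Entry;

static Entry buf[BUF_SIZE];

/* Dispatch (or re-dispatch) the entry at index i.  The entry is NOT
 * removed: if execution later reports a miss, the same slot is replayed. */
static void dispatch(int i) {
    buf[i].dispatched = true;
    printf("dispatch op %d from slot %d\n", buf[i].op, i);
}

/* Execution feedback: completion allows de-allocation; a miss leaves the
 * entry in place so it can be re-played one or more times. */
static void on_exec_result(int i, bool completed) {
    if (completed)
        buf[i].completed = true;   /* slot may now be overwritten */
    else
        dispatch(i);               /* replay from the buffer, no re-fetch */
}

int main(void) {
    buf[0] = (Entry){ .op = 100 };
    dispatch(0);
    on_exec_result(0, false);      /* e.g. load missed in the data cache */
    on_exec_result(0, true);       /* replay succeeded; entry completes  */
    printf("slot 0 may be de-allocated: %s\n",
           buf[0].completed ? "yes" : "no");
    return 0;
}
```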
As shown in
In some embodiments, buffer 126 may be configured to store instructions in the form of instruction set architecture (ISA) instructions. Additionally or alternatively, in some embodiments, buffer 126 may be configured to store bundles of micro-operations, each micro-operation corresponding to one or more ISA instructions or parts of ISA instructions. It will be appreciated that virtually any suitable arrangement for storing instructions in bundles of micro-operations may be employed without departing from the scope of the present disclosure. For example, in some embodiments, a single instruction may be stored in a plurality of bundles of micro-operations, while in some embodiments a single instruction may be stored as a bundle of micro-operations. In yet other embodiments, a plurality of instructions may be stored as a bundle of micro-operations. In still other embodiments, buffer 126 may store individual instructions or micro-operations, e.g., instructions or micro-operations that do not comprise bundles at all.
In order for the boundary pointer to point to a valid location upon entry into runahead, buffer 126 is configured to be at least as large as the number of bundles spanning from the particular bundle causing entry into runahead through the last bundle of the same instruction, a span which may be referred to as the tail size for an instruction. Thus, in embodiments where bundles of micro-operations are stored in buffer 126, buffer 126 may be sized according to a pre-determined tail size for micro-operations associated with an architectural instruction. Such ISA instructions and/or bundles may be re-dispatched to execution logic 130 any suitable number of times.
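As a worked illustration of the tail-size constraint, with assumed numbers that do not come from the disclosure, suppose an ISA instruction decodes into at most four bundles and the first of those bundles triggers entry into runahead; the tail then spans all four bundles, so the buffer must hold at least four entries:

```c
/* Tail-size arithmetic with assumed numbers (illustrative only). */
#include <stdio.h>

/* Bundles from the triggering bundle through the last bundle of the
 * same ISA instruction; trigger_index is 0-based within the instruction. */
static int tail_size(int bundles_in_insn, int trigger_index) {
    return bundles_in_insn - trigger_index;
}

int main(void) {
    int max_bundles_per_insn = 4;  /* assumed decode limit, not from the disclosure */
    /* Worst case: the instruction's first bundle causes entry into runahead. */
    printf("minimum buffer size = %d bundles\n",
           tail_size(max_bundles_per_insn, 0));
    return 0;
}
```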
Pipeline 102 also includes mem logic 132 for performing load and/or store operations and writeback logic 134 for writing the result of operations to an appropriate location such as register 109. It should be understood that the stages shown in pipeline 102 are illustrative of a typical RISC implementation and are not meant to be limiting. For example, in some embodiments, the fetch logic and the scheduler logic functionality may be provided upstream of the pipeline, such as in the compilation of VLIW instructions or in code morphing. In some other embodiments, the scheduler logic may be included in the fetch logic and/or the decode logic of the microprocessor. More generally, a microprocessor may include fetch, decode, and execution logic, each of which may comprise one or more stages, with mem and writeback functionality being carried out by the execution logic. The present disclosure is equally applicable to these and other microprocessor implementations, including hybrid implementations that may use VLIW instructions and/or other logic instructions.
In the described examples, instructions may be fetched and executed one at a time, possibly requiring multiple clock cycles. During this time, significant parts of the data path may be unused. In addition to or instead of single instruction fetching, pre-fetch methods may be used to improve performance and avoid latency bottlenecks associated with read and store operations (i.e., the reading of instructions and loading such instructions into processor registers and/or execution queues). Accordingly, it will be appreciated that virtually any suitable manner of fetching, scheduling, and dispatching instructions may be used without departing from the scope of the present disclosure.
In the example shown in
As shown in
Pointer “R” references a read pointer that indicates a buffer address or a buffer location of an instruction entry selected to be read in preparation for dispatch to the execution mechanism. For example, the read pointer may point to an address for the selected instruction so that various dependencies of the selected instruction may be read prior to dispatch of the instruction. The read pointer is updated when the selected instruction (or portion thereof) is issued to the execution mechanism. In the example shown in
Pointer “D” references a de-allocation pointer that indicates a buffer address or a buffer location of an instruction entry that has completed and is ready to be logically, and in some embodiments physically, removed from the buffer. Thus, the de-allocation pointer points to the next instruction to be removed from the buffer, by being overwritten and/or deleted, for example. Accordingly, an instruction that is inserted into the buffer by the allocation pointer and is read by the read pointer will remain in the buffer until being indicated for removal by the de-allocation pointer. The de-allocation pointer is updated when the selected instruction (or portion thereof) completes. In the example shown in
In some embodiments, the de-allocation pointer may be advanced by moving from bundle to bundle (e.g., from A1 to A2), as the bundles may be dispatched and de-allocated from the buffer on a bundle-by-bundle basis. However, in some embodiments, the de-allocation pointer may be advanced by moving from instruction to instruction, even if the buffer stores bundles. For example, the de-allocation pointer may be advanced from A1 to B1, skipping A2, even though the bundles are dispatched individually (e.g., A1, A2, B1, etc.), as sketched below. Accordingly, in some embodiments the incremental advancement of the de-allocation pointer may differ from that of one or more of the other pointers, such as the read pointer, during a working state of the buffer.
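The following sketch illustrates instruction-granularity advancement under stated assumptions: the names are hypothetical, and the two-bundle instruction A followed by one-bundle instruction B mirrors the example above. The de-allocation pointer steps past all trailing bundles of an instruction at once:

```c
/* Illustrative only: de-allocation advancing instruction-by-instruction
 * over a buffer that stores bundles (names are hypothetical). */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;   /* e.g. "A1" */
    bool last_of_insn;  /* final bundle of its ISA instruction? */
} Bundle;

int main(void) {
    Bundle buf[] = { {"A1", false}, {"A2", true}, {"B1", true} };
    int d = 0;          /* de-allocation pointer, currently at A1 */

    /* Advance d from A1 directly to B1, skipping A2: scan forward to the
     * final bundle of the current instruction, then step past it. */
    while (!buf[d].last_of_insn)
        d++;
    d++;
    printf("de-allocation pointer now at %s\n", buf[d].name);
    return 0;
}
```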
It will be appreciated that though the sequence of the A, R, and D pointers is constrained in the order encountered by an instruction (e.g., in the order A, R, and D), any suitable number of buffer positions may separate the A, R, and D pointers, and further, other suitable buffer pointers may intervene as described below.
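The C sketch below (hypothetical layout and names, not drawn from any particular embodiment) models the A, R, and D pointers walking a circular buffer in the order just described: allocation leads, the read pointer follows on issue, and the de-allocation pointer trails until completion. The bundle-versus-instruction advancement granularity discussed above is not modeled.

```c
/* Sketch of the allocation (A), read (R), and de-allocation (D) pointers
 * on a circular scheduler buffer.  An entry is live from allocation until
 * de-allocation, so an instruction meets the pointers in the order A, R, D. */
#include <stdio.h>

#define BUF_SIZE 8

static int buf[BUF_SIZE];
static int A = 0;  /* next free slot: where a new entry is inserted      */
static int R = 0;  /* next entry to read/dispatch to an execution unit   */
static int D = 0;  /* oldest entry not yet completed (next to remove)    */

static void allocate(int op) { buf[A] = op; A = (A + 1) % BUF_SIZE; }

static int read_for_dispatch(void) {       /* advances on issue          */
    int op = buf[R];
    R = (R + 1) % BUF_SIZE;
    return op;
}

static void complete_oldest(void) {        /* advances on completion;    */
    D = (D + 1) % BUF_SIZE;                /* slot may now be overwritten */
}

int main(void) {
    allocate(10); allocate(11);            /* A leads                    */
    printf("dispatch %d\n", read_for_dispatch());  /* R follows A        */
    complete_oldest();                     /* D trails R                 */
    printf("A=%d R=%d D=%d\n", A, R, D);
    return 0;
}
```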
Pointer “RY” references a replay pointer that indicates a buffer address or a buffer location of an instruction that is to be re-dispatched for re-execution. The replay pointer is updated along with the de-allocation pointer, as once an instruction (or portion thereof) has completed, replay of that instruction is no longer an issue. In the embodiment shown in
In the example shown in
In some embodiments, re-dispatch or replay of an instruction from the buffer may result from exit of the microprocessor from a runahead state. Continuing with the example shown in
Because it is unknown when runahead will be entered, and the buffer checkpointed,
In some embodiments where the buffer stores ISA instructions, tracking the address or the location of the last complete instruction in the buffer may be comparatively simple because each ISA instruction may have an instruction pointer associated with it. In other embodiments, such as embodiments where bundles of micro-operations are stored in the buffer, the instruction pointer associated with an instruction may be included only in the last bundle of the set of bundles forming that instruction. In such embodiments, the restart instruction pointer may be updated when the last bundle for an instruction is inserted into the buffer. Further, because it can be difficult to identify the end of an instruction from bundles of micro-operations stored in the buffer, a boundary instruction pointer may be used to track the boundary between the bundle that corresponds to the restart instruction pointer and subsequent bundles, e.g., the boundary between the last complete instruction held in the buffer and a subsequent bundle belonging to an incomplete instruction held in the buffer. The restart instruction pointer may then track the address or the location of the last complete instruction held in the buffer as tracked by the boundary instruction pointer, and the boundary instruction pointer may be updated concurrently with the restart instruction pointer. In
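A sketch of this bundle-granularity tracking is given below, under assumptions not drawn from the disclosure: each bundle carries its instruction's pointer, only an instruction's final bundle is flagged as such, and the restart instruction pointer simply advances to the next sequential instruction pointer.

```c
/* Hypothetical sketch: the restart instruction pointer is only updated
 * when the LAST bundle of an ISA instruction enters the buffer, and the
 * boundary pointer marks the buffer slot just past that bundle. */
#include <stdbool.h>
#include <stdio.h>

#define BUF_SIZE 8

typedef struct {
    int  isa_ip;        /* instruction pointer carried by the bundle     */
    bool last_of_insn;  /* true only for an instruction's final bundle   */
} Bundle;

static Bundle buf[BUF_SIZE];
static int alloc_ptr = 0;
static int boundary_ptr = 0;   /* slot after the last complete instruction */
static int restart_ip = 0;     /* where fetch resumes after runahead       */

static void insert(Bundle b) {
    buf[alloc_ptr] = b;
    alloc_ptr = (alloc_ptr + 1) % BUF_SIZE;
    if (b.last_of_insn) {              /* instruction now complete in the */
        boundary_ptr = alloc_ptr;      /* buffer: move both trackers      */
        restart_ip = b.isa_ip + 1;     /* assumed sequential IP encoding  */
    }
}

int main(void) {
    insert((Bundle){ .isa_ip = 7, .last_of_insn = false }); /* A1 */
    insert((Bundle){ .isa_ip = 7, .last_of_insn = true  }); /* A2 */
    insert((Bundle){ .isa_ip = 8, .last_of_insn = false }); /* B1: incomplete */
    printf("restart_ip=%d boundary=%d\n", restart_ip, boundary_ptr);
    return 0;
}
```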
In the embodiments described herein, the buffer exists in checkpointed and non-checkpointed versions. Thus, returning to the example shown in
At 302, method 300 includes requesting data by an instruction in execution. In some embodiments, the request at 302 may be issued by memory logic using a data pointer that points to an address or a location for the data. For example, when the data pointer of the memory logic points to the address for the data, the memory logic is directed to that address to retrieve that particular data. In the example shown in
While the requested data is being retrieved from main memory, method 300 proceeds to 306 where the pipeline enters runahead and, at 308, checkpoints the buffer including the instruction causing entry into runahead. Checkpointing the buffer preserves the state of the buffer (in checkpointed form) during runahead while a non-checkpointed version of the buffer operates in a working state. The checkpointed version is generated upon entry into runahead by the scheduler logic.
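By way of a non-limiting sketch, checkpointing at runahead entry can be modeled as a single copy of the buffer contents together with its pointer state; the `SchedBuffer` structure and function names below are assumptions for illustration only.

```c
/* Sketch of checkpointing at runahead entry (illustrative only): the
 * working buffer keeps operating speculatively while a checkpointed copy
 * preserves the pre-runahead state. */
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 8

typedef struct {
    int ops[BUF_SIZE];
    int dealloc_ptr;   /* pointer state preserved with the contents */
} SchedBuffer;

static SchedBuffer working;       /* non-checkpointed: runs during runahead */
static SchedBuffer checkpointed;  /* frozen until runahead ends             */

static void enter_runahead(void) {
    /* One shallow copy freezes both contents and pointers. */
    memcpy(&checkpointed, &working, sizeof working);
}

int main(void) {
    working.ops[0] = 100;
    working.dealloc_ptr = 0;
    enter_runahead();             /* load miss: snapshot taken           */
    working.dealloc_ptr = 5;      /* working copy churns during runahead */
    printf("checkpoint keeps dealloc=%d\n", checkpointed.dealloc_ptr);
    return 0;
}
```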
In the example shown in
At time 402, method 400 shows the generation of a checkpointed version of the buffer, including a checkpointed version of the de-allocation pointer. In the checkpointed version of the buffer, the de-allocation pointer and the content of the buffer remain in the checkpointed state awaiting the end of runahead.
The embodiment of the checkpointed version of the buffer shown in
When restarting from runahead, the fetch logic is typically directed to an instruction pointer following the last complete instruction in the buffer. As introduced above, the restart instruction pointer keeps track of an address or a location following the last complete instruction in the buffer. In embodiments where the buffer holds bundles of micro-operations, a boundary pointer may be used to track a boundary between a bundle that corresponds to the restart instruction pointer and subsequent bundles, e.g., for tracking a boundary between a last complete instruction held in the buffer and a subsequent bundle. The embodiment shown in
Continuing with
In the embodiment shown in
Meanwhile, at 312 and 314, the missing data responsible for causing entry into runahead is provided to the pipeline, resolving the load miss and causing runahead to end at 316. Once the initial load miss is fulfilled, the pipeline is flushed at 318, which may include discarding one or more invalid results of the speculative execution performed during runahead and, in some embodiments, discarding the non-checkpointed version of the buffer, so that the microprocessor is in the same state in which it existed when runahead commenced.
At 320, method 300 includes resetting the buffer to the checkpointed version of the buffer so that the buffer is restored to the state it had when the microprocessor entered runahead. Resetting the buffer to the checkpointed version at 320 includes restoring the de-allocation pointer from the checkpointed version of the buffer.
In the embodiment shown in
Continuing with
In the example shown in
Continuing with
It will be appreciated that the checkpointing schemes described above are provided for illustrative purposes only, and that virtually any suitable approach to checkpointing the buffer may be employed without departing from the scope of the present disclosure. For example, in one scenario, the buffer may be checkpointed along with the boundary, de-allocation, and allocation pointers at entry into runahead. At exit from runahead, the buffer may be restored by copying the checkpointed versions of the boundary, de-allocation, and allocation pointers from the checkpointed versions thereof, where the de-allocation pointer is also copied into the read and replay pointers so that the read, replay, and de-allocation pointers point to a common address.
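The scenario just described may be sketched in C as follows; only pointer state is modeled, the buffer contents are omitted, and all names are hypothetical. The key step is that at exit the checkpointed de-allocation value is copied into three working pointers at once, so that R, RY, and D point to a common address.

```c
/* Sketch of the restore-on-exit scheme described above (names assumed). */
#include <stdio.h>

typedef struct {
    int alloc, read, dealloc, replay, boundary;
} Pointers;

static Pointers working, checkpoint;

static void enter_runahead(void) {
    checkpoint = working;               /* taken with the buffer contents */
}

static void exit_runahead(void) {
    /* Pipeline flush discards runahead results first (not shown), then: */
    working.alloc    = checkpoint.alloc;
    working.boundary = checkpoint.boundary;
    working.dealloc  = checkpoint.dealloc;
    working.read     = checkpoint.dealloc;   /* re-read from oldest entry */
    working.replay   = checkpoint.dealloc;   /* re-dispatch from there    */
}

int main(void) {
    working = (Pointers){ .alloc = 4, .read = 2, .dealloc = 1,
                          .replay = 1, .boundary = 3 };
    enter_runahead();
    working.read = 4;                   /* pointers move during runahead */
    exit_runahead();
    printf("R=%d RY=%d D=%d\n", working.read, working.replay, working.dealloc);
    return 0;
}
```

With the read and replay pointers reset to the de-allocation address, the next dispatch naturally begins with the oldest incomplete instruction, which yields the re-dispatch behavior described above.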
It will be appreciated that methods described herein are provided for illustrative purposes only and are not intended to be limiting. Accordingly, it will be appreciated that in some embodiments the methods described herein may include additional or alternative processes, while in some embodiments, the methods described herein may include some processes that may be reordered or omitted without departing from the scope of the present disclosure. Further, it will be appreciated that the methods described herein may be performed using any suitable hardware including the hardware described herein.
This written description uses examples to disclose the invention, including the best mode, and also to enable a person of ordinary skill in the relevant art to practice the invention, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the invention is defined by the claims, and may include other examples as understood by those of ordinary skill in the art. Such other examples are intended to be within the scope of the claims.
Number | Name | Date | Kind |
---|---|---|---|
5487146 | Guttag et al. | Jan 1996 | A |
5721855 | Hinton et al. | Feb 1998 | A |
5864692 | Faraboschi | Jan 1999 | A |
5870582 | Cheong et al. | Feb 1999 | A |
5956753 | Glew et al. | Sep 1999 | A |
6037946 | Takeda | Mar 2000 | A |
6484254 | Chowdhury | Nov 2002 | B1 |
6519694 | Harris | Feb 2003 | B2 |
6665792 | Merchant | Dec 2003 | B1 |
7010648 | Kadambi et al. | Mar 2006 | B2 |
7062631 | Klaiber et al. | Jun 2006 | B1 |
7117330 | Alverson et al. | Oct 2006 | B1 |
7194604 | Bigelow et al. | Mar 2007 | B2 |
7293161 | Chaudhry | Nov 2007 | B1 |
7421567 | Eickemeyer | Sep 2008 | B2 |
7587584 | Enright | Sep 2009 | B2 |
7752627 | Jones et al. | Jul 2010 | B2 |
7873793 | Rozas et al. | Jan 2011 | B1 |
7890735 | Tran | Feb 2011 | B2 |
8035648 | Wloka et al. | Oct 2011 | B1 |
8707011 | Glasco et al. | Apr 2014 | B1 |
9582280 | Kumar | Feb 2017 | B2 |
9632976 | Rozas et al. | Apr 2017 | B2 |
20030018685 | Kalafatis et al. | Jan 2003 | A1 |
20030196010 | Forin et al. | Oct 2003 | A1 |
20040128448 | Stark et al. | Jul 2004 | A1 |
20050041031 | Diard | Feb 2005 | A1 |
20050055533 | Kadambi et al. | Mar 2005 | A1 |
20050138332 | Kottapalli et al. | Jun 2005 | A1 |
20050154831 | Steely et al. | Jul 2005 | A1 |
20060010309 | Chaudhry et al. | Jan 2006 | A1 |
20060095678 | Bigelow et al. | May 2006 | A1 |
20060149931 | Haitham et al. | Jul 2006 | A1 |
20060174228 | Radhakrishnan et al. | Aug 2006 | A1 |
20060179279 | Jones et al. | Aug 2006 | A1 |
20060212688 | Chaudhry et al. | Sep 2006 | A1 |
20060277398 | Akkary et al. | Dec 2006 | A1 |
20070074006 | Martinez et al. | Mar 2007 | A1 |
20070174555 | Burtscher et al. | Jul 2007 | A1 |
20070186081 | Chaudhry et al. | Aug 2007 | A1 |
20070204137 | Tran | Aug 2007 | A1 |
20090019317 | Quach et al. | Jan 2009 | A1 |
20090327661 | Sperber et al. | Dec 2009 | A1 |
20100199045 | Bell et al. | Aug 2010 | A1 |
20100205402 | Henry | Aug 2010 | A1 |
20100205415 | Henry | Aug 2010 | A1 |
20110264862 | Karlsson et al. | Oct 2011 | A1 |
20120023359 | Edmeades et al. | Jan 2012 | A1 |
20120089819 | Chaudhry | Apr 2012 | A1 |
20130124829 | Chou et al. | May 2013 | A1 |
20140082291 | Van Zoeren et al. | Mar 2014 | A1 |
20140122805 | Ekman et al. | May 2014 | A1 |
20140136891 | Holmer et al. | May 2014 | A1 |
20140164736 | Rozas | Jun 2014 | A1 |
20140164738 | Ekman et al. | Jun 2014 | A1 |
20140281259 | Klaiber et al. | Sep 2014 | A1 |
20150026443 | Kumar et al. | Jan 2015 | A1 |
20170199778 | Ekman et al. | Jul 2017 | A1 |
Number | Date | Country |
---|---|---|
1519728 | Aug 2004 | CN |
1629799 | Jun 2005 | CN |
1831757 | Sep 2006 | CN |
102184127 | Sep 2011 | CN |
103793205 | May 2014 | CN |
102013218370 | Mar 2014 | DE |
0671718 | Sep 1995 | EP |
2287111 | Sep 1995 | GB |
200405201 | Apr 2004 | TW |
200529071 | Sep 2005 | TW |
I263938 | Oct 2006 | TW |
I275938 | Mar 2007 | TW |
200723111 | Jun 2007 | TW |
200809514 | Feb 2008 | TW |
I315488 | Oct 2009 | TW |
201032627 | Sep 2010 | TW |
201112254 | Apr 2011 | TW |
I425418 | Feb 2014 | TW |
I536167 | Jun 2016 | TW |
Entry |
---|
Dundas, James and Trevor Mudge, “Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss,” Proc. 1997 ACM Int. Conf. on Supercomputing, Jul. 1997, Dept. of Electrical Engineering and Computer Science, University of Michigan, 9 pages. |
Mutlu, Onur et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” This paper appears in: “High-Performance Computer Architecture,” Feb. 8-12, 2003, 13 pages. |
Chaudhry, S. et al., “High-Performance Throughput Computing,” Micro, IEEE 25.3, pp. 32-45, May 2005, 14 pages. |
Rozas, Guillermo J. et al., “Queued Instruction Re-Dispatch After Runahead,” U.S. Appl. No. 13/730,407, filed Dec. 28, 2012, 36 pages. |
Adve, S. et al., “Shared Memory Consistency Models: A Tutorial,” WRL Research Report 95/7, Digital Western Research Laboratory, Sep. 1995, 32 pages. |
Dehnert, J. et al., “The Transmeta Code Morphing Software: Using Speculation, Recovery, and Adaptive Retranslation to Address Real-Life Challenges,” Mar. 23, 2003, IEEE, CGO '03 Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, pp. 15-24. |
Ekman, M. et al., “Instruction Categorization for Runahead Operation,” U.S. Appl. No. 13/708,544, filed Dec. 7, 2012, 32 pages. |
Ekman, M. et al., “Selective Poisoning of Data During Runahead,” U.S. Appl. No. 13/662,171, filed Oct. 26, 2012, 33 pages. |
Holmer, B., et al., “Managing Potentially Invalid Results During Runahead”, U.S. Appl. No. 13/677,085, filed Nov. 14, 2012, 29 pages. |
Intel Itanium Architecture Software Developer's Manual, Intel, http://www.intel.com/design/itanium/manuals/iasdmanual.htm, 1 page. |
Nvidia Corp., “Akquirierung spekulativer Genehmigung für gemeinsam genutzten Speicher” (“Acquisition of Speculative Approval for Shared Memory”), Mar. 20, 2014, DE102013218370 A1, German Patent Office, All Pages. |
Rozas, J. et al., “Lazy Runahead Operation for a Microprocessor”, U.S. Appl. No. 13/708,645, filed Dec. 7, 2012, 32 pages. |
Wikipedia article, “Instruction Prefetch,” https://en.wikipedia.org/wiki/Instruction_prefetch, downloaded May 23, 2016. |
Wikipedia article, “x86,” https://en.wikipedia.org/wiki/X86, downloaded May 23, 2016. |
Altera, “Implementing a Queue Manager in Traffic Management Systems,” Feb. 2004, pp. 1-8. |
Number | Date | Country
---|---|---|
20130297911 A1 | Nov 2013 | US |