1. Field
The described embodiments relate to computer systems. More specifically, the described embodiments relate to techniques for limiting speculative instruction fetching in a processor.
2. Related Art
Some modern microprocessors support speculatively executing program code. Although these processors can support a number of different speculative-execution modes, speculative execution generally involves speculatively executing instructions while preserving a pre-speculation architectural state, which enables the processor to discard speculative results and return to the pre-speculation architectural state in the event that a predefined operating condition (e.g., encountering an error/trap, a coherence violation, or a lack of resources, or executing certain types of instructions) occurs during speculative execution.
In some processors, upon encountering certain “pipe-clearing events” while speculatively executing instructions, the processor immediately terminates speculative execution, restores the pre-speculation architectural state, and resumes normal (i.e., non-speculative) execution. Generally, these pipe-clearing events are triggered by operating conditions or instructions for which some of the stages of the instruction execution pipeline in the processor are cleared. For example, one type of pipe-clearing event is a pipe-clearing instruction, which is an instruction that is implemented such that it always triggers a pipe-clear. When a pipe-clearing instruction is executed, the processor flushes the pipeline stages behind the instruction once the instruction has progressed through the pipeline and completed execution (e.g., instructions such as the DONE, RETRY, WRPR, or WRY instructions in the SPARC processor architecture from SPARC International Inc., of Menlo Park, Calif., USA). Another example of a pipe-clearing event is an error condition or a trap. For example, upon encountering a divide instruction for which the denominator is zero, the processor clears the pipe and handles a divide-by-zero trap.
Because the processor terminates speculative execution and restores the pre-speculation architectural state upon encountering a pipe-clearing event, the processor can encounter inefficiencies. One such inefficiency occurs when the processor issues prefetches or memory requests (e.g., TLB requests or requests for cache lines) while speculatively executing instructions and then encounters a pipe-clearing event. Because the pipe-clearing event causes the processor to immediately terminate speculative execution and restore the pre-speculation architectural state, any data subsequently returned by the prefetches, translation requests, or memory requests generated during speculative execution is discarded by the processor without being used to update the processor state. Thus, the computational work done by the processor's memory system in accessing and returning this data is wasted.
This effect can be worsened in processors where the processor restarts speculative execution if a condition that originally caused the processor to start speculative execution has not been resolved when the processor terminates speculative execution due to a pipe-clearing event. In these processors, if the condition has not been resolved, the processor can immediately restart speculative execution and again encounter the same (or a subsequent) pipe-clearing event. Each time the processor speculatively re-executes the same instructions, the processor can generate the same prefetch and memory requests. Thus, the processor can speculatively re-execute the same instructions multiple times, potentially flooding the memory system with numerous redundant requests.
The described embodiments relate to a processor that speculatively executes instructions. During operation, the processor executes instructions in a speculative-execution mode. Upon detecting an impending pipe-clearing event while executing instructions in the speculative-execution mode, the processor stalls an instruction fetch unit to prevent the instruction fetch unit from fetching additional speculative instructions and thereby potentially generating additional memory requests (e.g., cache line prefetches, TLB page entry requests, etc.).
In some embodiments, the processor stalls the instruction fetch unit until a condition that originally caused the processor to operate in the speculative-execution mode is resolved.
In some embodiments, the processor stalls the instruction fetch unit until the pipe-clearing event has been completed (i.e., has been handled in the processor).
In some embodiments, before releasing the stall, the processor restores a preserved architectural state that existed just before the processor commenced executing instructions in the speculative-execution mode. In these embodiments, the processor can then resume operation in a normal-execution mode.
In some embodiments, when speculatively executing instructions, the processor uses one strand from a set of two or more strands on the processor to speculatively execute the instructions. In these embodiments, upon detecting an impending pipe-clearing event for the strand, the processor performs a per-strand stall of the instruction fetch unit to prevent the instruction fetch unit from fetching instructions for the strand and thereby potentially generating additional memory requests (e.g., cache line prefetches, TLB page entry requests, etc.).
In some embodiments, the processor releases the stall when an instruction or operating condition prevents the pipe-clearing event from being completed.
In some embodiments, speculatively executing instructions involves executing instructions in a scout mode, an execute-ahead mode, a branch-prediction mode, or another speculative-execution mode.
In some embodiments, the pipe-clearing event is caused by at least one of: (1) a pipe-clearing instruction; (2) a trap condition; (3) a miss for a delay slot for a branch; or (4) an operating condition that will cause the clearing of one or more stages of a pipeline in the processor.
In the figures, the same reference numerals refer to the same figure elements.
The following description is presented to enable any person skilled in the art to make and use the described embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the described embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the described embodiments. Thus, the described embodiments are not limited to the embodiments shown, but are to be accorded the widest scope consistent with the principles and features disclosed herein.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory and non-volatile memory, including magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), and DVDs (digital versatile discs or digital video discs), or other media capable of storing data structures or code.
The methods and processes described in this detailed description can be included in hardware modules. For example, the hardware modules can include, but are not limited to, application-specific integrated circuit (ASIC) chips, field-programmable gate arrays (FPGAs), and other programmable-logic devices now known or later developed. When the hardware modules are activated, the hardware modules perform the methods and processes included within the hardware modules. In some embodiments, the hardware modules include one or more general-purpose circuits that are configured by executing instructions to perform some or all of the methods and processes.
In the described embodiments, processor 102 (see
In the described embodiments, processor 102 also supports one or more speculative-execution modes. For example, processor 102 can support a scout mode, wherein instructions are speculatively executed to generate prefetches, but the results from the instructions (if any) are not committed to the architectural state. In addition, processor 102 can support an execute-ahead mode or a branch-prediction mode, wherein instructions are speculatively executed during a data-dependent stall condition or a branch prediction to enable processor 102 to perform useful computational work until the data returns or the branch is resolved.
Note that, although we use these speculative-execution modes as examples, the described embodiments can support any of a number of different speculative-execution modes. Generally, speculative execution involves speculatively executing instructions while preserving a pre-speculation architectural state to enable the processor to discard speculative results and return to the pre-speculation architectural state in the event that an error condition occurs during speculative execution.
In the described embodiments, processor 102 monitors instruction execution for an indication that a pipe-clearing event is going to occur during speculative execution. For example, processor 102 can detect pipe-clearing instructions in the earliest stage in pipeline 112 where the pipe-clearing instruction can be clearly identified (e.g., a decode stage in pipeline 112), but before the pipe-clearing instruction has caused pipeline 112 stages to be cleared. In addition, processor 102 can detect an operating condition that will cause a trap and/or an error in a later stage (e.g., a commit/trap stage) of pipeline 112. For example, processor 102 can detect a division operation for which the denominator is zero in an execution stage and recognize the division operation as the source of a divide-by-zero trap in a commit/trap stage later in pipeline 112. Alternatively, processor 102 can detect a load miss for a delay slot for a branch.
In the described embodiments, upon determining that a pipe-clearing event is going to occur, processor 102 stalls fetch unit 120 in pipeline 112 to prevent fetch unit 120 from fetching instructions and thereby potentially generating additional memory requests (e.g., cache line prefetches, TLB page entry requests, etc.). Processor 102 maintains the stall of fetch unit 120 until either: (1) the pipe-clearing event has been handled in pipeline 112, or (2) a stall condition that caused processor 102 to start speculatively executing instructions is resolved (i.e., until the speculative-execution episode is complete). Processor 102 then terminates the scout mode, restores a pre-scout-mode architectural state of processor 102 (i.e., restores a checkpoint), and releases the stall of fetch unit 120, enabling fetch unit 120 to continue fetching instructions in the normal-execution mode.
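For illustration only, the stall-and-release behavior just described can be summarized as a small state machine. The following Python sketch is a hypothetical behavioral model; the class and method names are illustrative assumptions and do not correspond to actual circuitry in processor 102.

```python
class FetchStallSketch:
    """Behavioral model of the fetch-unit stall described above."""

    def __init__(self):
        self.fetch_enabled = True  # fetch unit fetching normally

    def on_impending_pipe_clear(self):
        # A pipeline unit detected an impending pipe-clearing event
        # during speculative execution: stall the fetch unit.
        self.fetch_enabled = False

    def on_pipe_clear_handled(self):
        # Release condition (1): the pipe-clearing event has been
        # handled in the pipeline.
        self.fetch_enabled = True

    def on_speculation_episode_over(self):
        # Release condition (2): the stall condition that started the
        # speculative episode is resolved; the checkpoint is restored
        # and normal execution resumes.
        self.fetch_enabled = True

    def may_fetch(self):
        return self.fetch_enabled
```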
In these embodiments, by recognizing that the pipe-clearing event is going to occur and stalling the fetch unit during speculative execution, processor 102 avoids fetching (and potentially executing) instructions that will eventually be flushed from pipeline 112 when some or all of the stages of pipeline 112 are cleared as the pipe-clearing event is handled. This facilitates processor 102 avoiding sending memory requests (TLB page entry requests, cache line requests, etc.) for an underlying instruction that is eventually going to be flushed from pipeline 112 during the pipe-clearing event, which would render the value returned in response to the request useless. By avoiding sending the memory requests, these embodiments can avoid the situation that occurs in existing systems where a pipe-clearing event repeatedly causes a speculative episode to fail and be restarted, needlessly sending the same memory request(s) numerous times.
Processor 102 can include any device that is configured to perform computational operations. For example, processor 102 can be a central processing unit (CPU) such as a microprocessor, a controller, or an application-specific integrated circuit.
Mass-storage device 110, memory 108, L2 cache 106, and L1 cache 104 are computer-readable storage media that collectively form a memory hierarchy in a memory subsystem that stores data and instructions for processor 102. Generally, mass-storage device 110 is a high-capacity non-volatile memory, such as a disk drive or a large flash memory, with a large access time, while L1 cache 104, L2 cache 106, and memory 108 are smaller, faster semiconductor memories that store copies of frequently used data. For example, memory 108 can be a dynamic random access memory (DRAM) structure that is larger than L1 cache 104 and L2 cache 106, whereas L1 cache 104 and L2 cache 106 can include smaller static random access memories (SRAMs). In some embodiments, L2 cache 106, memory 108, and mass-storage device 110 are shared among one or more processors in computer system 100. In addition, in some embodiments, L1 cache 104 comprises two separate caches, an instruction cache (see, e.g., I-cache 206 in
In addition to the memory hierarchy, processor 102 includes one or more translation lookaside buffers (TLBs) (not shown in
Computer system 100 can be incorporated into many different types of electronic devices. For example, computer system 100 can be part of a desktop computer, a laptop computer, a server, a media player, an appliance, a cellular phone, a piece of testing equipment, a network appliance, a calculator, a personal digital assistant (PDA), a hybrid device (i.e., a “smart phone”), a guidance system, a toy, audio/video electronics, a video game system, a control system (e.g., an automotive control system), or another electronic device.
Although we use specific components to describe computer system 100, in alternative embodiments, different components can be present in computer system 100. For example, computer system 100 can include video cards, network cards, optical drives, network controllers, I/O devices, and/or other peripheral devices that are coupled to processor 102 using a bus, a network, or another suitable communication channel. Alternatively, computer system 100 may include more or fewer of the elements shown in
Note that, for clarity and brevity, in the following description, we use scout mode to describe the embodiments. However, the described embodiments are also operable with other forms of speculative execution. Generally, the described embodiments are operable with any form of speculative execution wherein, upon detecting a pipe-clearing event, processor 102 terminates speculative execution, restores a preserved pre-speculation architectural state, and resumes operation. The described embodiments facilitate preventing processor 102 and/or computer system 100 from sending needless memory requests (e.g., TLB instruction translation requests, cache line requests, etc.) during any mode of speculative execution once it becomes clear that a pipe-clearing event will cause the pre-speculation architectural state to be restored, rendering outstanding speculative memory requests useless.
As shown in
Generally, pipeline 112 is an instruction execution pipeline that includes a number of stages for executing program code. Within pipeline 112, fetch unit 120 fetches instructions from L1 cache 104 (or from other levels of the memory hierarchy) for execution. Next, decode unit 122 decodes the fetched instructions and prepares the instructions for execution in execution unit 124.
Execution unit 124 then executes the instructions forwarded from decode unit 122. Note that execution unit 124 can include one or more floating point execution units, integer execution units, branch execution units, and/or memory execution units (e.g., load-store units). Finally, commit/trap unit 126 retires successfully executed instructions (i.e., committing the results to the architectural state of processor 102 and computer system 100) and handles traps/errors that arise during the execution of instructions.
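For intuition, the in-order flow through these stages can be modeled with a toy simulation. The following sketch is purely illustrative; it ignores stalls, traps, and hazards, and its names are not taken from the described embodiments.

```python
from collections import deque

def run_pipeline(program):
    """Toy model of pipeline 112's stage ordering: fetch feeds decode,
    instructions advance one stage per cycle through execute, and the
    commit/trap stage retires them in program order."""
    stages = deque([None, None, None], maxlen=3)  # decode, execute, commit
    committed = []
    for instruction in list(program) + [None] * 3:  # extra cycles drain the pipe
        retiring = stages[-1]
        stages.appendleft(instruction)  # fetched instruction enters decode
        if retiring is not None:
            committed.append(retiring)  # commit/trap stage retires it
    return committed
```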
Note that in some embodiments, fetch unit 120 includes two separate pipeline stages. In these embodiments, the first stage is a TLB translation stage where fetch unit 120 translates a virtual instruction address into a physical address, and the second stage is an access to an instruction cache to retrieve the instruction using the translated address. However, for clarity, we include both sub-stages within the single fetch unit 120.
Note that pipeline 112 is simplified for the purposes of illustration. In alternative embodiments, pipeline 112 can contain other stages (units), functional blocks, mechanisms, and/or circuits. The units, functional blocks, mechanisms, and/or circuits that can be used in a pipeline are known in the art and hence are not described in more detail.
In the described embodiments, stall control 128 in fetch unit 120 includes one or more mechanisms for stalling fetch unit 120 to prevent instructions from being fetched. In some embodiments, while fetch unit 120 is stalled, pipeline 112 continues to run (i.e., the circuits in pipeline 112 are active), but no newly fetched instructions are forwarded from fetch unit 120 into decode unit 122. In some embodiments, while stalled, fetch unit 120 feeds no-ops (NOOPs) into pipeline 112.
In alternative embodiments, the output of fetch unit 120 continues to be set at a last fetched instruction value as long as fetch unit 120 is stalled. In these embodiments, pipeline 112 can simply continue to run, allowing whatever instruction is present on the output of fetch unit 120 to flow through pipeline 112 without allowing the results to affect the architectural state of processor 102. In the embodiments where pipeline 112 continues to pick up the last fetched instruction from fetch unit 120, processor 102 can include one or more mechanisms for preventing the results of instructions from affecting the architectural state of processor 102 while fetch unit 120 is stalled (not shown).
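As a rough illustration of the two stall behaviors just described, the following sketch models the fetch unit's output while stalled. The function and its parameters are hypothetical and illustrate the behavior only.

```python
def fetch_output(stalled, last_instruction, next_instruction, feed_noops=True):
    """Model of the fetch unit's output during a stall: either feed
    NOOPs into the pipeline, or keep presenting the last fetched
    instruction (whose results must then be kept from updating the
    architectural state)."""
    if not stalled:
        return next_instruction
    return "NOOP" if feed_noops else last_instruction
```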
As shown in
In the embodiments shown in
In the embodiments shown in
In these embodiments, each of the units in pipeline 112 includes one or more mechanisms for detecting an indication of an impending pipe-clearing event. For example, execution unit 124 can include one or more mechanisms for determining when an instruction will generate a trap in the commit/trap unit 126. As another example, decode unit 122 can include one or more mechanisms for monitoring instruction decode to determine when a “pipe-clearing” instruction has been decoded.
In the embodiment shown in
In the embodiment shown in
Note that while instruction fetch unit 120 is stalled and hence not fetching instructions, instruction fetch unit 120 also does not send ITLB requests, cache line requests, or other types of memory requests based on the fetched instructions. This can facilitate the described embodiments avoiding sending duplicative requests based on repeatedly executed instructions. In addition, because instruction fetch unit 120 is stalled and therefore does not forward instructions to be decoded and executed, other memory requests, such as data cache line requests or data translation lookaside buffer (DTLB) requests from the load-store execution unit in execution unit 124, are also avoided.
In some embodiments, processor 102 includes a checkpoint-generation mechanism (not shown). This checkpoint-generation mechanism includes one or more register files, memories, tables, lists, or other structures that facilitate saving a copy of the architectural state of processor 102. In these embodiments, when commencing speculative execution, the checkpoint-generation mechanism can perform operations to checkpoint the architectural state of processor 102. Generally, the architectural state includes copies of all structures, memories, registers, flags, variables, counters, etc., that are useful or necessary for restarting processor 102 from the pre-speculation architectural state. Note that the checkpoint-generation mechanism may not immediately copy values to preserve the pre-speculation architectural state. In some embodiments, the state is only preserved as necessary. For example, before a register, counter, variable, etc. is overwritten or changed during speculative execution, the checkpoint-generation mechanism can preserve a copy. In some embodiments, the checkpoint-generation mechanism is distributed among one or more of the sub-blocks of processor 102.
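The as-needed preservation described above resembles a copy-on-write scheme. The sketch below is one possible illustration in Python; it is not the checkpoint-generation mechanism itself, and all names are hypothetical.

```python
class LazyCheckpointSketch:
    """Preserve a register's pre-speculation value only the first
    time it is overwritten during speculative execution."""

    def __init__(self, registers):
        self.registers = dict(registers)  # live architectural state
        self.saved = {}                   # preserved pre-speculation values

    def speculative_write(self, name, value):
        if name not in self.saved:             # first overwrite only
            self.saved[name] = self.registers.get(name)
        self.registers[name] = value

    def restore(self):
        # Discard speculative results; return to the checkpoint.
        self.registers.update(self.saved)
        self.saved.clear()
```

In this sketch, a speculative write first copies the register's old value aside, so a later restore can undo every speculative update in one pass without having copied the entire architectural state up front.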
Computer system 100 also includes functional blocks, circuits, and hardware for operating in a scout mode. Exemplary embodiments of a system that supports scout mode are described in U.S. Pat. Pub. No. 2004/0133769, entitled "Generating Prefetches by Speculatively Executing Code Through Hardware Scout Threading," by inventors Shailender Chaudhry and Marc Tremblay, which is hereby incorporated by reference to describe scout mode mechanisms. Note that although we provide this reference as an example of a system that supports scout mode, numerous other references describe additional aspects of scout mode operation. See, for example, U.S. Pat. No. 7,529,911, entitled "Hardware-Based Technique for Improving the Effectiveness of Prefetching During Scout Mode," by inventors Lawrence Spracklen, Yuan Chou, and Santosh Abraham, which is hereby incorporated by reference to describe scout mode mechanisms, along with other conference papers, patent publications, and issued patents.
In addition, in some embodiments, computer system 100 can include mechanisms (not shown) for operating in other speculative-execution modes. For example, processor 102 can include mechanisms for supporting execute-ahead mode, speculative branch prediction, and/or other forms of speculative execution.
Fetch request generator 202 generates fetch addresses for fetching cache lines containing instructions from I-cache 206. Generally, during operation, fetch request generator 202 generates addresses for fetching cache lines in sequence from I-cache 206. However, instructions that cause processor 102 to jump or branch to another location in program code can cause fetch request generator 202 to generate addresses for fetching cache lines from corresponding and potentially non-sequential addresses in I-cache 206. Techniques for generating addresses for fetching cache lines are known in the art and are not described in more detail.
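A minimal sketch of this address-generation behavior follows, assuming a 64-byte cache line; the constant and function names are illustrative assumptions.

```python
CACHE_LINE_SIZE = 64  # bytes per cache line; illustrative value

def next_fetch_address(current_address, redirect_target=None):
    """Advance fetch to the next sequential cache line unless a jump
    or branch redirects fetch to a (potentially non-sequential)
    target address."""
    if redirect_target is not None:
        return redirect_target
    return current_address + CACHE_LINE_SIZE
```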
ITLB 204 is a lookup structure used by fetch unit 120 for translating virtual addresses of cache lines of instructions into the physical addresses where the cache lines are actually located in memory. ITLB 204 has a number of slots that contain page table entries that map virtual addresses to physical addresses. In some embodiments, ITLB 204 is a content-addressable memory (CAM), in which the search key is the virtual address and the search result is the corresponding physical address. Generally, if a requested virtual address is present in ITLB 204 (an “ITLB hit”), ITLB 204 provides the corresponding physical address, which is then used to attempt to fetch the cache line from I-cache 206. Otherwise, if the virtual address is not present in ITLB 204 (an “ITLB miss”), the translation is performed using a high-latency “page walk,” which involves computing the physical address using one or more values retrieved from the memory subsystem. Note that in some embodiments, processor 102 requests the page entry from one or more higher levels of TLB (not shown) before performing the page walk.
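The hit/miss behavior of ITLB 204 might be sketched as follows. The dictionary-based lookup and the page-walk placeholder are illustrative assumptions, not a model of the CAM hardware itself.

```python
def itlb_translate(itlb, virtual_page, page_walk_requests):
    """Sketch of an ITLB lookup: return the physical page on a hit;
    on a miss, record a high-latency page-walk request and return
    None until the translation becomes available."""
    if virtual_page in itlb:                   # ITLB hit
        return itlb[virtual_page]
    page_walk_requests.append(virtual_page)    # ITLB miss: start a page walk
    return None
```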
ITLB miss buffer 208 is a memory that includes a number of entries for recording ITLB misses. In the described embodiments, when a virtual address lookup misses in ITLB 204, a request is sent to the memory subsystem to perform a page walk. The request is recorded in an entry in ITLB miss buffer 208. When a physical address is returned in response to an outstanding request, ITLB 204 can be updated and the entry in the ITLB miss buffer 208 can be cleared or invalidated.
I-cache 206 is a cache memory that stores a number of cache lines containing instructions. Generally, a request for a given cache line address can be sent to I-cache 206 to perform a lookup for the cache line. If the cache line is present in I-cache 206 (a “hit”), the cache line can be returned in response to the request. Otherwise, if the cache line is not present in I-cache 206 (a “miss”), the request can be forwarded to the next level in the memory subsystem.
I-cache miss buffer 210 is a memory that includes a number of entries for recording I-cache misses. In the described embodiments, when a lookup for a cache line misses in I-cache 206, a request is sent to the memory subsystem for the cache line. The request is recorded in an entry in I-cache miss buffer 210. When a cache line is returned in response to an outstanding request, I-cache 206 can be updated and the entry in the I-cache miss buffer 210 can be cleared or invalidated.
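ITLB miss buffer 208 and I-cache miss buffer 210 follow the same record-then-fill pattern, sketched below with hypothetical names; the actual hardware structures are not code.

```python
class MissBufferSketch:
    """Record outstanding requests on a miss; clear the entry when
    the memory subsystem returns the requested data."""

    def __init__(self):
        self.outstanding = set()

    def record_miss(self, key):
        self.outstanding.add(key)    # request sent to the memory subsystem

    def on_fill(self, key, value, target):
        # Data returned: update the target structure (e.g., the ITLB
        # or the I-cache) and invalidate the miss-buffer entry.
        target[key] = value
        self.outstanding.discard(key)
```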
Fetch request generator 202 includes two output signals: FETCH_ADDRESS and VALID_FETCH. The FETCH_ADDRESS signal is a multi-bit signal (e.g., 24, 32, or 64 bits) for signaling an address that is to be used to fetch a cache line containing instructions from I-cache 206. In some embodiments, the address on FETCH_ADDRESS can be: (1) a physical address, which can be used to retrieve cache lines from I-cache 206 directly, or (2) a virtual address, which is translated into a physical address in ITLB 204 before the physical address is used to retrieve cache lines from I-cache 206.
The VALID_FETCH signal is a single-bit signal that can be used by fetch request generator 202 to indicate that the signal on FETCH_ADDRESS is a valid address signal that can be used to perform address translations and/or retrieve cache lines.
Flop 212 is a circuit element (e.g., a set-reset (SR) flip-flop) that is used to control whether fetch unit 120 is stalled and therefore not fetching instructions, or operating normally and therefore fetching instructions. When the FETCH_ENABLE output of flop 212 is deasserted (i.e., a logical “0”), AND gate 214 deasserts the VALID_FETCH output, which prevents ITLB 204 from translating a virtual address on FETCH_ADDRESS to a physical address, and prevents I-cache 206 from returning cache lines based on FETCH_ADDRESS and/or the physical address from ITLB 204.
As shown in
In some embodiments, the units in processor 102 are configured so that the SET and RESET signals cannot be asserted at the same time, thereby avoiding race conditions between the two signals. For example, in some embodiments, the units in processor 102 can assert the SET and RESET signals only for a predetermined time (i.e., the SET and RESET signals are strobed).
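The interaction of flop 212, AND gate 214, and the strobed SET/RESET signals can be modeled behaviorally as follows. This is a sketch, not the circuit; the assertion check merely encodes the assumption, stated above, that SET and RESET are never strobed simultaneously.

```python
class StallFlopSketch:
    """Model of flop 212 gating VALID_FETCH through AND gate 214."""

    def __init__(self):
        self.fetch_enable = True  # FETCH_ENABLE output of flop 212

    def strobe(self, set_signal=False, reset_signal=False):
        assert not (set_signal and reset_signal), "SET/RESET race"
        if reset_signal:
            self.fetch_enable = False  # stall the fetch unit
        elif set_signal:
            self.fetch_enable = True   # release the stall

    def valid_fetch(self, fetch_request_valid):
        # AND gate 214: VALID_FETCH is asserted only when the request
        # from fetch request generator 202 is valid AND FETCH_ENABLE
        # is asserted.
        return fetch_request_valid and self.fetch_enable
```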
Although we describe embodiments with separate SET and RESET signals, alternative embodiments can use different mechanisms and signals and/or combinations of signals to cause fetch unit 120 to stall in response to detecting an impending pipe-clearing event. For example, in some alternative embodiments, the SET and RESET signals are replaced by a single STALL_DISABLE signal (not shown). In these embodiments, upon detecting an indicator of a pipe-clearing event, a given unit in pipeline 112 can deassert (i.e., set to a logical “0”) the STALL_DISABLE signal. In these embodiments, the STALL_DISABLE signal can then be held deasserted by a weak keeper (not shown). When the unit or another unit in pipeline 112 subsequently determines that the pipe-clearing event has been resolved, the STALL_DISABLE signal can be asserted (thereby overriding the weak keeper) to release the stall. The STALL_DISABLE signal can then be held in the asserted state by the weak keeper. In these embodiments, the STALL_DISABLE signal can be directly fed to AND gate 214 to enable the stall control.
In addition, in some embodiments, flop 212 is replaced with another circuit element for controlling the stall of instruction fetch unit 120. As one possible example, using the single STALL_DISABLE signal described above, some embodiments can use an enable flop, which is a flop with a single data input and an enable signal. In these embodiments, by deasserting the enable signal, the flop can be disabled to prevent the stall from occurring. Otherwise, the enable flop passes the value on the STALL_DISABLE signal to AND gate 214.
The blocks and functional elements shown in
In the described embodiments, processor 102 supports scout mode. Generally, scout mode is a form of speculative execution during which processor 102 executes program code to prefetch cache lines and update other processor state, but does not commit results to the architectural state of processor 102.
While operating in scout mode, processor 102 speculatively executes the code from the point of the stall without committing results of the speculative execution to the architectural state of processor 102. If processor 102 encounters a memory reference while operating in scout mode, processor 102 determines if a target address for the memory reference can be resolved. If so, processor 102 issues a prefetch for the memory reference to load a cache line for the memory reference into a cache (e.g., L1 cache 104). In addition to issuing prefetches for cache lines, processor 102 can update other memories or processor structures for which an update can be resolved. For example, processor 102 can update translation lookaside buffer (TLB) page table entries, branch predictions in a branch prediction unit, and/or other memories or tables.
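The prefetch-generation behavior of scout mode might be sketched as below; the instruction representation and helper names are assumptions made for illustration, not the described hardware.

```python
def scout_prefetch(instructions, resolve_address, prefetch_queue):
    """Sketch of scout mode: for each memory reference whose target
    address can be resolved, issue a prefetch; no results are
    committed to the architectural state."""
    for instr in instructions:
        if instr.get("is_memory_reference"):
            address = resolve_address(instr)    # None if unresolvable
            if address is not None:
                prefetch_queue.append(address)  # issue the prefetch
        # Other resolvable updates (TLB page table entries, branch
        # predictions, etc.) could be applied here to "warm up"
        # processor structures.
```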
While processor 102 is operating in scout mode, when one of the units in pipeline 112 detects an impending “pipe-clearing” event (step 304), that unit can stall fetch unit 120 to prevent fetch unit 120 from fetching instructions during the pipe-clearing event (for details about how fetch unit 120 is stalled, and how the stall on fetch unit 120 is released, see
Generally, a pipe-clearing event is a condition encountered by the processor for which some or all of the stages of pipeline 112 are cleared. For example, one type of pipe-clearing event is a “pipe-clearing” instruction, which includes any instruction for which the processor clears the pipeline stages behind the instruction and keeps them clear until the instruction completes execution (e.g., the DONE, RETRY, WRPR, or WRY instructions in the SPARC processor architecture from SPARC International Inc., of Menlo Park, Calif., USA). Another example of a pipe-clearing event is an error condition or a trap. For example, upon encountering a divide instruction for which the denominator is zero, the processor generates a divide-by-zero trap, which clears pipeline 112 and branches to trap-handling code.
When monitor unit 130 determines that the stall condition has been resolved, the processor clears pipeline 112, restores the checkpoint, and resumes execution in a normal-execution mode (step 306). Note that restoring the checkpoint and resuming execution in the normal-execution mode can involve re-executing some or all of the instructions executed during scout mode.
In these embodiments, executing in scout mode past the stall condition enables processor 102 to “warm up” caches, memories, tables, and other processor structures using speculatively executed instructions. The warmed-up structures can subsequently be used to enable more efficient execution upon resuming execution in the normal-execution mode.
Although we use scout mode as an exemplary speculative-execution mode, alternative embodiments of processor 102 support different speculative-execution modes (e.g., execute-ahead mode, speculative branch prediction, etc.). The alternative embodiments operate in a similar way to the described embodiments.
Note that the described embodiments differ from existing systems that support scout mode in that processor 102 does not immediately terminate scout mode in the event that a pipe-clearing event occurs in processor 102. Instead, in these embodiments, processor 102 detects an impending pipe-clearing event and stalls fetch unit 120 at least until the pipe-clearing event is handled. In some of these embodiments, processor 102 can eventually resume operation in scout mode (see, e.g., the embodiment shown in
Note also that the described embodiments support a normal-execution mode, in which program code is executed in program order. Normal execution (i.e., operating in the normal-execution mode) is known in the art and hence is not described in detail.
In the embodiment shown in
Note that while instruction fetch unit 120 is stalled and hence not fetching instructions, instruction fetch unit 120 also does not send TLB requests, cache line requests, or other types of memory requests based on the fetched instructions. In addition, because instruction fetch unit 120 is stalled and therefore not forwarding instructions to be decoded and executed, the load-store execution unit in execution unit 124 does not send memory requests such as data cache line requests or data translation lookaside buffer (DTLB) requests. This can facilitate the described embodiments avoiding sending duplicative memory requests based on repeatedly executed instructions.
Recall that as described above, the processes shown in
As shown in
Note that, although we use the SET and RESET signals to describe these embodiments, in alternative embodiments, different signals and/or mechanisms can be used to stall fetch unit 120.
Upon monitor unit 130 determining that the stall condition that caused processor 102 to start operating in scout mode has been resolved, monitor unit 130 releases the stall on fetch unit 120 and signals processor 102 to restore the checkpoint and resume executing instructions in the normal-execution mode (step 402). In other words, upon monitor unit 130 detecting the resolution of the stall condition that caused processor 102 to start a scout mode episode, monitor unit 130 causes processor 102 to end the scout mode episode as described above. However, in addition to performing the operations for ending the scout mode episode, monitor unit 130 in processor 102 also asserts the SET signal to release the stall on fetch unit 120. In these embodiments, the stall of fetch unit 120 is released in conjunction with the resumption of operation in the normal-execution mode so that fetch unit 120 begins fetching instructions for execution in the normal-execution mode.
As shown in
Upon commit/trap unit 126 determining that the pipe-clearing event has been handled, commit/trap unit 126 asserts the SET signal to release the stall on fetch unit 120 and cause processor 102 to restore the checkpoint and resume executing instructions in the normal-execution mode (step 502). In other words, upon commit/trap unit 126 detecting that the pipe-clearing event has been handled, processor 102 ends the scout mode episode as described above. In these embodiments, commit/trap unit 126 releases the stall of fetch unit 120 in conjunction with the resumption of operation in the normal-execution mode so that fetch unit 120 begins fetching instructions for execution in the normal-execution mode.
In these embodiments, a pipe-clearing event has been handled as soon as the instruction, operating condition, or error that was detected by a unit in pipeline 112 as indicating an impending pipe-clearing event has caused the pipe-clear and/or has been retired from pipeline 112. For example, assuming that the pipe-clearing event is caused by a pipe-clearing instruction, the pipe-clearing event is considered handled as soon as the pipe-clearing instruction passes commit/trap unit 126 and is no longer in pipeline 112. As another example, when an instruction that caused a trap condition has been processed in commit/trap unit 126 (i.e., has caused a flush of pipeline 112 and finished processing), the pipe-clearing event is considered handled.
In some of the embodiments described with respect to
In these embodiments, the SET signal can be asserted whether or not there is an outstanding stall (note that asserting the SET signal when flop 212 is already “set” has no effect). In other words, in these embodiments, the unit in pipeline 112 automatically asserts the SET signal to ensure that any stall of fetch unit 120 is released before returning to normal-execution mode.
In some embodiments, processor 102 includes mechanisms for executing two or more strands (where a “strand” includes hardware mechanisms for executing a software thread). In some of these embodiments, fetch unit 120 includes mechanisms for controlling the fetching of instructions on a per-strand basis. These embodiments can include mechanisms to enable the stalling of fetch unit 120 for a given strand (i.e., a “per-strand” stall of the fetch unit). In these embodiments, upon detecting an impending pipe-clearing event for a given strand, the units in pipeline 112 can stall fetch unit 120 to prevent fetch unit 120 from fetching subsequent instructions for the corresponding strand until the pipe-clearing event has been handled for that strand. The per-strand stall can then be released, enabling fetch unit 120 to resume fetching instructions for the strand.
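For illustration, a per-strand version of the stall control might track one fetch-enable bit per strand, as in the following hypothetical sketch; the names are not taken from the described embodiments.

```python
class PerStrandStallSketch:
    """One fetch-enable bit per strand, so a pipe-clearing event on
    one strand does not block instruction fetch for the others."""

    def __init__(self, num_strands):
        self.fetch_enable = [True] * num_strands

    def stall(self, strand):
        self.fetch_enable[strand] = False  # per-strand stall

    def release(self, strand):
        self.fetch_enable[strand] = True   # pipe-clear handled for strand

    def may_fetch(self, strand):
        return self.fetch_enable[strand]
```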
In some embodiments, processor 102 supports one or more additional forms of speculative execution. Generally, these forms of speculative execution include any operational mode wherein processor 102 executes instructions speculatively and can return to a prior operating state (i.e., can return to a checkpoint, rewind a transaction, etc.) upon encountering a pipe-clearing event. For example, these speculative-execution modes can include execute-ahead mode, speculative branch-prediction mode, transactional execution mode, or other such speculative-execution modes.
Embodiments of processor 102 that support other forms of speculative execution operate in a similar way to the above-described embodiments. Specifically, upon detecting that a predetermined pipe-clearing event is going to occur, these embodiments can stall fetch unit 120 to prevent instructions from being fetched and executed until the pipe-clearing event has been handled by processor 102. Once the pipe-clearing event has been handled, these embodiments can release the stall on fetch unit 120 to enable the fetch unit to resume fetching instructions.
Some of the described embodiments include one or more mechanisms for releasing the stall of the fetch unit in case an event occurs that would otherwise prevent the processor from handling the pipe-clearing event and releasing the stall. For example, in the case of a branch misprediction, processor 102 immediately flushes pipeline 112 and jumps to a new location in the program code to begin fetching subsequent instructions. Because pipeline 112 flushes can remove the instruction and/or operating condition that caused a pending pipe-clearing event from pipeline 112, commit/trap unit 126 may never encounter/handle the pipe-clearing event, and thus may never release the stall on fetch unit 120 (deadlocking the processor). Thus, in some of the described embodiments, upon encountering the branch mispredict, processor 102 (e.g., monitor unit 130 or one of the units in pipeline 112) releases the stall on fetch unit 120. Although we use branch misprediction to describe these embodiments, other embodiments immediately release the stall based on other operating conditions, such as, but not limited to, the occurrence of another pipe-clearing event.
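One way to picture this deadlock-avoidance release in code, under the assumption that the processor tracks which instruction is expected to complete the pending pipe-clearing event (the function and its parameters are hypothetical):

```python
def fetch_enable_after_flush(fetch_enabled, pending_event, flushed_instructions):
    """Sketch of the release-on-flush safeguard: if a pipeline flush
    (e.g., from a branch misprediction) removed the instruction that
    was expected to complete the pending pipe-clearing event, release
    the stall immediately, since the commit/trap stage will never
    handle that event."""
    if not fetch_enabled and pending_event in flushed_instructions:
        return True   # release the stall to avoid deadlock
    return fetch_enabled
```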
The foregoing descriptions of embodiments have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the embodiments to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the embodiments. The scope of the embodiments is defined by the appended claims.