Disclosed aspects relate to the restart of an instruction pipeline included in a microprocessor. More particularly, some aspects are directed to the restart of the instruction pipeline of a processor that includes a decoupled fetcher.
Conditional execution of instructions is a conventional feature of processing systems. An example is a conditional instruction, such as a conditional branch instruction, where the direction taken by the conditional branch instruction may depend on how a condition gets resolved. For example, a conditional branch instruction may be represented as, "if <condition1> jump1," wherein, if condition1 evaluates to true, then operational flow of instruction execution jumps to a target address specified by the jump1 label (this scenario may also be referred to as the branch instruction (jump1) being "taken"). On the other hand, if condition1 evaluates to false, then the operational flow may continue to execute the next sequential instruction after the conditional branch instruction, without jumping to the target address. This scenario is also referred to as the branch instruction not being taken, or being "not-taken." Under certain instruction set architectures (ISAs), instructions other than branch instructions may be conditional, where the behavior of the instruction depends on the related condition.
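For illustration only, the taken/not-taken behavior described above can be modeled with a small interpreter. The following is a minimal sketch in Python; the three-field instruction format, opcode names, and register names are invented for illustration and are not drawn from any particular ISA:

```python
# Minimal sketch (hypothetical instruction format) of how a conditional branch
# redirects instruction flow when its condition resolves true.
program = [
    ("set",  "r0", 5),          # r0 = 5
    ("brnz", "r0", 4),          # if r0 != 0, jump to index 4 ("taken")
    ("set",  "r1", 1),          # skipped when the branch is taken
    ("halt", None, None),
    ("set",  "r1", 2),          # branch target (the "jump1" label)
    ("halt", None, None),
]

regs, pc = {}, 0
while True:
    op, a, b = program[pc]
    if op == "set":
        regs[a] = b
        pc += 1
    elif op == "brnz":          # conditional branch: taken vs. not-taken
        pc = b if regs[a] != 0 else pc + 1
    elif op == "halt":
        break

print(regs)                     # {'r0': 5, 'r1': 2}: the branch was taken
```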
In general, how the condition of a conditional instruction will be resolved is unknown until the conditional instruction is executed. Waiting until the conditional instruction executes to determine the condition can impose undesirable delays in modern processors, which are configured for parallel and out-of-order execution. The delays are particularly disruptive in the case of conditional branch instructions, because the direction in which the branch instruction resolves determines the operational flow of the instructions which follow it.
In order to improve instruction level parallelism (ILP) and minimize delays, modern processors may include mechanisms to predict the resolution of the condition of conditional instructions prior to their execution. For example, branch prediction mechanisms are implemented to predict whether the direction of the conditional branch instruction will be taken or not-taken before the conditional branch instruction is executed. If the prediction turns out to be erroneous, the instructions which were speculatively executed based on the incorrect prediction will be flushed. This results in a penalty known as the branch misprediction penalty. If the prediction turns out to be correct, then no branch misprediction penalty is incurred.
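The cost of this penalty can be expressed as a simple expected-value calculation. The sketch below is illustrative only; the 5% misprediction rate and the 12-cycle flush penalty are assumed figures, not values from this disclosure:

```python
# Minimal sketch of how the branch misprediction penalty affects the average
# cost of executing a branch; rates and cycle counts are illustrative.
def cycles_per_branch(mispredict_rate, penalty_cycles, base_cycles=1):
    # Correct predictions cost only the base cycle(s); each misprediction adds
    # the cost of flushing and refilling the pipeline.
    return base_cycles + mispredict_rate * penalty_cycles

print(cycles_per_branch(0.05, 12))   # 1.6 cycles per branch on average
print(cycles_per_branch(0.00, 12))   # 1.0: no misprediction penalty at all
```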
Branch prediction mechanisms may be static or dynamic. Branch prediction itself adds latency to a pipeline, otherwise known as the branch prediction penalty. When an instruction is fetched from an instruction cache and processed in an instruction pipeline, branch prediction mechanisms must determine whether the fetched instruction is a conditional instruction and whether it is a branch instruction, and then make a prediction of the likely direction of the conditional branch instruction. It is desirable to minimize stalls or bubbles related to the process of branch prediction in an instruction execution pipeline. Therefore, branch prediction mechanisms strive to make a prediction as early in the instruction pipeline as possible.
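As one example of a dynamic mechanism, many processors use a table of 2-bit saturating counters indexed by the branch PC. The sketch below illustrates that general scheme; it is an assumption for illustration, not a predictor mandated by this disclosure:

```python
# Minimal sketch of a classic dynamic branch predictor: a table of 2-bit
# saturating counters indexed by the (word-aligned) branch PC.
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries    # start at "weakly not-taken"

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset, wrap to table size

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True => predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = TwoBitPredictor()
for outcome in [True, True, False, True]:  # one branch resolving over time
    print(bp.predict(0x400), outcome)      # prediction vs. actual direction
    bp.update(0x400, outcome)
```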
In the pursuit of instruction-level parallelism, significant demands are placed on a processor's instruction delivery mechanism. Delivering the performance necessary to meet future processor execution targets requires that the performance of the instruction delivery mechanism scale with the execution core. Attaining these targets is a challenging task due to I-cache misses, branch mispredictions, and taken branches in the instruction stream. To counter these challenges, some architectures decouple the branch predictor from the instruction fetch unit. A Fetch Target Queue (FTQ) is inserted between the branch predictor and instruction cache. This allows the branch predictor to run far in advance of the address currently being fetched by the cache. The decoupling enables a number of architecture optimizations, including multilevel branch predictor design, fetch-directed instruction prefetching, and easier pipelining of the instruction cache.
For example, some modern microprocessors may decouple instruction fetching from fetch address generation (including branch prediction), allowing fetch address generation to run ahead and enqueue many future fetch addresses in a decoupling queue (e.g., an FTQ). By scanning this queue, prefetch requests can be issued to bring instructions that will soon be needed into the instruction cache, improving performance. However, this lengthens the pipeline and increases the pipeline restart latency (e.g., after a branch misprediction), as any fetch address must go through the address generation (a.k.a. Decoupled Fetcher, or DCF) stages before going through the Instruction Unit (IU) stages. This restart latency may be a contributor to performance degradation.
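The decoupling described above can be sketched as a producer/consumer arrangement around the decoupling queue. In the Python sketch below, the queue structure, the fixed 16-byte fetch granule, and the prefetch model are illustrative assumptions:

```python
from collections import deque

# Minimal sketch of decoupled fetch: address generation (the DCF role) runs
# ahead, enqueuing fetch addresses into a decoupling queue (FTQ/FAQ); a
# prefetcher scans the queue while the fetch stage drains it.
FETCH_BYTES = 16

def address_generator(start_pc, n):
    """Stand-in for the DCF: produce n sequential fetch addresses."""
    for i in range(n):
        yield start_pc + i * FETCH_BYTES

ftq = deque()
prefetched = set()

# The DCF runs ahead, enqueuing many future fetch addresses.
for addr in address_generator(0x1000, 8):
    ftq.append(addr)

# The prefetcher scans the queue and issues prefetches for lines not in cache.
for addr in ftq:
    if addr not in prefetched:
        prefetched.add(addr)          # model: bring the line into the I-cache

# The fetch stage (IU role) drains the queue; its lines are already prefetched.
while ftq:
    addr = ftq.popleft()
    assert addr in prefetched         # hit, thanks to fetch-directed prefetch
    print(f"fetch 0x{addr:x}")
```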
The following summary is an overview provided solely to aid in the description of various aspects of the disclosure and is provided solely for illustration of the aspects and not limitation thereof.
In accordance with aspects of the disclosure, a method is provided. The method may comprise detecting, in a processor, a re-fetch event, wherein the processor includes an instruction unit (IU) configured to fetch instructions from a decoupled fetcher (DCF), and simultaneously flushing the IU and the DCF in response to detecting the re-fetch event.
In accordance with other aspects of the disclosure, an apparatus is provided. The apparatus may comprise a processor that includes an instruction unit (IU) configured to fetch instructions from a decoupled fetcher (DCF). The processor may be configured to detect a re-fetch event and simultaneously flush the DCF and the IU in response to detecting the re-fetch event.
In accordance with yet other aspects of the disclosure, another apparatus is provided. The apparatus may comprise means for detecting a re-fetch event and means for simultaneously flushing an instruction unit (IU) and a decoupled fetcher (DCF) in response to detecting the re-fetch event.
In accordance with yet other aspects of the disclosure, a non-transitory computer-readable medium is provided. The non-transitory computer-readable medium may comprise at least one instruction for causing a processor to perform operations, comprising code for detecting a re-fetch event and code for simultaneously flushing an instruction unit (IU) and a decoupled fetcher (DCF) in response to detecting the re-fetch event.
The accompanying drawings are presented to aid in the description of aspects of the invention and are provided solely for illustration of the aspects and not limitation thereof.
Aspects of the invention are disclosed in the following description and related drawings directed to specific aspects of the invention. Alternate aspects may be devised without departing from the scope of the invention. Additionally, well-known elements of the invention will not be described in detail or will be omitted so as not to obscure the relevant details of the invention.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects. Likewise, the term “aspects of the invention” does not require that all aspects of the invention include the discussed feature, advantage or mode of operation.
The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting of aspects of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “includes” and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Further, many aspects are described in terms of sequences of actions to be performed by, for example, elements of a computing device. It will be recognized that various actions described herein can be performed by specific circuits (e.g., application specific integrated circuits (ASICs)), by program instructions being executed by one or more processors, or by a combination of both. Additionally, these sequences of actions described herein can be considered to be embodied entirely within any form of computer-readable storage medium having stored therein a corresponding set of computer instructions that upon execution would cause an associated processor to perform the functionality described herein. Thus, the various aspects of the invention may be embodied in a number of different forms, all of which have been contemplated to be within the scope of the claimed subject matter. In addition, for each of the aspects described herein, the corresponding form of any such aspects may be described herein as, for example, "logic configured to" perform the described action.
Exemplary aspects are directed to speeding up the restart of an instruction pipeline which follows the detection of a re-fetch event, such as a branch misprediction or a branch target buffer (BTB) miss. For example, in some aspects, to limit the impact of the increased redirection penalty, an instruction set architecture (ISA) may be implemented using a micro-architecture in which both the instruction unit (IU) and the decoupled fetcher (DCF) restart concurrently after a redirect signal (e.g., branch misprediction or miss in the DCF structures). That is, rather than have the IU wait for the DCF to generate fetch addresses, as it would in a regular pipelined architecture, both the IU and DCF may be restarted simultaneously. In one aspect, an advantage of such a micro-architecture may be that the additional pipeline restart latency may be hidden, improving performance.
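The latency benefit of the concurrent restart can be seen with a back-of-the-envelope model. In the sketch below, the stage depths are illustrative assumptions; the point is only that a concurrent restart removes the DCF pipeline depth from the critical redirect path:

```python
# Minimal sketch contrasting the two restart policies. Stage depths are
# illustrative assumptions, not figures from this disclosure.
DCF_DEPTH = 4   # cycles for the DCF to produce its first post-redirect address
IU_DEPTH  = 3   # cycles for the IU to deliver the first post-redirect instruction

def restart_latency_serial():
    # Conventional pipelining: the IU waits for the DCF to generate addresses,
    # so the two depths add on every redirect.
    return DCF_DEPTH + IU_DEPTH

def restart_latency_concurrent():
    # Disclosed approach: flush and restart both units at once. The IU begins
    # fetching at the redirect PC immediately (self-feed), so the first
    # instruction arrives after only the IU stages, provided no branch needing
    # a DCF prediction is encountered first; the DCF refill latency is hidden.
    return IU_DEPTH

print(restart_latency_serial())      # 7 cycles
print(restart_latency_concurrent())  # 3 cycles: DCF depth is off the fast path
```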
The inclusion of the DCF 102 separate from the IU 104 means that the operations of instruction fetching are decoupled from fetch address generation (including branch prediction). This allows fetch address generation (i.e., by the DCF 102) to run ahead and enqueue many future fetch addresses in the FAQ 106.
However, as mentioned above, the lengthening of the pipeline due to the inclusion of the DCF 102 increases the pipeline restart latency (e.g., after a branch misprediction) and also introduces a new potential flush source, such as a miss in the BTB 108. For example, in some cases, a miss in the BTB 108 may cause a flush if the missed block contains a branch instruction, whereas it may not trigger a flush if the missed block did not in fact contain a branch. The increase in latency may be due to the fact that any fetch address must go through the address generation (e.g., DCF 102) stages before going through the IU 104 stages. This restart latency may be a contributor to performance degradation.
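The BTB-miss flush condition described above reduces to a simple predicate: a miss forces a flush only when the fall-through assumption made in the absence of a BTB entry could actually be wrong. A minimal sketch follows, with the branch-presence check (which in practice would occur around decode time) as an assumed placement:

```python
# Minimal sketch of the BTB-miss flush condition: a miss only forces a flush
# when the missed block turns out to contain a branch, since only then could
# the run-ahead (fall-through) fetch addresses be wrong.
def needs_flush_on_btb_miss(block_missed_in_btb: bool,
                            block_contains_branch: bool) -> bool:
    # If the BTB had no entry for the block, the DCF assumed fall-through.
    # That assumption is only wrong when the block actually holds a branch.
    return block_missed_in_btb and block_contains_branch

print(needs_flush_on_btb_miss(True, True))    # True:  flush and re-fetch
print(needs_flush_on_btb_miss(True, False))   # False: fall-through was correct
print(needs_flush_on_btb_miss(False, True))   # False: BTB covered the branch
```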
In one aspect, the IU 304 has no branch prediction capability. Therefore, after a fast restart, the IU 304 is configured to wait for the DCF 302 as soon as it encounters a branch that must be predicted. That is, the IU 304 may be configured to stop fetching at the first conditional or indirect branch, as the IU 304 has no prediction of what the next PC is. This may be sufficient to cover the additional restart latency due to decoupling instruction fetching from branch prediction.
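This stop-at-first-branch policy can be sketched as follows; the decoder stub, block size, and addresses below are illustrative assumptions:

```python
# Minimal sketch of the self-feed policy in micro-architecture 300: after a
# fast restart, the IU fetches sequentially and stops at the first conditional
# or indirect branch, because without a predictor it has no next PC.
FETCH_BYTES = 16

def decode_kind(addr):
    """Stub standing in for the decoder 308: classify the fetched block."""
    return {0x2020: "conditional", 0x2040: "indirect"}.get(addr, "sequential")

def self_feed(restart_pc, max_blocks=16):
    pc = restart_pc
    fetched = []
    for _ in range(max_blocks):
        fetched.append(pc)
        if decode_kind(pc) in ("conditional", "indirect"):
            break           # no prediction available: wait for the DCF to catch up
        pc += FETCH_BYTES   # self-feed: next sequential PC drives the I-cache
    return fetched

print([hex(a) for a in self_feed(0x2000)])   # ['0x2000', '0x2010', '0x2020']
```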
The MUX 322 selects where the IU 304 gets the address (PC) that drives the I-Cache 306. The inputs to the MUX 322 are the fetch address from the DCF 302 (the IU 304 is fed by the DCF 302 in DCF mode) and the sequential PC following the last PC used to access the I-Cache 306 (the IU 304 is in self-feed mode). The MUX 322 is driven by a latch unit 310 that is set when there is a flush request and reset when a branch goes out of the decoder 308. In some aspects, at that time, the IU 304 may have to scrub some of its state because of pipelining effects, which amounts to killing in-flight I-Cache accesses (in case the I-Cache 306 is pipelined).
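The select-and-latch behavior just described can be modeled in a few lines; the class, method, and signal names below are illustrative, not taken from the disclosure:

```python
# Minimal sketch of the PC-select path around the MUX 322: a latch set by a
# flush request puts the IU in self-feed mode, and the first branch leaving
# the decoder resets it back to DCF mode.
class PcSelect:
    def __init__(self):
        self.self_feed = False     # latch unit 310: set on flush, reset at decode

    def on_flush_request(self):
        self.self_feed = True      # also kill in-flight I-cache accesses here

    def on_branch_decoded(self):
        self.self_feed = False     # decoder saw a branch: hand control to the DCF

    def next_pc(self, dcf_pc, last_pc, fetch_bytes=16):
        # MUX 322: choose the DCF-supplied address or the sequential PC.
        return last_pc + fetch_bytes if self.self_feed else dcf_pc

sel = PcSelect()
sel.on_flush_request()
print(hex(sel.next_pc(dcf_pc=0x9000, last_pc=0x3000)))   # 0x3010 (self-feed)
sel.on_branch_decoded()
print(hex(sel.next_pc(dcf_pc=0x9000, last_pc=0x3010)))   # 0x9000 (DCF mode)
```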
In the micro-architecture 400, the IU 404 may include a simple branch prediction infrastructure (i.e., branch predictor 432), allowing the IU 404 to continue fetching without waiting for predictions provided by the DCF 402. Thus, the IU 404 may generate its own fetch PCs until the DCF 402 catches up after a restart. However, this means that the IU 404 and the DCF 402 can disagree, leading to divergence. Divergence is detected by comparing two bitvectors 436 tracking taken branches (one for the DCF 402, by way of DCF tracking register 426, and one for the IU 404, by way of IU tracking register 428). Any mismatch triggers a partial flush to remove instructions fetched by the IU 404 that are not on the path suggested by the DCF 402. Restarting from this flush is fast, as the DCF 402 already has correct fetch addresses enqueued in the decoupling queue (i.e., FAQ 414). To resynchronize, the two bitvectors 436 continue to be maintained by way of the tracking registers 426 and 428; as backpressure stalls fetching by the IU 404, the bitvectors 436 will line up, and decoupled fetching may then be resumed.
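The divergence check can be sketched as a prefix comparison of the two tracking bitvectors; the vector encoding (one entry per tracked branch, 1 for taken) and the comparison over the overlapping prefix are illustrative assumptions:

```python
# Minimal sketch of divergence detection in micro-architecture 400: the DCF
# and the IU each record the taken/not-taken outcome of the branches they
# process, and any mismatch over the compared prefix triggers a partial flush
# of the wrong-path instructions fetched by the IU.
def diverged(dcf_vec, iu_vec):
    """Compare the overlapping prefix of the two taken-branch bitvectors."""
    n = min(len(dcf_vec), len(iu_vec))
    return dcf_vec[:n] != iu_vec[:n]

dcf_tracking = [1, 0, 1]    # DCF (tracking register 426): taken, not-taken, taken
iu_tracking  = [1, 0]       # IU  (tracking register 428): agrees so far
print(diverged(dcf_tracking, iu_tracking))   # False: keep fetching

iu_tracking.append(0)       # the IU's own predictor guessed not-taken
print(diverged(dcf_tracking, iu_tracking))   # True: partial flush of the IU path
# Restart is fast: the DCF already holds the correct addresses in the FAQ 414.
```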
The operation of micro-architecture 400 is similar to micro-architecture 300 described above, in that IU 404 may operate in either a self-feed mode or in a DCF mode. However, in the self-feed mode of the IU 404, the next PC 424 is generated using the branch predictor 432 and branch targets coming out of the decoder 408.
In response to detecting a re-fetch event, a trigger is generated to simultaneously flush both the DCF 402 and the IU 404. At flush time, the two head pointers 438 (head pointer_DCF and head pointer_FETCH) are reset to point to the top of the respective tracking vectors (not shown).
In some aspects, micro-architecture 400 may include two additional queues (not shown).
In a process block 902, both the DCF 402 and the IU 404 are simultaneously flushed. In a process block 906, the tracking vectors are initialized. In one aspect, initializing the tracking vectors may include resetting the bits in the DCF tracking register 426 and in the IU tracking register 428. Next, in a process block 908, the IU 404 enters the self-feed mode, where the IU 404 fetches instructions from the I-Cache 406 and performs its own branch prediction. In one aspect, the IU 404 continues fetching instructions in the self-feed mode until a tracking vector mismatch is detected.
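The flow through process blocks 902, 906, and 908 can be summarized as a small state machine; the event names and the event streams below are illustrative assumptions:

```python
# Minimal sketch of the restart flow as a state machine over re-fetch events.
def restart_flow(events):
    mode = "dcf_mode"
    for ev in events:
        if ev == "re_fetch_event":
            dcf_vec, iu_vec = [], []    # 902: flush both; 906: init tracking vectors
            mode = "self_feed"          # 908: IU fetches and predicts on its own
        elif ev == "vector_mismatch" and mode == "self_feed":
            mode = "partial_flush"      # drop IU instructions off the DCF path
        elif ev == "vectors_aligned":
            mode = "dcf_mode"           # resynchronized: resume decoupled fetching
    return mode

print(restart_flow(["re_fetch_event", "vector_mismatch"]))   # partial_flush
print(restart_flow(["re_fetch_event", "vectors_aligned"]))   # dcf_mode
```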
In a particular aspect, input device 1030 and power supply 1044 are coupled to the SoC device 1022.
In some aspects, the SoC device 1022 is a wireless communications device. However, in other aspects, processor 1002 and memory 1010 may also be integrated into a set-top box, a music player, a video player, an entertainment unit, a navigation device, a personal digital assistant (PDA), a fixed location data unit, a computer, a laptop, a tablet, a mobile phone, or other similar devices.
Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Thus, a computing device may include several components that may be employed according to various aspects of the present disclosure. Such a computing device may include modules that incorporate any of the aforementioned micro-architectures, such as micro-architecture 300 or micro-architecture 400 described above.
For example, a first module may be for detecting a re-fetch event in a processor, such as processor 1002, and a second module may be for simultaneously flushing the IU and the DCF in response to detecting the re-fetch event.
The methods, sequences and/or algorithms described in connection with the aspects disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
Accordingly, an aspect of the invention can include a computer-readable medium embodying a method for restart of an instruction pipeline. Accordingly, the invention is not limited to the illustrated examples, and any means for performing the functionality described herein are included in aspects of the invention.
While the foregoing disclosure shows illustrative aspects of the invention, it should be noted that various changes and modifications could be made herein without departing from the scope of the invention as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the aspects of the invention described herein need not be performed in any particular order. Furthermore, although elements of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
The present Application for Patent claims the benefit of U.S. Provisional Patent Application No. 62/588,283, entitled “FAST PIPELINE RESTART IN PROCESSOR WITH DECOUPLED FETCHER,” filed Nov. 17, 2017, pending, assigned to the assignee hereof, and hereby expressly incorporated herein by reference in its entirety.