In modern processors, so-called user-level instructions, namely instructions of an instruction set architecture (ISA), are in the form of macro-instructions. Due to the complexity of the instruction set, these instructions are not directly executed by the processor hardware. Instead, each macro-instruction is typically translated into a series of one or more micro-operations (uops), and it is these uops that are directly executed by the hardware. The one or more micro-operations corresponding to a macro-instruction are referred to as the microcode flow for that macro-instruction. The combined execution of all the flow's uops produces the overall results (e.g., as reflected in registers, memory, etc.) architecturally specified for that instruction. The translation of a macro-instruction into one or more uops is associated with the instruction fetch and decode portion of a processor's overall pipeline.
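Purely as an illustrative sketch (the instruction and uop names below are hypothetical, not part of any embodiment), the translation of a macro-instruction into its microcode flow can be pictured as a table lookup that expands one macro-instruction into the one or more uops the hardware actually executes:

```python
# Hypothetical macro-instruction -> microcode flow mapping (names are illustrative).
MICROCODE_FLOWS = {
    # a simple macro-instruction expands to a single uop
    "ADD r1, r2": ["uop_add r1, r2"],
    # a complex macro-instruction expands to a flow of several uops whose
    # combined execution produces the architecturally specified result
    "PUSH r1": ["uop_sub rsp, 8", "uop_store [rsp], r1"],
}

def decode(macro_instruction):
    """Translate a macro-instruction into its microcode flow (list of uops)."""
    return MICROCODE_FLOWS[macro_instruction]
```

For example, `decode("PUSH r1")` yields the two-uop flow whose combined effect is the architectural push.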
In modern out-of-order processors, the microcode that includes the uops of the microcode flows is stored in a read only memory (ROM) of the processor, referred to as a uROM. Reading of microcode out of uROM is tied to a microcode sequencer (MS) pipeline of the processor. While the location of this ROM within the processor provides for minimal latency in accessing uops therefrom, its read only nature prevents updates to the microcode and further places a practical limit on the size of the available microcode.
In various embodiments, microcode may be stored in architecturally addressable storage space (such as system memory, e.g., a dynamic random access memory (DRAM)), within and/or outside of a processor. The exact location of such storage can vary in different implementations, but in general may be anywhere within a memory system, e.g., from cache storage within the processor to mass memory of a system. A virtualized microcode sequencer (MS) mechanism may be used to fetch and sequence microcode stored in this architecturally addressable space.
In modern processors, generally speaking, the MS holds all complex microcode flows in microcode read only memory (uROM) and random access memory (uRAM) (which stores patch microcode) and is responsible for sequencing and participating in executing these microcode flows. By performing microcode sequencer virtualization in accordance with an embodiment of the present invention, “microcode” can be stored in any architecturally addressable memory space. The term “microcode” as used herein thus refers both to microcode flows conventionally stored in MS uROM and uRAM, traditionally referred to as the microcode, and to microcode flows in accordance with an embodiment of the present invention stored there or elsewhere and which can be generated in some implementations through other means, such as binary translation, static compilation, or manual authoring (e.g., to emulate or implement new instructions or capabilities on an existing implementation), etc. Such microcode flows may use the same microcode instruction set that is used to implement those stored in uROM. However, they may be stored in different places in the architecturally addressable memory hierarchy. A virtualized microcode sequencer enables the MS of a processor to fetch and sequence both existing uROM microcode as well as new microcode flows stored elsewhere. The virtualized microcode sequencer leverages an instruction fetch unit (IFU) of the processor to fetch “microcode” stored in the architecturally addressable space into the machine and to cache it into an instruction cache, and may have different designs in different implementations.
In some embodiments, a conventional uROM may be completely removed from the MS and the image stored in this uROM may instead be placed in an architecturally addressable memory space. The location of this space can vary, and may be present in another storage of a processor or outside of a processor, either in a system memory, mass storage, or other addressable storage device. In this way, a virtualized microcode sequencer mechanism can support a full micro-instruction set that is complete in functionality. In other implementations, it is possible to implement a hybrid arrangement such that additional microcode (apart from a uROM and potentially a uRAM) can be stored in another part of a memory hierarchy.
Referring now to
Referring now to
As seen in
An instruction queue 260 may store incoming instruction bytes prior to their being decoded in an instruction decoder 265. The instruction decoder is further coupled to a microcode sequencer 270. Note that in various embodiments, this microcode sequencer may not include any uROM, in contrast to a conventional microcode sequencer. However, in a hybrid implementation, at least some amount of microcode can be stored in a uROM of microcode sequencer 270, as will be discussed further below. The outputs of the microcode sequencer 270 and instruction decoder 265 and a macro alias register (MAR) 268 may be provided to a backend selector 280, which resolves aliases and provides the selected instructions to decoded instruction queue 285. In turn, instructions corresponding to decoded uops may be passed to further portions of a processor pipeline. For example, from this front end unit, decoded instructions may be provided to processing logic of the processor. In yet other implementations, the decoded instructions may be provided to an out-of-order unit, to issue uops to execution units out of program order.
As further shown in
As seen in
Macro-instructions are read out of IQ 260 and then are decoded into uops by decoders in instruction decoder (ID) 265. ID 265 provides uops for simple macro-instructions whose microcode flows require no more than a predetermined minimal number of uops (e.g., 4). In contrast, microcode sequencer (MS) 270 sequences uops for complex macro-instructions whose microcode flows require more than the predetermined minimal number of uops. Uops produced by both ID 265 and MS 270 may be written into instruction decode queue (IDQ) 285, where they are read out and sent to subsequent pipeline stages.
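The routing decision between ID 265 and MS 270 can be sketched as follows; this is a minimal illustration of the threshold rule just described, using the example threshold of 4 uops from the text:

```python
UOP_THRESHOLD = 4  # predetermined minimal number of uops (example value from the text)

def route(flow_length):
    """Route a macro-instruction by the length of its microcode flow:
    simple flows (<= threshold uops) are produced by the instruction decoder (ID);
    complex flows (> threshold uops) are sequenced by the microcode sequencer (MS)."""
    if flow_length <= UOP_THRESHOLD:
        return "instruction_decoder"
    return "microcode_sequencer"
```

Either source writes its uops into the same decode queue, so subsequent pipeline stages see a single uop stream.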
To enable accessing microcode from wherever it may be stored in a hierarchy of an addressable memory subsystem, the MS is invoked when a complex instruction is encountered. Given that in various embodiments, some or all of the uROM is removed from MS 270 and the uROM image resides in the memory space, a request engine of the MS may convert a uROM read into an IFU fetch request. That is, each complete transaction has a predefined entry point in the uROM. This is used when the decoder detects a CISC instruction. After that, an MS jump execution unit (JEU) and UIP recycle logic provide the source of the next uop, as described further below. As seen in
Note that in various embodiments, the instruction cache lookup for a MS microcode fetch request is the same as that for a macro-instruction code fetch request. If the lookup hits instruction cache 220, the data read out of the cache is considered as valid and is forwarded to the subsequent pipeline stages. If instead the lookup misses the cache, the IFU may become responsible for acquiring the data from memory through streaming buffer (SB) 225, which interfaces with an IFU/memory execution unit (MEU) interface 230.
If the lookup is for a MS microcode fetch, the data read out of instruction cache 220 is steered to MS 270. The normal macro-instruction path, e.g., length decode and steering, does not see the data. If instead the lookup is for a macro-instruction code fetch, the data read out of the instruction cache is steered to the above path and MS 270 does not see the data.
In the case of a cache miss, the IFU stalls the pipeline and waits for the data. If the lookup was for a macro-instruction code fetch, the IFU detects the return of data from memory and re-issues the original code fetch into the pipeline. If the lookup was for a MS microcode fetch, the MS request engine is responsible for detecting the return of data from memory and re-injecting the MS microcode fetch request into the IFU pipeline. To this end, the MS request engine monitors: (1) cache hit/miss for each granted MS request through the IFU pipeline; and (2) an SB data read signal. The MS request engine holds a request that misses in the instruction cache and resends it to the IFU through the same path by which it was last sent, such that the IFU stalls.
Note that iTLB 210 may be bypassed for MS fetch requests due to the need for fetching reset microcode and other initialization microcode. As a result, the MS request engine may generate the corresponding physical address. Note that bypass is associated only with MS fetch requests and thus is transient, enabling the IFU to service interleaved MS fetch requests and macro-instruction code fetch requests.
To support a virtualized microcode sequencer, the following capabilities may be present: decoupling uROM data delivery from the MS pipeline; multiplexing IFU pipeline in time to fetch both macro-instruction and MS microcode fetches; and sharing the instruction cache for both macro-instructions and microcode.
In contrast to conventional MS implementations that rely on fixed uROM read latency (as the uROM read operations are tightly designed into the MS pipeline), embodiments may have a variable and longer latency. That is, where there is not a local uROM, such as where microcode comes from RAM external to the core, the latency can be much higher. Embodiments may provide for logic within the MS pipeline to accommodate the situation where micro-ops are delivered to the pipeline with a variable and long latency.
This logic may implement various operations to accommodate this latency. First, a microcode instruction pointer (UIP) is recycled until microcode data is delivered to the MS pipeline. This uop recycling mechanism may handle IQ thread allocation and the uop queue allocation algorithm, as well as the interaction between MS uROM read redirection and micro-operation execution. To this end, a uop valid mechanism may cause uops to appear invalid to subsequent pipeline stages until the actual uop data is delivered to the MS pipeline. Still further, the logic may initiate multiple IFU requests to assemble one uROM line to be delivered into the MS pipeline.
Sharing the IFU pipeline for both macro-instruction and microcode fetch can be realized in part by implementing the front end selector or multiplexer to select between the LIP (for macro-instructions) and the MS request's physical address for the next cache lookup. Then, after the cache lookup, steering logic 250 may steer microcode data to MS 270, while macro-instruction code data is steered to the normal macro-instruction path, e.g., the instruction length decoder and instruction steering vector generator. ITLB lookups for MS requests may be bypassed while keeping the instruction cache tag lookup valid. A physical address can be used to access microcode RAM space. To further enable multiplexing of the two types of instruction information, the IFU pipeline may service MS requests while the IFU is either in a “sleep” mode or in a “stalled” mode. These modes are preserved while servicing MS requests so that when the MS request is completed, the remainder of the IFU would still consider itself to be either in “sleep” mode (e.g., immediately after hardware reset) or in “stalled” mode. In addition, the IFU may provide for nested stage 1 stalls. In this way, a nested macro-instruction iTLB miss and microcode instruction cache/streaming buffer (IC/SB) miss, a nested macro-instruction iTLB fault and microcode IC/SB miss, and a macro-instruction uncacheable (UC) memory type fetch and microcode IC/SB miss can occur to bring in the service microcode flow that handles the iTLB miss/fault/UC fetch. Here, the MS request engine detects such a case and makes the SB available for fetching service microcode. Still further, the streaming buffer may be made available for fetching microcode from memory when a macro-instruction code fetch is UC and splits across two cache lines, in which case both SB entries are occupied and are released by the microcode service flow.
Embodiments may further re-order the arrival time of tag and data when the streaming buffer fills the instruction cache to ensure that concurrent cache lookup will get the correct data for a hit when victim cache (VC) 215 is disabled. Finally, the MS may issue an IFU request only after the machine becomes quiescent, e.g., when the pipeline is drained, the IFU stalled for macro-instruction fetch, and so forth.
To decouple uROM data delivery from the MS pipeline, a uROM image physical base register may specify the offset of the base of the uROM image in physical address space. This register may be stored in a fuse block and downloaded to the core during reset. Note that this register may provide for an address space translation from the 14-bit uop instruction pointer (UIP) space to the 46-bit physical address space. In one embodiment, each uROM line may store 40 bytes of data. To ease the address space translation, each uROM line may occupy one 64-byte cache line. Therefore the address space translation enables treating the 14-bit UIP as a 64-byte cache line pointer.
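The address space translation just described can be sketched concretely; the base register value below is a placeholder (the actual value would be fuse-programmed), but the arithmetic follows directly from the text: one uROM line per 64-byte cache line, so the 14-bit UIP is scaled by 64 and offset from the image base within the 46-bit physical space:

```python
UROM_IMAGE_PHYS_BASE = 0x1_0000_0000  # hypothetical fuse-programmed base register value

def uip_to_physical(uip):
    """Translate a 14-bit UIP into a 46-bit physical address.
    Each uROM line (40 bytes of uop data) occupies one 64-byte cache line,
    so the UIP acts as a 64-byte cache-line index off the uROM image base."""
    assert 0 <= uip < (1 << 14), "UIP is a 14-bit pointer"
    pa = UROM_IMAGE_PHYS_BASE + (uip << 6)  # uip * 64 bytes per cache line
    assert pa < (1 << 46), "result must fit in the 46-bit physical address space"
    return pa
```

For instance, UIP 1 maps to the cache line 64 bytes above the image base.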
Various states may be incorporated in a request engine of a microcode sequencer to decouple uROM data delivery from the microcode sequencer. A first indicator, referred to as Waiting_uROM_Line, may be used to indicate if the MS pipeline is waiting for data, i.e., a uROM line. If a MS fetch request hits in the instruction cache, it takes a few cycles for the MS request engine to receive the data and then deliver the data to the pipeline. If the MS fetch request misses the instruction cache, it can take a variable number of cycles (e.g., much greater than 2) for the MS request engine to receive the data. This wait state can be denoted by this first indicator. This state may be cleared once the MS request engine delivers the received microcode line to the MS pipeline. In various embodiments, the MS request engine does not make another IFU fetch request until the microcode line for the previous request is received and delivered to the MS pipeline. This first indicator is thus used to indicate that there is an outstanding uROM fetch.
A second indicator, referred to as Wait Valid, may be used to indicate whether the first indicator is valid. That is, it may take many cycles for a uROM line to be delivered to the MS pipeline, as explained above. Due to the fact that uop execution and thread selection occur in parallel with uROM line fetching, it is possible that the uROM line being fetched becomes invalid due to branches, events, and thread selection changes before the line is received. Accordingly, this second indicator may track whether any uROM line invalidation condition occurs while the uROM line is being fetched. If this second indicator indicates the uROM line is invalid when it is received, the corresponding line may be dropped or marked as invalid as it is delivered to the MS pipeline.
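The interplay of these two indicators can be summarized in a small sketch; this is an interpretation of the behavior described above (the class and method names are illustrative, not taken from any embodiment):

```python
class MSRequestEngine:
    """Minimal sketch of the two request-engine indicators described in the text."""

    def __init__(self):
        self.waiting_urom_line = False  # Waiting_uROM_Line: outstanding uROM fetch?
        self.wait_valid = True          # Wait Valid: is the in-flight line still wanted?

    def issue_fetch(self):
        # One IFU fetch request at a time: mark an outstanding uROM line.
        self.waiting_urom_line = True
        self.wait_valid = True

    def invalidate(self):
        # A branch, event, or thread-selection change occurred while the
        # uROM line was in flight, so the awaited line is no longer valid.
        if self.waiting_urom_line:
            self.wait_valid = False

    def deliver(self, line):
        # Deliver the received line to the MS pipeline, or drop it if it was
        # invalidated while being fetched; either way the wait state clears.
        self.waiting_urom_line = False
        return line if self.wait_valid else None
```

A line that arrives after an invalidation is thus dropped (or marked invalid) rather than executed.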
In one embodiment, a MS request engine may conceptually be considered to contain two finite state machines (FSM's), one interfacing to the MS pipeline itself (and its state machine) and the other of which interfaces to the IFU pipeline (and its state machine).
Embodiments can be implemented in many different systems. For example, embodiments can be realized in a processor such as a multicore processor. Referring now to
As shown in
Coupled between front end units 310 and execution units 320 is an out-of-order (OOO) engine 315 that may be used to receive the micro-instructions and prepare them for execution. More specifically OOO engine 315 may include various buffers to re-order micro-instruction flow and allocate various resources needed for execution, as well as to provide renaming of logical registers onto storage locations within various register files such as register file 330 and extended register file 335. Register file 330 may include separate register files for integer and floating point operations. Extended register file 335 may provide storage for vector-sized units, e.g., 256 or 512 bits per register.
Various resources may be present in execution units 320, including, for example, various integer, floating point, and single instruction multiple data (SIMD) logic units, among other specialized hardware. For example, such execution units may include one or more arithmetic logic units (ALUs) 322. Results may be provided to retirement logic, namely a reorder buffer (ROB) 340. More specifically, ROB 340 may include various arrays and logic to receive information associated with instructions that are executed. This information is then examined by ROB 340 to determine whether the instructions can be validly retired and result data committed to the architectural state of the processor, or whether one or more exceptions occurred that prevent a proper retirement of the instructions. Of course, ROB 340 may handle other operations associated with retirement.
As shown in
Note that if branches or events are detected or thread selection flips, the first FSM will leave the Fetching state and enter the Invalid Line state (block 430). If the second FSM is in the Wait state, e.g., waiting for the pipeline to drain when this transition happens, this FSM returns to the idle state (block 460) and no IFU request is made. However, if the second FSM has already transitioned out of the wait state (e.g., waiting for pipeline drain state) when this transition happens, it will complete its IFU requests (block 450).
As shown in
With regard to state machine 450, which is a state machine that interfaces the MS request engine and the IFU pipeline, an idle state 460 is exited when a fetching state occurs, which causes a state 470 to be entered to cause the IFU to become quiescent and an execution pipeline to be drained. If the streaming buffer is ready, control passes to an IFU request state 480, where the request can be made of the IFU. If the line is present in the cache, it is returned to the MS request engine and control passes back to the idle state. Otherwise on a cache miss, a wait memory state 490 occurs. Note that if the streaming buffer is full and uncacheable lines are present, a purge streaming buffer state 485 is entered from state 470.
Once the full uROM line is received in the MS, the second FSM delivers the line to the first FSM, which will then return to the idle state (block 410). Note that this line will be marked as invalid as it is delivered to the MS pipeline (and thus will not be executed) if the first FSM is in the Invalid Line state when it receives the line.
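The transitions of the second FSM (the one interfacing the MS request engine to the IFU pipeline) can be sketched as follows. This is a simplified reading of the states described above; the state and event names are illustrative, and the invalidation interplay with the first FSM is omitted for brevity:

```python
from enum import Enum, auto

class IfuSideState(Enum):
    """Second FSM: MS request engine <-> IFU pipeline (simplified sketch)."""
    IDLE = auto()               # no outstanding microcode fetch
    DRAIN_AND_QUIESCE = auto()  # drain execution pipeline, quiesce IFU
    PURGE_SB = auto()           # release SB entries held by UC code lines
    IFU_REQUEST = auto()        # issue the microcode fetch to the IFU
    WAIT_MEMORY = auto()        # cache miss: wait for data from memory

TRANSITIONS = {
    (IfuSideState.IDLE, "fetch"): IfuSideState.DRAIN_AND_QUIESCE,
    (IfuSideState.DRAIN_AND_QUIESCE, "sb_ready"): IfuSideState.IFU_REQUEST,
    (IfuSideState.DRAIN_AND_QUIESCE, "sb_full_uc"): IfuSideState.PURGE_SB,
    (IfuSideState.PURGE_SB, "sb_ready"): IfuSideState.IFU_REQUEST,
    (IfuSideState.IFU_REQUEST, "cache_hit"): IfuSideState.IDLE,
    (IfuSideState.IFU_REQUEST, "cache_miss"): IfuSideState.WAIT_MEMORY,
    (IfuSideState.WAIT_MEMORY, "data_ready"): IfuSideState.IFU_REQUEST,
}

def step(state, event):
    """Apply one transition; unrecognized events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

On a cache hit the line returns to the MS request engine and the FSM goes idle; on a miss it waits for memory and then re-issues the request.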
Some processor implementations may include an IFU and front end (FE) pipeline designed to fetch macro-instruction bytes only. To enable embodiments to be adapted to such a design (and allow storage of received microcode in the instruction cache), the IFU may be multiplexed in time, as discussed above. In this way, the IFU can service fetch requests from both the normal macro-instruction state machines and the microcode sequencer. As such, microcode data may be present within RAM, cache hierarchy, IFU, and MS. To this end, the logic of
The IFU may handle various stall conditions when fetching macro-instructions. Even though these conditions may not result in actual data-path pipeline stall due to simultaneous multithreading (SMT), a thread-specific fetching engine may be implemented to handle these conditions. In one embodiment, these stall conditions are partitioned into two groups (GRP1 and GRP2) as follows.
GRP1 stall conditions are conditions that need microcode assists, and may include ITLB fault, ITLB miss, and UC fetch. GRP2 stall conditions may be conditions that can be resolved purely by hardware, such as a miss in the instruction cache, instruction victim cache and instruction streaming buffer; a hit in the streaming buffer with data not ready (SBstall); a miss to a locked instruction cache set; or where the streaming buffer is not ready for a fetch request.
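The partition of the stall conditions described above can be sketched as a simple classifier; the condition identifiers below are shorthand labels for the conditions named in the text, not terms from any embodiment:

```python
# GRP1: stall conditions that need a microcode assist to resolve.
GRP1 = {"itlb_fault", "itlb_miss", "uc_fetch"}

# GRP2: stall conditions resolved purely by hardware.
GRP2 = {
    "ic_miss",          # miss in instruction cache / victim cache / streaming buffer
    "sb_stall",         # hit in streaming buffer with data not ready
    "locked_set_miss",  # miss to a locked instruction cache set
    "sb_not_ready",     # streaming buffer not ready for a fetch request
}

def stall_group(condition):
    """Classify a stall condition into GRP1 (microcode assist) or GRP2 (hardware)."""
    if condition in GRP1:
        return "GRP1"
    if condition in GRP2:
        return "GRP2"
    return None
```

The distinction matters because GRP1 conditions invoke service microcode flows, which themselves may need the IFU and SB, giving rise to the nested-stall cases described above.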
In one embodiment, the “IFU quiescent” portion of “drain exe pipeline & IFU quiescent” (state 420) of the second FSM of
Macro-instruction code fetch cache misses are sent to a stall address, stall handler, and SB. The stall address and stall handler are designed to control the pipeline, while the SB is designed to fetch macro-instruction code from memory on a cache miss. For microcode fetch, cache misses are gated from going to the stall address and stall handler but continue to go to the SB. Therefore the SB is used to fetch code from memory for both macro-instruction code and microcode. In addition, microcode fetch cache misses may also be sent to the microcode requester and the SB will also inform the microcode requester when data from memory is ready. Accordingly, the microcode requester may re-issue the request through a front end multiplexer of the MS (e.g., multiplexer 205 of
A UC code fetch that splits across cache lines gives rise to a boundary condition in which the SB provides two entries that can hold two cache lines. To obtain all bytes of an instruction in this situation, two cache lines are to be read and thus all SB resources are consumed. On the other hand, before the entire UC fetch sequence can be completed so that SB entries that hold UC code cache lines can be released, a microcode fetch is performed. Therefore, the “SB full & UC line” state of the second FSM may be implemented to release SB resources for microcode fetch (block 485) after UC code fetch is done and consumed but before the normal UC fetch sequence can be completed to release such resources.
Microcode sequencer virtualization in accordance with an embodiment of the present invention thus allows “microcode” to be generated (even at run time) and stored in the architecturally addressable RAM space other than uROM. The virtualized microcode sequencer enables a processor to fetch and sequence both existing uROM microcode as well as microcode flows generated at run time. In certain embodiments, an IFU can be used to fetch “microcode” stored in the architecturally addressable space into the machine and to cache them into the instruction cache. Using an embodiment of the present invention, microcode can be generated post-silicon, which provides flexibilities to extend a silicon feature set post-silicon. Further by enabling more realizable microcode update, a new revenue source for a processor manufacturer can be realized. Furthermore, flexible microcode strategies enable performance/cost/power/complexity trade-off, and can further provide new ways to work around silicon issues. In one embodiment, thread selection may be based on IQ (instruction queue) emptiness, IDQ (uop queue/buffer) fullness (pre-allocation), and MS state (idle/stall).
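The thread-selection criteria listed in the last sentence above can be sketched as a priority function; this is one plausible interpretation only (the exact policy is not specified in the text, and the field names are illustrative):

```python
def select_thread(threads):
    """Select the next thread to service, as a hypothetical policy based on the
    criteria named in the text: skip threads with an empty IQ (nothing to decode)
    or a full IDQ (nowhere to allocate uops), and prefer threads whose MS is idle.
    Each thread is a dict: {"id", "iq_empty", "idq_full", "ms_state"}."""
    eligible = [t for t in threads if not t["iq_empty"] and not t["idq_full"]]
    if not eligible:
        return None
    # Stable sort: threads with an idle MS sort ahead of stalled ones.
    eligible.sort(key=lambda t: t["ms_state"] != "idle")
    return eligible[0]["id"]
```

Under this sketch, a thread blocked on a stalled MS yields the front end to a thread that can make immediate progress.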
Referring now to
If instead it is determined that the macro-instruction is complex, control passes from diamond 510 to block 520. There a microcode fetch may be triggered in a microcode sequencer. That is, if the determination is made that the macro-instruction is complex, the instruction decoder may send a signal and the corresponding macro-instruction to the microcode sequencer for implementing fetch and sequencing operations. Microcode fetch may be triggered by issuing an instruction fetch request for the microcode (block 530). This request may be sent from the microcode sequencer in the form of a next uop instruction pointer, which after being translated into a physical address is sent to a front end of the instruction fetch unit.
As discussed above, time multiplexing may occur between this instruction request and requests coming to the IFU from other paths, such as branch predictors or so forth. When the multiplexer or other selector of the instruction fetch unit provides the uop instruction pointer in the form of a physical address to storage structures of the IFU, including an instruction cache and a streaming buffer, it may be determined whether a hit occurs (diamond 535). If not, the IFU issues a read request to the memory hierarchy to obtain the requested microcode. That is, because the microcode sequencer does not include an on-board uROM, a read request is issued to the addressable memory space (block 540). At various intervals, the microcode sequencer may detect the return of the requested instruction (diamond 550). This detection may be implemented using various mechanisms of the microcode sequencer. For example, the IFU may allow one outstanding instruction cache miss at a time: the IFU stalls when an instruction cache miss occurs and waits for data from the SB, and in the normal case the SB informs a stall FSM to reissue the request. When a virtualized microcode sequencer in accordance with an embodiment of the present invention is actuated, however, the IFU stall FSM will not change state. The MS requester therefore hijacks the SB data-ready signal. When the return is detected, control passes back to block 530, discussed above. This time, a hit will occur in at least the streaming buffer. Accordingly, control passes to block 560, where the desired microcode is received in the microcode sequencer. The microcode sequencer may then generate and sequence from the received microcode a set of uops that correspond to the macro-instruction (block 570). Control then passes to block 580, for storage of the uops in the decoded queue, where they can be provided to the pipeline. While shown with this particular implementation in the embodiment of
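The hit/miss/re-issue loop described above can be sketched as follows; the data structures are plain dictionaries standing in for the instruction cache, streaming buffer, and memory hierarchy (an illustration of the control flow only, not of any hardware interface):

```python
def ms_fetch(pa, icache, streaming_buffer, memory):
    """Sketch of the microcode fetch loop: look up the physical address in the
    instruction cache; on a miss, read the line from addressable memory into the
    streaming buffer (block 540), then re-issue the request, which now hits in
    at least the streaming buffer, and fill the instruction cache."""
    line = icache.get(pa)
    if line is None:                       # miss in the instruction cache (diamond 535)
        streaming_buffer[pa] = memory[pa]  # IFU read request to the memory hierarchy
        line = streaming_buffer[pa]        # re-issued request hits the streaming buffer
        icache[pa] = line                  # fill so subsequent lookups hit the cache
    return line                            # delivered to the microcode sequencer (block 560)
```

A second fetch of the same line then hits directly in the instruction cache without touching memory.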
Embodiments may be implemented in many different system types. Referring now to
Still referring to
Furthermore, chipset 690 includes an interface 692 to couple chipset 690 with a high performance graphics engine 638, by a P-P interconnect 639. In turn, chipset 690 may be coupled to a first bus 616 via an interface 696. As shown in
Referring now to
When decoder 265 detects a CISC instruction, it provides the first number of uops (e.g., 4) to decoded queue 285, and the remainder will be delivered by the MS. A first vector 288 is the MS entry point UIP generated for the MS to read the uops immediately following the first 4 uops delivered by instruction decoder 265. Subsequently, NUIP logic 255 selects one UIP from UIPs generated by the JEU, MS branch execution, or recycle logic 258. Note that recycle logic 258 may tolerate cache miss and stalls in the IFU when uops are fetched from memory.
Mechanisms may enable a front end restart to work with or without a victim cache (VC), which enables maintaining an inclusive property. Each instruction cache line may include a tag with a bit to differentiate a macro-instruction line and a microcode line. In general, the logic may operate to detect a CISC instruction X, and determine the UIP. Then the cache lines that contain X can be identified, which may be 1 or 2 cache lines that could be in the instruction cache or VC. The IFU is caused to be quiescent and the pipeline is drained. If both lines containing X are in the VC, the MS may resume, described further below. If instead one line is in the VC and one line is in the instruction cache, the line in the VC is read out and then the VC is flushed. Next, the lines in the instruction cache containing X are evicted into the VC, and the line in the instruction cache, if it exists, is read out and moved into the VC. Thereafter, the MS resumes, and the IFU fetches both macro-instructions and microcode. When a new line is to be placed into the instruction cache, if it is a macro-instruction line, the replaced macro-instruction line is evicted into the VC, whereas a replaced microcode line is simply dropped. Note that without a VC, the MS request engine resets the front end restart FSM to start over if the front end restart misses in the cache.
Embodiments may be implemented in code and may be stored on a storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.
This application claims priority to U.S. Provisional Patent Application No. 61/349,629 filed on May 28, 2010, entitled METHOD AND APPARATUS FOR VIRTUALIZED MICROCODE SEQUENCING.