Processors are implemented in a wide variety of computing devices, ranging from high-end server computers to low-end portable devices such as smartphones, netbook computers, and so forth. In general, the processors all operate to execute instructions of a code stream to perform desired operations.
To effect operations on data, the data is typically stored in general-purpose registers of the processor, which are storage locations within a core of the processor that can be identified as source or destination locations within the instructions. In general, there are a limited number of registers available in a processor. Oftentimes, a computer program can be optimized for a particular platform on which it executes. This optimization can take many forms and can include programmer-driven or compiler-driven optimizations. One manner of optimization is to execute an instruction using hint information that can be provided with the instruction. However, the availability of hint sources for providing this hint information is relatively limited, which thus diminishes the optimizations available via hint information.
In various embodiments, hint information for use in connection with various instructions to be executed within a processor can be provided more efficiently using an independent set of registers that can store the hint information. This independent register file is referred to generically herein as a hint register file. Although the scope of the present invention is not limited in this regard, the embodiments of such hint registers described herein relate to so-called data access instructions, and accordingly the hint registers to be described herein are also referred to as data access hint registers (DAHRs). However, the scope of the present invention is not limited in this regard, and hint registers can be provided for storing hint information used for purposes other than data access instructions, such as instruction fetch behaviors, branch prediction behaviors, instruction dispersal behaviors, replay behaviors, etc. In fact, embodiments can apply to many scenarios in which there is more than one way to do something and, depending on the scenario, sometimes one way performs better and sometimes another way performs better.
By way of an independent register file for storing hint information, indexing information can be encoded into at least certain instructions to enable access to the hint information during instruction execution. Such hint information obtained from the hint registers can be used by various logic within the processor to optimize execution using the hint information.
In addition to providing a hint register file, a backup storage such as a stack can be provided to store multiple sets of hint values, such that the values for different sections of code can be maintained efficiently within the processor. For purposes of discussion, this stack can be referred to as a hint or DAHR stack (also referred to as a DAHS) and may be independent of other stacks within a processor.
Embodiments also provide for correct operation for legacy code written for processors that do not support hint registers. That is, embodiments can provide mechanisms to enable legacy code, which carries only limited hint information, to obtain appropriate hint values from the data stored in the hint registers. In addition, because it is recognized that the hint information stored in these registers and used during execution does not affect correctness of operation, but instead aids in efficiency or optimization of the code, embodiments need not maintain absolute correctness of the hint information.
In various embodiments software can refine precisely how the processor should respond to locality hints specified by various data access instructions such as load, store, semaphore and explicit prefetch (lfetch) instructions, via the DAHRs. In various embodiments, a locality hint specified in the instruction selects one of the DAHRs, which then provides the hint information for use in the memory access. In one embodiment there are eight DAHRs usable by load, store and lfetch instructions (DAHR[0-7]); while semaphore instructions and load and store instructions with address post increment can use only the first four of these (DAHR[0-3]).
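The DAHR selection rules above can be sketched as a small check. This is an illustrative model only, with assumed names (`dahr_selectable`, the instruction-kind strings) that do not come from any real ISA definition: load, store, and lfetch instructions may select any of DAHR[0-7], while semaphore instructions and load/store instructions with address post-increment may select only DAHR[0-3].

```python
# Which DAHRs a data access instruction may select, per the description:
# load/store/lfetch use DAHR[0-7]; semaphore and post-increment load/store
# instructions use only DAHR[0-3]. Names here are illustrative assumptions.

FULL_RANGE = {"load", "store", "lfetch"}
RESTRICTED = {"semaphore", "load_postinc", "store_postinc"}

def dahr_selectable(insn_kind: str, dahr_index: int) -> bool:
    """Return True if the instruction kind may select the given DAHR."""
    if insn_kind in FULL_RANGE:
        return 0 <= dahr_index <= 7
    if insn_kind in RESTRICTED:
        return 0 <= dahr_index <= 3
    raise ValueError(f"unknown instruction kind: {insn_kind}")
```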
Note that each register of the hint register file can include a plurality of fields, each of which is to store hint information of a given type. In many embodiments, each register of the hint register file can have the same fields, where each register stores potentially different hint values in the different fields as programmed during operation.
Thus each DAHR contains fields which provide the processor with various types of data access hints. When a DAHR has not been explicitly programmed by software, these data hint fields can be automatically set to default values that best implement the generic locality hints as shown in Table 1, further details of which are below.
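A minimal sketch of such a register file follows, using the seven hint-field names that appear later in this description (fld_loc, mld_loc, llc_loc, pf, pf_drop, pipe, bias). The concrete default encodings of Table 1 are not reproduced here; the value 0 is used as a stand-in default for every field, which is an assumption of this sketch.

```python
# Sketch of a hint register file: each DAHR holds the same set of hint
# fields, and every register starts out at default values. The value 0 is
# a placeholder for the Table 1 defaults, which are not reproduced here.
from dataclasses import dataclass, field

DAHR_FIELDS = ("fld_loc", "mld_loc", "llc_loc", "pf", "pf_drop", "pipe", "bias")

@dataclass
class DAHR:
    fields: dict = field(default_factory=lambda: {f: 0 for f in DAHR_FIELDS})

def default_hint_file(num_regs: int = 8) -> list:
    """Build a hint register file with every register at default values."""
    return [DAHR() for _ in range(num_regs)]
```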
In some embodiments, DAHRs are not saved and restored as part of process context via an operating system, but are ephemeral state. When DAHR state is lost due to a context switch, the DAHRs revert to the default values. DAHRs may also revert to default values upon execution of a branch call instruction.
Embodiments may also optionally automatically save and restore the DAHRs on branch calls and returns using the hint stack within the processor. In one embodiment each stack level can include eight elements corresponding to the eight DAHRs; the number of stack levels may be implementation-dependent. On a branch call (and, in some embodiments, on certain interrupts), the elements in the stack are pushed down one level (the elements in the bottom stack level are lost), the values in the DAHRs are copied into the elements in the top stack level, and the DAHRs then revert to default values. On a branch return (and on return from the interrupt), the elements in the top stack level are copied into the DAHRs, and the elements in the stack are popped up one level, with the elements in the bottom stack level reverting to default values. In one embodiment, on an update to a backing store pointer for a register stack engine (RSE) via a mov-to-BSPSTORE instruction (used for a context switch, but rarely otherwise), all DAHRs and all elements at all levels of the DAHS revert to default values. This pointer indicates to a general register hardware stack (which is separate from the hint stack) where in memory to spill registers when that hardware stack overflows.
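The push and pop behavior described above can be sketched as follows. Defaults are modeled as the value 0 (a stand-in for the Table 1 encodings), and the three-level stack depth is an arbitrary choice, since the number of levels is implementation-dependent.

```python
# Hedged sketch of DAHS push/pop on branch call and return. Each stack
# level is a list of eight elements, one per DAHR; 0 stands in for the
# implementation-defined default values.
DEFAULT = 0
NUM_DAHRS = 8

def branch_call(dahrs, stack):
    """Push on a branch call: shift levels down (bottom level is lost),
    copy the DAHR values into the top level, then reset the DAHRs."""
    stack[1:] = stack[:-1]           # bottom stack level is lost
    stack[0] = list(dahrs)           # top level receives current DAHR values
    return [DEFAULT] * NUM_DAHRS     # DAHRs revert to defaults

def branch_return(dahrs, stack):
    """Pop on a branch return: restore the DAHRs from the top level, shift
    levels up, and reset the bottom level to defaults."""
    restored = list(stack[0])
    stack[:-1] = stack[1:]
    stack[-1] = [DEFAULT] * NUM_DAHRS
    return restored
```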
Referring now to
Still referring to
In one embodiment, a representative move-to-hint register instruction may take the following form: mov dahr3=imm16. Responsive to this instruction, the source operand is copied to the destination register. More specifically, the value in imm16 is placed in the DAHR specified by the dahr3 instruction field.
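The semantics of this move-to-hint-register instruction can be sketched as below. The raw-value register model is an assumption of the sketch; unpacking of the immediate into individual hint fields is not shown.

```python
# Illustrative model of mov dahr3=imm16: the 16-bit immediate source
# operand is copied into the DAHR named by the 3-bit dahr3 field.
def mov_to_dahr(hint_file, dahr3, imm16):
    """Copy the 16-bit immediate into the selected hint register."""
    assert 0 <= dahr3 <= 7, "dahr3 is a 3-bit register specifier"
    assert 0 <= imm16 <= 0xFFFF, "imm16 is a 16-bit immediate"
    hint_file[dahr3] = imm16
```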
Note that method 10 is used to write hint values into a given register of the hint register file according to code (e.g., user level or system level). Understand that upon system reset, default values can be loaded into all of the registers of the hint register file. Furthermore, although only a single register write instruction is shown in
When programming of the hint registers is completed, which may include programming of all the registers, a single register or some number in between, these registers can be accessed during execution of code to optimize some aspect of execution via this hint information stored in the hint registers. Also understand that a software function can program multiple DAHRs at different times. For example, the function can program and access a first of these programmed DAHRs (e.g., with a load instruction), and at a later point in the code program others of the DAHRs.
Referring now to
In various embodiments, rather than encoding hint information into this immediate value of the data access instruction, the immediate value can be used to convey an index into the hint register file. Thus the immediate value can be used as an index value to access a particular register of the hint register file, as seen at block 70 of
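The indexing step just described can be sketched as follows; the function name and the 3-bit mask (selecting one of eight DAHRs) are assumptions of this sketch.

```python
# Sketch: the instruction's immediate selects a register in the hint
# register file, and the hints stored there steer execution of the access.
def lookup_hints(hint_file, imm_index):
    """Use the data access instruction's immediate as a hint-file index."""
    return hint_file[imm_index & 0x7]   # 3 bits select one of 8 DAHRs
```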
Referring now to
Still referring to
On a function return, control passes to block 260 where the hint values can be returned from the top of the hint stack to the registers of the hint register file. Accordingly, the previously stored values from the calling location can be returned such that the hint values usable by this portion of the code are present in the hint register file. As further seen in
Thus on a branch call such as to a function, the values in the DAHRs (if implemented) are pushed onto the hint stack, and the DAHRs revert to default values. Similarly, on a return, the values in the DAHRs are copied from the top level of the hint stack, the stack is popped, and the bottom level of the hint stack reverts to default values.
For a graphical illustration of the mechanisms for pushing hint values onto the hint stack and popping values from the hint stack into the hint registers, reference can be made to
Referring now to
Various specific data access hints can be implemented within DAHRs. In one embodiment, the data access hint register format is as shown in
The semantics of the hints for these hint fields in accordance with an embodiment of the present invention are described in the following Tables 3-9.
Table 3 above sets forth field values for a first-level (L1) cache field in accordance with one embodiment of the present invention. Specifically, the hints specified by fld_loc field 301 allow software to specify the locality, or likelihood of data reuse, with regard to the first-level (L1) cache. For example, the fld_nru hint can be used to indicate that the data has some non-temporal (spatial) locality (meaning that adjacent memory objects are likely to be referenced as well) but poor temporal locality (meaning that the referenced data is unlikely to be re-accessed soon). A processor may use this hint by placing the data in a separate non-temporal structure at the first level, if implemented, or by encaching the data in the level 1 cache, but marking the line as eligible for replacement. The fld_no_allocate hint is stronger, indicating that the data is unlikely to have any kind of locality (or likelihood of data reuse), with regard to the level 1 cache. A processor may use this hint by not allocating space at all for the data at level 1. Of course other uses for these and the other hint fields are possible in different embodiments.
Table 4 above sets forth field values for a mid-level (L2) cache field in accordance with one embodiment of the present invention. Specifically, the hints specified by mld_loc field 302 allow software to specify the locality, or likelihood of data reuse, with regard to the mid-level (L2) cache, similarly to the level 1 cache hints.
Table 5 above sets forth field values for a last-level (LLC) cache field in accordance with one embodiment of the present invention. Specifically, the hints specified by llc_loc field 303 allow software to specify the locality, or likelihood of data reuse, with regard to the last-level cache (LLC), similarly to the level 1 and 2 cache hints, except that there is not a no-allocate hint.
Table 6 above sets forth field values for a prefetch field in accordance with one embodiment of the present invention. The hints specified by pf field 304 allow software to control any data prefetching that may be initiated by the processor based on this reference. Such automatic data prefetching can be disabled at the first-level cache (pf_no_fld), the mid-level cache (pf_no_mld), or at all cache levels (pf_none).
Table 7 above sets forth field values for another prefetch field in accordance with an embodiment of the present invention. The hints specified by pf_drop field 305 allow software further control over any software-initiated data prefetching due to this instruction (for the lfetch instruction) or any data prefetching that may be initiated by the processor based on this reference. Rather than disabling prefetching into various levels of cache, as provided by hints in the pf field, hints specified by this field allow software to specify that prefetching should be done, unless the processor determines that such prefetching would require additional execution resources. For example, prefetches may be dropped if it is determined that the virtual address translation needed is not already in a data translation lookaside buffer (TLB) (pfd_tlb); if it is determined that either the translation is not present or the data is not already at least at the mid-level cache level (pfd_tlb_mld); or if these or any other additional execution resources are needed in order to perform the prefetch (pfd_any).
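The pf_drop policy above can be sketched as a decision function. The condition names mirror the text (pfd_tlb, pfd_tlb_mld, pfd_any); the function signature and boolean inputs are assumptions of this sketch.

```python
# Hedged model of the pf_drop hints: a prefetch proceeds unless it would
# require additional execution resources under the selected drop condition.
def should_drop_prefetch(pf_drop, tlb_hit, in_mld, extra_resources_needed):
    """Decide whether a prefetch is dropped rather than consuming resources."""
    if pf_drop == "pfd_tlb":
        # drop if the virtual address translation is not already in the TLB
        return not tlb_hit
    if pf_drop == "pfd_tlb_mld":
        # drop if the translation is absent or the data is not already
        # at least at the mid-level cache
        return (not tlb_hit) or (not in_mld)
    if pf_drop == "pfd_any":
        # drop if any additional execution resources would be needed
        return (not tlb_hit) or (not in_mld) or extra_resources_needed
    return False   # no drop hint: always perform the prefetch
```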
Table 8 above sets forth example values for further prefetch hint values in accordance with an embodiment of the present invention. The hints specified by pipe field 306 allow software to specify how likely or soon it is to need the data specified by an lfetch instruction or a speculative load instruction. The pipe_defer hint indicates that the data should be prefetched as soon as possible (lfetch instruction) or copied into the target general register (speculative load instruction) if it would not be very disruptive to the execution pipeline to do so. If this data movement might delay the pipeline execution of subsequent instructions (for example, due to TLB or mid-level cache misses), the instruction is instead executed in the background, allowing the pipeline to continue executing subsequent instructions. For speculative load instructions, if this background execution would take significantly extra time, the processor may spontaneously defer the speculative load, as allowed by a given recovery model.
The pipe_block hint indicates that the data should be prefetched as soon as possible (lfetch instruction) or copied into the target general register (speculative load instruction) independent of whether this might delay the pipeline execution of subsequent instructions. For speculative load instructions, no spontaneous deferral is done.
Table 9 above sets forth hint values for a cache coherency hint field in accordance with one embodiment of the present invention. The hints specified by bias field 307 allow software to optimize cache coherence activities. For load instructions and lfetch instructions, if the referenced line is not already present in the processor's cache, and if the processor can encache the data in either the shared or the exclusive state of a modified exclusive shared invalid (MESI) protocol, the bias_excl hint indicates that the processor should encache the data in the exclusive state, while the bias_shared hint indicates that the processor should encache the data in the shared state.
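The bias hint's effect can be sketched as a small state-selection function. The function name and boolean inputs are assumptions; only the choice between the MESI shared ("S") and exclusive ("E") states for a newly encached line is modeled.

```python
# Illustrative model of the bias hint: on a load/lfetch miss where the
# processor may legally encache the line as either shared or exclusive,
# the bias field picks the MESI state; otherwise coherence proceeds normally.
def encache_state(bias, line_present, can_choose):
    """Pick a MESI state for a newly encached line, honoring the bias hint."""
    if line_present or not can_choose:
        return None   # no choice to make; normal coherence rules apply
    if bias == "bias_excl":
        return "E"    # encache in the exclusive state
    if bias == "bias_shared":
        return "S"    # encache in the shared state
    return None
```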
Embodiments may be implemented in instructions for execution by a processor, including instructions of a given ISA. These instructions can include both specific instructions such as the instructions described above to store values into hint registers, as well as instructions that index into a given hint register of the hint register file to obtain hint information for use in connection with instruction execution.
As an example, processor logic can receive a first instruction such as a given register write instruction that includes an identifier of a first hint register of the hint register file and further includes a first value to be stored into the register (which can be provided as immediate data of the instruction). Responsive to this instruction, the logic can store the first value in the first hint register. This first value may include individual values each corresponding to a hint field of the first hint register.
After this programming of the hint register, the logic can receive a second instruction to perform an operation according to an opcode of the instruction. Note that this instruction may have a data portion (such as an immediate data field) to index the first hint register of the hint register file. Then the operation can be performed according to at least one of the individual values stored in the first hint register. In this way, optimization of the operation can occur using this hint information.
Embodiments can be implemented in many different processor types. For example, embodiments can be realized in a processor such as a single core or multicore processor. Referring now to
As shown in
Coupled between front end unit 510 and execution units 520 is an out-of-order (OOO) engine 515 that may be used to receive the micro-instructions and prepare them for execution. More specifically OOO engine 515 may include various buffers to re-order micro-instruction flow and allocate various resources needed for execution, as well as to provide renaming of logical registers onto storage locations within various register files such as register file 530 and extended register file 535. Register file 530 may include separate register files for integer and floating point operations. Extended register file 535 may provide storage for vector-sized units, e.g., 256 or 512 bits per register. As further seen, a hint register file 538 may be present that includes a plurality of registers, e.g., having the field structure shown in
Various resources may be present in execution units 520, including, for example, various integer, floating point, and single instruction multiple data (SIMD) logic units, among other specialized hardware. For example, such execution units may include one or more arithmetic logic units (ALUs) 522.
When operations are performed on data within the execution unit, results may be provided to retirement logic, namely a reorder buffer (ROB) 540. More specifically, ROB 540 may include various arrays and logic to receive information associated with instructions that are executed. This information is then examined by ROB 540 to determine whether the instructions can be validly retired and result data committed to the architectural state of the processor, or whether one or more exceptions occurred that prevent a proper retirement of the instructions. Of course, ROB 540 may handle other operations associated with retirement.
As shown in
Note that while the implementation of the processor of
Embodiments may be implemented in many different system types. Referring now to
Still referring to
Furthermore, chipset 690 includes an interface 692 to couple chipset 690 with a high performance graphics engine 638, by a P-P interconnect 639. In turn, chipset 690 may be coupled to a first bus 616 via an interface 696. As shown in
Embodiments may be implemented in code and may be stored on a non-transitory storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.