Embodiments relate to a processor and more particularly to a processor having profiling capabilities.
During a design process of a processor, dynamic profiling of instructions is traditionally used prior to a hardware design freeze for improving instruction set architecture (ISA) performance and/or improving software performance on a fixed ISA prior to a software design freeze. However, this approach suffers in that the optimal ISA performance is based on simulations that assume certain system behavior (memory accesses for instance) that could be different in reality. As such, optimal ISA performance is based on simulations that may not cover all possibilities that could occur in real life applications post hardware design freeze.
In various embodiments, techniques are provided for performing non-invasive dynamic profiling to enable a mechanism to improve ISA performance post hardware design freeze. The basic principle involves in situ profiling of instructions executed on a processor, in an intelligent manner. To this end, embodiments may track and keep count of the most used set of select instructions. Dynamic profiling of all instructions would be expensive in terms of area. Instead, embodiments may identify a subset of instructions based at least in part on static analysis of code during compile time, to identify potential candidate instructions suitable for dynamic profiling.
In turn, these potential candidate instructions may be profiled dynamically during runtime to identify a subset of these instructions that are most active. Hint information regarding these most active instructions of the potential candidate instructions may be provided to various resources of a processor to optimize performance. In a particular example, this hint information can be provided to an instruction caching structure to optimize storage and maintenance of these most used instructions within the caching structure. In this way, the performance penalty of cache miss for most active instructions can be reduced or avoided.
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention described below. It will be apparent, however, to one skilled in the art that the embodiments of the invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form to avoid obscuring the underlying principles of the embodiments of the invention.
In
The front end unit 130 includes a branch prediction unit 132 coupled to an instruction cache unit 134, which is coupled to an instruction translation lookaside buffer (TLB) 136, which is coupled to an instruction fetch unit 138, which is coupled to a decode unit 140. The decode unit 140 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 140 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 190 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 140 or otherwise within the front end unit 130). The decode unit 140 is coupled to a rename/allocator unit 152 in the execution engine unit 150.
The execution engine unit 150 includes the rename/allocator unit 152 coupled to a retirement unit 154 and a set of one or more scheduler unit(s) 156. The scheduler unit(s) 156 represents any number of different schedulers, including reservations stations, central instruction window, etc. The scheduler unit(s) 156 is coupled to the physical register file(s) unit(s) 158. Each of the physical register file(s) unit(s) 158 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 158 comprises a vector register unit, a write mask register unit, and a scalar register unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) unit(s) 158 is overlapped by the retirement unit 154 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using a register maps and a pool of registers; etc.). The retirement unit 154 and the physical register file unit(s) 158 are coupled to the execution cluster(s) 160. The execution cluster(s) 160 includes a set of one or more execution units 162 and a set of one or more memory access units 164. The execution units 162 may perform various operations (e.g., shifts, addition, subtraction, multiplication) and on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 156, physical register file(s) unit(s) 158, and execution cluster(s) 160 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster—and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 164). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
The set of memory access units 164 is coupled to the memory unit 170, which includes a data TLB unit 172 coupled to a data cache unit 174 coupled to a level 2 (L2) cache unit 176. Instruction cache unit 134 and data cache unit 174 may together be considered to be a distributed L1 cache. In one exemplary embodiment, the memory access units 164 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 172 in the memory unit 170. The instruction cache unit 134 is further coupled to a level 2 (L2) cache unit 176 in the memory unit 170. The L2 cache unit 176 may be coupled to one or more other levels of cache and eventually to a main memory.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 100 as follows: 1) the instruction fetch unit 138 performs the fetch and length decoding stages 102 and 104; 2) the decode unit 140 performs the decode stage 106; 3) the rename/allocator unit 152 performs the allocation stage 108 and renaming stage 110; 4) the scheduler unit(s) 156 performs the schedule stage 112; 5) the physical register file unit(s) 158 and the memory unit 170 perform the register read/memory read stage 114; the execution cluster 160 perform the execute stage 116; 6) the memory unit 170 and the physical register file(s) unit(s) 158 perform the write back/memory write stage 118; 7) various units may be involved in the exception handling stage 122; and 8) the retirement unit 154 and the physical register file(s) unit(s) 158 perform the commit stage 124.
The core 190 may support one or more instructions sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set developed by MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.), including the instruction(s) described herein. In one embodiment, the core 190 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2, and/or some form of the generic vector friendly instruction format (U=0 and/or U=1)), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).
While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 134/174 and a shared L2 cache unit 176, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a L1 internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the caches may be external to the core and/or the processor.
Thus, different implementations of the processor 200 may include: 1) a CPU with special purpose logic being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 202A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, a combination of the two); 2) a coprocessor with the cores 202A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput); and 3) a coprocessor with the cores 202A-N being a large number of general purpose in-order cores. Thus, the processor 200 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 200 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache units 204A-204N (including L1 cache) within the cores, a set of one or more shared cache units 206, and external memory (not shown) coupled to the set of integrated memory controller units 214. The set of shared cache units 206 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect unit 212 interconnects special purpose logic 208, the set of shared cache units 206, and the system agent unit 210/integrated memory controller unit(s) 214, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 206 and cores 202A-N.
In some embodiments, one or more of the cores 202A-N are capable of multi-threading. The system agent unit 210 includes those components coordinating and operating cores 202A-N. The system agent unit 210 may include for example a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 202A-N and the integrated graphics logic 208. The display unit may be for driving one or more externally connected displays.
The cores 202A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 202A-N may be capable of execution of the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set. In one embodiment, the cores 202A-N are heterogeneous and include both the “small” cores and “big” cores described below.
Referring now to
The optional nature of additional processors 315 is denoted in
The memory 340 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 320 communicates with the processor(s) 310, 315 via a multi-drop bus, such as a frontside bus (FSB), point-to-point interface such as an Intel® QuickPath Interconnect (QPI), or similar connection 395.
In one embodiment, the coprocessor 345 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 320 may include an integrated graphics accelerator.
There can be a variety of differences between the physical resources 310, 315 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like.
In one embodiment, the processor 310 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 310 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 345. Accordingly, the processor 310 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect, to coprocessor 345. Coprocessor(s) 345 accept and execute the received coprocessor instructions.
Referring now to
Processors 470 and 480 are shown including integrated memory controller (IMC) units 472 and 482, respectively. In addition, processors 470 and 480 include a dynamic profiling module (DPM) 475 and 485 respectively, details of which are described further below. Processor 470 also includes as part of its bus controller units point-to-point (P-P) interfaces 476 and 478; similarly, second processor 480 includes P-P interfaces 486 and 488. Processors 470, 480 may exchange information via a point-to-point (P-P) interface 450 using P-P interface circuits 478, 488. As shown in
Processors 470, 480 may each exchange information with a chipset 490 via individual P-P interfaces 452, 454 using point to point interface circuits 476, 494, 486, 498. Chipset 490 may optionally exchange information with the coprocessor 438 via a high-performance interface 439 using point-to-point interface circuit 492. In one embodiment, the coprocessor 438 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.
A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Chipset 490 may be coupled to a first bus 416 via an interface 496. In one embodiment, first bus 416 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.
As shown in
Referring now to
Referring now to
Program code, such as code 430 illustrated in
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a non-transitory machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores” may be stored on a tangible non-transitory, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor. Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
Referring now to
In any event,
To aid in determination of the N hot instructions, a threshold storage 820 is present. As seen, threshold storage 820 may store at least one threshold value within a threshold register 822. In an embodiment, this threshold value may be a count equal to the count value of the least hot of the N hot entries. And a corresponding pointer to this entry may be stored in pointer storage 824. Understand that in other embodiments, multiple sets of these threshold registers may be provided to enable multiple segmentations of instructions. For example, with two sets of threshold registers, a first portion of the N hot registers can be identified corresponding to the X top used instructions and an N−X remainder of the top N hot instructions can be associated with a second portion of the N hot registers. Of course many additional sets of registers and possibilities exist in other embodiments.
As further illustrated, DPM 800 further includes a DPM control logic 830, which may be configured to perform dynamic swap operations relating to dynamically updating the threshold value(s) as the number of instructions executed varies over time. In an embodiment, DPM control logic 830 may be implemented as a finite state machine (FSM) although other embodiments, including hardware circuitry, software and/or firmware are possible. As will be described herein, DPM control logic 830 may be configured to perform control operations with regard to the various entries within storage 805, as well as to perform processing on the resulting count information to identify hot instructions and send hint information associated with one or more of these hot instructions to one or more consumers. Still further as discussed herein in embodiments in which DPM 800 is implemented as a standalone component of a processor, DPM control logic 830 also may be configured to perform arbitration between the various cores or other processors to enable DPM 800 to be dynamically shared as a shared resource by multiple cores or other processing elements of the processor. Understand while shown at this high level in the embodiment of
Still with reference to
Threshold register 822 (which may include multiple registers) may be configured to hold the value of a threshold (set either by software or by a minimum count among the N top tagged instruction entries or multiples of N thereof), while pointer storage 824 may be configured to store the pointer to the entry of entries 810 having the minimum count. Comparator field 812 of each entry is used for comparing the incoming tagged instruction address with its entry address, and if there is a match then the counter value stored in count field 814 is incremented, e.g., by 1 for that entry. This update may also invoke a comparison of the count value with that of threshold value from the threshold register 822.
At initialization, the threshold value stored in threshold register 822 may be set to X (which is a software initialized value). In addition, all N*M entries 810, 815 are initialized to zero, both for tagged instruction address and count value. During operation, every tagged instruction address enters dynamic profiling module 800. If the tagged instruction address has not been profiled before, a new entry is created (e.g., in one of entries 815), and its corresponding counter is incremented and the count value is compared to the threshold value. Instead if the tagged instruction address is already being profiled through an entry in dynamic profiling module 800, that entry's corresponding counter is updated and the count value is compared to the threshold value. If any of the N top tagged instructions (or multiples of N thereof) have a minimum count value greater than the threshold (initialized to X at start by software), then threshold register(s) 822 are updated with the minimum count and pointer storage(s) 824 are updated to the entry that had minimum count. If any of non-N top tagged instructions has a count value greater than the threshold, then this initiates a swap operation in which this entry is swapped with the entry identified by pointer register 824 (namely the entry having the minimum count among N top tagged instructions). Also, threshold register(s) 822 are updated with the new minimum value.
Thus in an embodiment, there are two phases of operation in the dynamic profiling module per processor clock tick: phase 1, in which entry update is performed, which includes count update and comparison with a threshold post comparison of the incoming tagged instruction address with the address stored in the entry, and if the comparison returns a match, operation proceeds to phase 2. In phase 2, if any entry has a higher count than the threshold value, then a dynamic swap operation is performed. In an embodiment, the following operations may be performed in a dynamic swap: if the entry is not part of the N top tagged entries, then this entry is swapped with the entry indicated in pointer register 824 and the threshold value stored in threshold storage 822 is updated to the new minimum. If the entry is part of the N top tagged entries, then the entry with the minimum count among N top tagged entries will update the threshold registers (value and pointer, if needed).
In an embodiment, dynamic profiling module 800 may output, per processor clock tick, profiling information regarding the N top tagged instructions (or multiples of N). Of course hint information regarding fewer instructions may be sent instead per clock cycle. Understand that in different embodiments, dynamic profiling module 800 may be scalable in multiples of N. The minimum count value among N or multiples of N can be determined hierarchically. In an embodiment, the threshold value and pointer value stored in threshold storage 820 may be broadcast to N*M entries 810, 815 to enable the above determinations to occur.
Referring now to
As illustrated, method 900 begins by receiving a tagged instruction in a dynamic profiling circuit (block 910). Note that the terms “dynamic profiling circuit” and “dynamic profiling module” are used interchangeably herein to refer to hardware circuitry, software, firmware and/or combinations thereof to perform the dynamic profiling described herein. As discussed above, this tagged instruction may be received as part of an instruction stream during execution of a given process on one or more cores of a processor. Next it is determined at diamond 920 whether an entry is present in the dynamic profiling circuit for this tagged instruction. In an embodiment, this determination as to whether an entry is present may be based on at least some portion of the address associated with the instruction, which may be used by each of the entries to perform a comparison to determine whether a match exists for a given entry within the DPM. This entry may be one of the N hot entries or may be one of the additional entries associated with less hot instructions. If no entry is present, a new entry may be created for this tagged instruction (block 925). Typically, this created entry will be within one of the less hot additional entries of the DPM. In some embodiments, when all of the entries include instruction information already, an eviction process first may be performed to remove, e.g., an entry associated with the least recently used instruction or a cleanup process may be routinely performed to remove tagged instructions if not active for a given (e.g., threshold) period of time or periodic reset of the DPM.
Still with reference to
Still with reference to
Note that in embodiments herein, the DPM may be used during execution of code that has not been translated or instrumented (such as by a dynamic binary translator (DBT)), enabling applicability in a wide variety of situations, and without the overhead of translation time. Understand while shown at this high level in the embodiment of
Embodiments may identify select (tagged) instructions in different ways. In some embodiments, static analysis of code may be performed, e.g., during compilation. As an example, for a loop of code the choice for tagged instructions could be the first and last instructions of the loop body, along with a conditional instruction that checks the loop iteration. In nested loop code, the choice for tagged instructions could be the first and last instructions of the top loop body, or first and last instructions at several levels of the nested loop body, depending on the total number of instructions at various levels of the nested loop body.
Note that for functions, macros and other similar programming constructs that lead to a fixed body of code tagged instructions can be identified similar to loop constructs. In some embodiments, all instructions that are part of a recursion can be tagged.
In some embodiments, instructions can be classified into three bins: tagged instructions; non-tagged instructions; and conditionally tagged instruction; as described further below. Tagging instructions during static analysis of compilation may enable reduced resource consumption of a dynamic profiling module, which may be resource constrained.
Referring to Table 1 below, shown is an example static analysis of code during a compilation process to identify instructions suitable for dynamic profiling. More specifically, the loop code of Table 1 shows that the choice for tagged instructions could be the first and last instructions of a loop body and the instruction that determines the loop iterator.
For the above example of Table 1, it is sufficient to tag the loop first and last instructions. Also, note that the last instruction is linked to the first instruction of the loop so that the complete range of address between first and last instructions can be determined. In addition, the instruction that determines the loop iterator is tagged.
Tables 2A-2C show examples of static analysis for nested loop code. As seen in the different examples of these Tables, the choice for tagged instructions could be the first and last instructions of the top loop body or first and last instructions at several levels of the nested loop body, depending on the total number of instructions at various levels of the nest loop body.
For the above example of Table 2A, it is sufficient to tag the outer nested loop first and last instructions since the total nested loop instructions are not that many. And, the instruction that determines the outer loop iterator is also tagged.
For the above example of Table 2B, the outer and inner nested loop first and last instructions are tagged since the total nested loop instructions are many. And the outer and inner nested loop instructions that determine the loop iterations are tagged.
For the above example of Table 2C, the inner nested loop first and last instructions are tagged since the total inner nested loop instructions are very many. And the inner nested loop instruction that determines the loop iterator is tagged.
While shown with these representative examples for purposes of illustration, understand that embodiments are not so limited and other static-based analyses may be performed to identify instructions for tagging. Note also that in an embodiment, the number of resources available in hardware for dynamic profiling may be an input to the compiler to enable the compiler to select an appropriate subset of instructions for dynamic profiling adhering to hardware resource constraints.
As to binning instructions into three categories, note that tagged and non-tagged instructions can be determined during static analysis of compilation. Conditionally tagged instructions are those instructions that cannot be binned into tagged or non-tagged classification at compile time, since these instructions rely on run-time values in order to be considered either as tagged or non-tagged. In embodiments, these instructions may be classified as conditionally tagged instructions. Then during operation, run-time hardware may be configured to determine whether a conditionally tagged instruction should be tagged or not based on run-time value. For instance, a loop iterator instruction may be identified as a conditionally tagged instruction, in that a run-time variable of the instruction is one where a programmer has not provided pragmas to indicate the minimum value of the iterator. The run-time hardware may be configured to determine the value of the expression of iterator, and based on, e.g., a software settable threshold, this conditionally tagged instruction will be classified either as tagged or non-tagged. This hardware, based on the result of the execution of the iterator instruction, may flip the tag from “conditionally tagged” to “tagged” if the iterator value is higher than the threshold and “conditionally tagged” to “non-tagged” instruction if it is otherwise. In an embodiment, this hardware is located within the processor execution circuitry.
Table 3 below shows example code including a conditionally tagged instruction.
For the above example, since the value of x is not known at compile time, the instruction that determines the loop iterator is conditionally tagged. Also, the first and last instructions of the loop are conditionally tagged. And the last instruction is linked to the first instruction in the loop to enable identification of the complete range of addresses between first and last instructions.
Referring now to
As illustrated, method 1000 begins by analyzing an incoming instruction (block 1005). Next it is determined whether this instruction is part of a loop within the code (diamond 1010). If not, no further analyses occurs for this instruction, and accordingly, an instruction counter for the analysis tool can be incremented (block 1015) to enable control to pass back to block 1005 for analysis of a next instruction. Note that while described in the context of method 1000 as considering whether an instruction is part of a loop, understand that this determination may also consider whether the instruction is part of a function or recursive code.
If it is determined that the instruction is part of a loop, control passes to diamond 1020 to determine whether it is part of a nested loop. If so, control passes next to diamond 1025 to determine whether the number of nested loop instructions within this nested loop is less than a nested loop threshold. Although the scope of the present invention is not limited in this regard, in one embodiment this nested loop threshold (which can be dynamically set in some cases) may be between approximately 5 and 10.
If it is determined that the number of nested loop instructions is less than this nested loop threshold, control passes to block 1030 where further analysis of this nested loop may be bypassed. As such, control jumps to the end of this nested loop (block 1035). Thereafter, the instruction counter may be incremented (block 1040) so that the next instruction can be analyzed (as discussed above at block 1005).
Still with reference to
If instead it is determined that the one or more variables of the conditional instruction is not known at compile time (and thus are to be determined at run time), control passes to block 1065 where this instruction may be conditionally tagged. In an embodiment, an instruction can be conditionally tagged by setting a conditional tag indicator (single bit, namely 1) of the instruction or as stated above with two bits (10).
Still referring to
In most cases, an N hot tagged instruction leads to a triplet of instructions that are linked representing a loop or nested loops. This triplet includes the loop iterator instruction, first instruction in the loop body and the last instruction in the loop body. Given the triplet, there can be many instructions within the loop body that are not tagged but can be derived from the triplet. As will be described further below, a hint information consumer such as a caching structure may use this triplet to determine whether a specific instruction is not tagged but within a loop body. If so, that specific instruction may also be handled as an N hot instruction, such as storage into a second instruction cache portion. This basically means that a triplet of tagged instructions representing a loop present in the N hot tagged instructions (as analyzed in the dynamic profiling module), can in fact lead to 3+L instructions that may be specially cached, where L is the total instructions in the loop body minus 3 (triplet). There are also cases where the N hot tagged instructions lead to a pair of instructions that are linked, such as representing a hard macro having start and end instructions. The same logic above as to triplets applies to instructions that are within the start/end instructions of the hard macro. There are also cases where an N hot tagged instruction leads to a single instruction that is linked to no other instruction, representing a recursion. By tagging only pair of instructions in the case of hard macros and triplet in the case of (nested) loops, the amount of dynamic profiling hardware is minimized.
As discussed above, a dynamic profiler as described herein may produce at every clock tick updated dynamic profile information that can potentially be used for caching instructions that are most often used. Embodiments may apply filtering to this information, e.g., by low pass filtering the dynamic profile information in order to avoid any high frequency changes to the profile information, which could be an outlier and can cause negative impact on instruction caching.
In one particular embodiment, a moving average filtering technique (or other low pass filter or boxcar filter) may be used to filter dynamic profile information. Such filtering may ensure that any spurious high frequency outliers are removed before providing the low pass filtered dynamic profile information as hint information, e.g., to an instruction caching structure. Coupling a low pass filter in the path between the dynamic profile module and a hint consumer such as an instruction caching structure may ensure that received hint information enhances ISA performance (e.g., enabling caching the most often used instructions).
Referring now to
Different types of cache memories and cache memory hierarchies are possible. However, for purposes of discussion herein, assume that cache memory 1120 includes a first portion 1122 and a second portion 1124, where first portion 1122 is a dedicated cache storage for the N hot instructions, and second cache portion 1124 is an instruction cache for non-tagged instructions and tagged instructions that are outside of the current N hot instructions. As further seen, in embodiments migrations of instructions between these two caches are possible such that when a given tagged instruction within cache portion 1124 is elevated to one of the top N hot instructions, that cache line may be migrated to first cache portion 1122 (and similarly, a least used instruction that is replaced by this new incoming instruction is demoted to second cache portion 1122). Understand that to perform these migrations and further to leverage the hint information, cache memory 1120 may include a cache controller 1126, which may perform these migrations of instructions between the two caches, as well as additional cache control functions.
As further illustrated in FIG.11, processor 1100 further includes an execution circuit 1130, which may be implemented as one or more execution units to execute instructions received from cache memory 1120. Understand that while shown at this high level, many additional structures within a processor and core of a processor may be present in particular embodiments. However such structures are not shown for ease of illustration in
Referring to
Embodiments may improve ISA performance dynamically via non-intrusive dynamic profiling as described herein at least in part by caching most often used instructions, and further maintaining these instructions such that they are not frequently evicted. As discussed above, non-intrusive dynamic profiling as described herein may provide information regarding: instructions that are most often used and not part of any (nested) loop body but can be part of a recursion body or hard macros; and instructions that are most often used and are part of a loop body. Based on linking information present in the last instruction of a loop body linking it to the first instruction of the loop, a complete range of addresses between first and last instructions that constitute a loop can be determined. This information may be used to appropriately store more active instructions for longer durations in an instruction cache. As such for the case of loop instructions, where the first and last instructions are identified as most often used, non-tagged instructions of the loop body between these first and last instructions may be stored and controlled the same as these first and last instructions. Similar logic can be applied to hard macros for which first and last instructions are identified as most often used.
In different embodiments, there may be multiple manners of implementing an instruction caching structure to leverage the profiling hint information described herein. In a first embodiment, one or more separate structures may be provided for most often used instructions. In this embodiment, all instructions are fetched into a first or regular instruction cache regardless of whether they are most often used instruction or not. Based on hint information from the dynamic profiling module, instructions that are most often used then may be cached in a second or special instruction cache. Specifically, the most often used instructions can be dynamically migrated from the regular instruction cache that is the receiver of fetched instructions to the special instruction cache. This approach ensures that the most often used instructions are not evicted in case there is a tsunami of sequential code execution that can potentially evict most often used instructions.
In another embodiment, instead of providing special and regular instruction cache arrays, a single cache memory array may be provided for all instructions, with different portions allocated or locked for the most often used instructions. In one example, a set associative cache memory may be arranged with certain ways locked for use only with the most often used instructions. Such ways may be controlled so that the instructions stored therein are evicted only based on hint information received from the dynamic profiling module (and not based on least recently used or other conventional cache eviction schemes). With this configuration, with certain ways allocated for most often used instructions, all instructions are fetched and inserted into non-locked ways. Based on hint information from the dynamic profiling module, the cache structure can migrate most often used instructions from the non-reserved ways to the reserved ways, thereby protecting the most often used instructions from a potential tsunami of sequential code execution. In either case, dynamic hint information from the dynamic profiling module may be used to identify which set of instructions to specially cache and protect them from eviction.
In yet other embodiments, a cache structure may include a separate storage for decoded instructions, referred to as a decoded instruction cache or decoded streaming buffer). Such separate storage may be used to store decoded instructions that are often used, so that front end units such as instruction fetch and decode stages can be bypassed. Embodiments may control a decoded instruction storage to only store N hot decoded instructions, to improve hit rate.
Eviction from the special instruction cache or locked way of an instruction cache is only when the cache is full, and new hint information (for a new hot instruction) arrives from the dynamic profiling module. In an embodiment, the size of the special instruction cache or the number of ways of an instruction cache locked for storing most often used instructions may be set at a maximum or multiples of N (where N is the top N hot tagged instructions). Understand that in other cases, the expense of a dynamic profiling module can be avoided by directly using compiler tagging to cache tagged instructions based on static analysis. However, in the interest of adding benefits of dynamic profiling, a potentially smaller sized cache may be used to ensure access to the most used instructions
Referring now to
As illustrated, method 1300 begins by receiving hint information from a dynamic profiling circuit (block 1310). In an embodiment, this hint information may include address information and corresponding counts, e.g., of the top N instructions, to thus identify to the cache memory the most active instructions. Next, control passes to block 1320 where an instruction is received in the instruction cache. For example, this instruction may be received as a result of an instruction fetch, prefetch or so forth. Note that the ordering of blocks 1310 and 1320 may be flipped in certain cases.
In any event, control passes to block 1330 where this instruction is stored in a first instruction cache portion. That is, in embodiments described herein a caching structure that is to leverage the hint information can be controlled to provide different portions associated with tagged and non-tagged instructions. For example, different memory arrays may be provided for, at least, the top N hot instructions. In other examples, these separate cache portions may be implemented as certain dedicated ways of sets of the cache memory only for storage of tagged instructions.
In any event, at block 1330 this instruction is stored in a first cache portion, where this first cache portion is associated with non-tagged instructions. Next, control passes to diamond 1340 to determine whether this instruction is associated with a top N instruction. This determination may be based on comparison of address information of this instruction to address information of the hint information. Note that if the instruction itself is one of the top N hot instructions, a match occurs. In other cases, this determination can be based on determining that the instruction, although not tagged itself, is within a loop associated with tagged instructions.
If it is determined that this instruction is not associated with a top N instruction, no further operations occur with regard to this instruction within the cache and thus this instruction remains in the first instruction cache portion. Otherwise, if it is determined that this instruction is associated with a top N instruction, control passes to block 1350 where the instruction may be migrated to a second instruction cache portion. As described above, this second cache portion may be a separate memory array dedicated for hot instructions or a given way of a set dedicated to such hot instructions. As part of this migration it may be determined whether this second instruction cache portion is full (diamond 1360). If so, control passes to block 1370 where a less used instruction is migrated from this second cache portion to the first instruction cache portion. From both of diamond 1360 and block 1370, control passes to block 1380 where the instruction is stored in the second instruction cache portion. Understand while shown at this high level in the embodiment of
As discussed above in some cases, a dynamic profiling module may be provided within or associated with each core of a multicore processor. In other cases, such circuitry can be shared for use by multiple cores or other processing engines, to provide a solution for efficient dynamic profiling infrastructure.
With one or more shared dynamic profiling modules as described herein, each core, when it employs the dynamic profiling infrastructure, will reach a steady state, e.g., with respect to benefiting from increased instruction cache hit rate based on the hint information provided by the dynamic profiling infrastructure. In embodiments, this steady state can be used as a trigger condition to either turn off the dynamic profiling module or switch use of the dynamic profiling infrastructure to another core or other processing engine of the SoC or other processor. Since the dynamic profiling infrastructure is independent of the processor architecture, it can be seamlessly used as dynamic profiling infrastructure for any processor architecture. In this way homogenous and heterogeneous processor architectures may benefit by efficient reuse with regard to a dynamic profiling infrastructure.
In an embodiment, a core, when it has an instruction cache hit rate that falls below a certain threshold, may be configured to issue a request to use the shared dynamic profiling infrastructure. To this end, a request queue may be provided to store these requests. In turn, the dynamic profiling infrastructure may access this request queue (which may be present in a control logic of the DPM, in an embodiment) to identify a given core or other processing element to select for servicing. In some embodiments, a priority technique may be used in which a core can issue a request with a given priority level based on a level of its instruction cache hit rate. And in turn, the shared dynamic profiling infrastructure may include priority determination logic (e.g., within the DPM control logic) to choose an appropriate core (or other processor) for use of the infrastructure based at least in part on the priority levels.
Referring now to
As further illustrated in
Referring now to
Control next passes to block 1520 where the dynamic profiling circuit can be configured for the identified core. For example, this configuration may include dynamically controlling switching of the dynamic profiling circuit to the given core to enable communication of hint information to the core from the dynamic profiling circuit, as well as to provide an instruction stream which includes address (with links in the case of (nested) loops and hard macros) from the core to the dynamic profiling circuit.
Still with reference to
As a core begins to run in steady state while leveraging hint information as described herein to dynamically control its instruction cache memory, its instruction cache hit rate may increase over time. Thus as illustrated, at diamond 1560 it can be determined whether this instruction cache hit rate exceeds a given hit rate threshold. Although the scope of the present invention is not limited in this regard, in an embodiment this hit rate threshold may be between approximately 90 and 95%. If the core instruction cache hit rate does not exceed this hit rate threshold, this is an indication that execution of the program on the core has not reached steady state. As such, additional use of the dynamic profiling circuit to generate hint information for the identified core may continue at block 1530. Otherwise if it is determined that the instruction cache hit rate for the core exceeds the hit rate threshold, this is an indication that the dynamic profiling circuit can be used by another core, e.g., according to a given arbitration policy. Understand while shown at this high level in the embodiment of
The following examples pertain to further embodiments.
In one example, a processor includes: a plurality of cores; a plurality of caches associated with the plurality of cores; a dynamic profiler to identify a plurality of instructions having an activity level greater than a threshold level, the dynamic profiler a shared resource of the processor; and a controller to dynamically enable one or more of the plurality of cores to access the dynamic profiler, where the controller is to enable the dynamic profiler to provide hint information regarding the plurality of instructions to a first core of the plurality of cores.
In an example, the controller is to dynamically enable the first core to access the dynamic profiler in a time multiplexed manner.
In an example, the controller is to dynamically enable the first core to access the dynamic profiler when a hit rate of the first core with respect to an instruction cache is less than a threshold.
In an example, the dynamic profiler comprises: a storage having a plurality of entries to store count information regarding the plurality of instructions; and a comparator to compare the count information from one of the plurality of entries to the threshold level, the dynamic profiler to output the count information from the one of the plurality of entries when the count information from the one of the plurality of entries exceeds the threshold level.
In an example, the dynamic profiler is to dynamically adapt the threshold level based at least in part on the count information from at least one of the plurality of entries.
In an example, the storage includes N×M entries and the dynamic profiler is to store the count information regarding N most frequently accessed instructions in a first subset of the N×M entries, and migrate a first entry of the N×M entries to an entry of the first subset of the N×M entries when the count information associated with the first entry of the N×M entries exceeds count information of the first subset of the N×M entries associated with a least accessed instruction of the N most frequently accessed instructions.
In an example, the plurality of caches comprises a plurality of instruction caches, where a first instruction cache of the plurality of caches includes a first portion dedicated to store the N most frequently accessed instructions and a second portion to store other instructions of the process.
In an example, the processor further comprises a filter to receive the count information regarding the plurality of instructions and filter the count information to provide the hint information regarding the at least some of the plurality of instructions to the first core.
In another example, an apparatus comprises: a profiling circuit to profile tagged instructions of code in execution, the profiling circuit to output hint information of at least a first portion of the tagged instructions for an evaluation interval; a filter coupled to the profiling circuit to receive the hint information and filter the hint information to output filtered hint information; and an instruction cache including a controller to receive the filtered hint information and store a first set of instructions of the code into a first portion of the instruction cache based on the filtered hint information.
In an example, the filter is to prevent the hint information of a first tagged instruction from being sent to the instruction cache when a count value associated with the hint information of the first tagged instruction deviates from a stored count value associated with the first tagged instruction.
In an example, the filter comprises a low pass filter.
In an example, the filter is to receive the hint information and send the hint information of a first tagged instruction to the instruction cache if a prior count value of the first tagged instruction is at least substantially equal to a current count value associated with the first tagged instruction included in the hint information.
In another example, a system comprises: a multicore processor and a system memory. The multicore processor may include: a plurality of cores to execute code including tagged instructions and non-tagged instructions; a plurality of instruction caches including a first instruction cache associated with a first core of the plurality of cores, the first instruction cache having a first portion to store a first subset of the tagged instructions and a second portion to store a second subset of the tagged instructions and at least some of the non-tagged instructions; a DPM to identify the first subset of the tagged instructions as having an access count greater than a threshold level; and a controller to dynamically enable at least some of the plurality of cores to access the DPM. The controller may be configured to enable the DPM, for a first duration, to receive an instruction stream of the code from the first core, maintain an access count of the tagged instructions of the instruction stream and output hint information regarding the first subset of the tagged instructions to the first instruction cache, the first subset of the tagged instructions having the access count greater than the threshold level.
In an example, the DPM is to be dynamically shared by the at least some of the plurality of cores in a time multiplexed manner.
In an example, the controller comprises a priority determination circuit to select the first core to access the DPM based at least in part on a priority of the first core.
In an example, the priority is based at least in part on a cache hit rate of the first instruction cache.
In an example, the system further comprises a filter coupled to the DPM to receive the hint information regarding the first subset of the tagged instructions and send the hint information associated with a first instruction to the first instruction cache if a prior count value associated with the first instruction is at least substantially equal to a current count value associated with the first instruction included in the hint information.
In an example, the system further comprises a static compiler to compile the code and identify some instructions of the code as the tagged instructions and identify at least one other instruction of the code as a conditionally tagged instruction.
In an example, the first core includes a run-time hardware circuit to analyze the conditionally tagged instruction and identify the conditionally tagged instruction as a tagged instruction when a run-time variable associated with the conditionally tagged instruction exceeds a threshold.
In yet another example, an apparatus comprises: profiling means for profiling tagged instructions of code in execution, the profiling means to output hint information of at least a first portion of the tagged instructions for an evaluation interval; filter means coupled to the profiling means to receive the hint information and filter the hint information to output filtered hint information; and instruction cache means including control means for receiving the filtered hint information and storing a first set of instructions of the code into a first portion of the instruction cache means based on the filtered hint information.
In an example, the filter means is to prevent the hint information of a first tagged instruction from being sent to the instruction cache means when a count value associated with the hint information of the first tagged instruction deviates from a stored count value associated with the first tagged instruction.
In an example, the filter means is to receive the hint information and send the hint information of a first tagged instruction to the instruction cache means if a prior count value of the first tagged instruction is at least substantially equal to a current count value associated with the first tagged instruction included in the hint information.
In a still further example, a method comprises: profiling tagged instructions of code in execution and outputting hint information of at least a first portion of the tagged instructions for an evaluation interval; receive the hint information and filtering the hint information to output filtered hint information; and receiving the filtered hint information and storing a first set of instructions of the code into a first portion of an instruction cache based on the filtered hint information.
In an example, the method further comprises preventing the hint information of a first tagged instruction from being sent to the instruction cache when a count value associated with the hint information of the first tagged instruction deviates from a stored count value associated with the first tagged instruction.
In an example, the method further comprises receiving the hint information and sending the hint information of a first tagged instruction to the instruction cache if a prior count value of the first tagged instruction is at least substantially equal to a current count value associated with the first tagged instruction included in the hint information.
In another example, a computer readable medium including instructions is to perform the method of any of the above examples.
In another example, a computer readable medium including data is to be used by at least one machine to fabricate at least one integrated circuit to perform the method of any one of the above examples.
In another example, an apparatus comprises means for performing the method of any one of the above examples.
Understand that various combinations of the above examples are possible.
Note that the terms “circuit” and “circuitry” are used interchangeably herein. As used herein, these terms and the term “logic” are used to refer to alone or in any combination, analog circuitry, digital circuitry, hard wired circuitry, programmable circuitry, processor circuitry, microcontroller circuitry, hardware logic circuitry, state machine circuitry and/or any other type of physical hardware component. Embodiments may be used in many different types of systems. For example, in one embodiment a communication device can be arranged to perform the various methods and techniques described herein. Of course, the scope of the present invention is not limited to a communication device, and instead other embodiments can be directed to other types of apparatus for processing instructions, or one or more machine readable media including instructions that in response to being executed on a computing device, cause the device to carry out one or more of the methods and techniques described herein.
Embodiments may be implemented in code and may be stored on a non-transitory storage medium having stored thereon instructions which can be used to program a system to perform the instructions. Embodiments also may be implemented in data and may be stored on a non-transitory storage medium, which if used by at least one machine, causes the at least one machine to fabricate at least one integrated circuit to perform one or more operations. Still further embodiments may be implemented in a computer readable storage medium including information that, when manufactured into a SoC or other processor, is to configure the SoC or other processor to perform one or more operations. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.