The field of invention pertains to the computing sciences generally, and, more specifically, to processor scheduling with thread performance estimation on cores of different types.
The memory controller 104 reads/writes data and instructions from/to system memory 108. The I/O hub 105 manages communication between the processor and “I/O” devices (e.g., non-volatile storage devices and/or network interfaces). Port 106 stems from the interconnection network 102 to link multiple processors so that systems having more than N cores can be realized. Graphics processor 107 performs graphics computations. Power management circuitry (not shown) manages the performance and power states of the processor as a whole (“package level”) as well as aspects of the performance and power states of the individual units within the processor such as the individual cores 101_1 to 101_N, graphics processor 107, etc. Other functional blocks of significance (e.g., phase locked loop (PLL) circuitry) are not depicted.
The instruction fetch stage 201 fetches “next” instructions in an instruction sequence from a cache, or, system memory (if the desired instructions are not within the cache). Instructions typically specify operand data and an operation to be performed on the operand data. The data fetch stage 202 fetches the operand data from local operand register space, a data cache or system memory. The instruction execution stage 203 contains a set of functional units, any one of which is called upon to perform the particular operation called out by any one instruction on the operand data that is specified by the instruction and fetched by the data fetch stage 202. The write back stage 204 “commits” the result of the execution, typically by writing the result into local register space coupled to the respective pipeline.
In order to avoid the unnecessary delay of an instruction that does not have any dependencies on earlier “in flight” instructions, many modern instruction execution pipelines have enhanced data fetch and write back stages to effect “out-of-order” execution. Here, the respective data fetch stage 202 of pipelines 250, 260 is enhanced to include data dependency logic 205 to recognize when an instruction does not have a dependency on an earlier in flight instruction, and, permit its issuance to the instruction execution stage 203 “ahead of”, e.g., an earlier instruction whose data has not yet been fetched.
Moreover, the write-back stage 204 is enhanced to include a re-order buffer 206 that re-orders the results of out-of-order executed instructions into their correct order, and, delays their commitment to the physical register file at least until a correctly ordered consecutive sequence of instruction execution results have retired. In order to further support out-of-order execution, results held in the re-order buffer 206 can be fed back to the data fetch stage 202 so that later instructions that depend on the results can also issue to the instruction execution stage 203.
The enhanced instruction execution pipeline is also observed to include instruction speculation logic 207 within the instruction fetch stage 201. Instruction sequences branch out into different paths depending on a condition such as the value of a variable. The speculation logic 207 studies the upcoming instruction sequence, guesses at what conditional branch direction or jump the instruction sequence will take (it guesses because the condition that determines the branch direction or jump may not have been executed or committed yet) and begins to fetch the instruction sequence that flows from that direction or jump. The speculative instructions are then processed by the remaining stages of the execution pipeline.
Here, the re-order buffer 206 of the write back stage 204 will delay the commitment of the results of the speculatively executed instructions until there is confirmation that the original guess made by the speculation logic 207 was correct. Once confirmation is made that the guess was correct, the results are committed to the architectural register file. If it turns out the guess was wrong, results in the re-order buffer 206 for the speculative instructions are discarded (“flushed”), as is the state of any in flight speculative instructions within the pipeline 200. The pipeline 200 then re-executes from the branch/jump with the correct sequence of instructions.
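By way of a purely illustrative software analogy (hypothetical structures and field names; an actual re-order buffer is implemented in hardware), the in-order commit and flush-on-misprediction behavior described above might be sketched as follows:

```c
#include <stdbool.h>
#include <stddef.h>

#define ROB_SIZE 64

/* Hypothetical, simplified re-order buffer entry: results may complete out
 * of order but are committed (retired) strictly in program order. */
struct rob_entry {
    bool valid;        /* entry holds an in-flight instruction                */
    bool completed;    /* execution result is available                       */
    bool mispredicted; /* instruction was fetched down a wrongly guessed path */
    int  dest_reg;     /* architectural destination register                  */
    long result;       /* value to commit to the architectural register       */
};

struct rob {
    struct rob_entry entry[ROB_SIZE];
    size_t head;  /* oldest in-flight instruction (next to commit)  */
    size_t tail;  /* where the next fetched instruction is inserted */
};

/* Commit results in program order: stop at the first entry whose execution
 * has not finished, and flush all younger entries when a mis-speculated
 * entry is reached (mirroring the "flush" behavior described above). */
void rob_commit(struct rob *rob, long arch_regs[])
{
    while (rob->entry[rob->head].valid && rob->entry[rob->head].completed) {
        struct rob_entry *e = &rob->entry[rob->head];
        if (e->mispredicted) {
            for (size_t i = 0; i < ROB_SIZE; i++)  /* discard speculative state */
                rob->entry[i].valid = false;
            rob->tail = rob->head;
            return;
        }
        arch_regs[e->dest_reg] = e->result;  /* architectural commit */
        e->valid = false;
        rob->head = (rob->head + 1) % ROB_SIZE;
    }
}
```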
The following description and accompanying drawings are used to illustrate embodiments of the invention.
The number of logic transistors manufactured on a semiconductor chip can be viewed as the semiconductor chip's fixed resource for processing information. A characteristic of the processor and processing core architecture discussed above is that a significant portion of this fixed resource is dedicated to speeding up the processing of the currently active threads (e.g., through out-of-order execution and instruction speculation).
The dedication of logic circuitry to the speed-up of currently active threads is achieved, however, at the expense of the total number of threads that the processor can simultaneously process at any instant of time. Said another way, if the logic circuitry units of a processor were emphasized differently, the processor might be able to simultaneously process more threads than a processor designed as described above, albeit without speeding up each individual thread to the same degree.
Because of the additional logic needed to support out-of-order execution, the out-of-order processing cores 301_1 to 301_X may be bigger than the non out-of-order execution cores 310_1 to 310_Y. Limiting the number of out-of-order execution cores “frees-up” more semiconductor surface so that, e.g., comparatively more non out-of-order execution cores can be instantiated on the die. So doing permits the processor as a whole to concurrently execute more threads.
An issue with having different core types on a same processor is overall processor throughput. Notably, the performance of certain types of threads may be noticeably “sped-up” if run on an “out-of-order” core whereas other types of threads may not be. Ideally, the former types of threads are run on the “out-of-order” core(s) while the latter types of threads are run on the “non-out-of-order” cores.
Those of ordinary skill will appreciate that the methodology described above can be extended to processors having other numbers and combinations of core types.
Note that all these examples simplify the cores as being able to execute only a single thread. As is understood in the art, each core may be multi-threaded by way of simultaneous execution (e.g., with more than one instruction execution pipeline per core) and/or concurrent execution (e.g., where a single instruction execution pipeline switches multiple threads in-and-out of active execution over a period of time). The number of measurements, estimations and thread switching options to a different core type scales with the number of threads supported by the individual cores.
Previous work by others has focused on the intensity of memory access instructions within the thread (“memory intensity”) to guide workload scheduling. This policy is based on the intuition that compute-intensive workloads benefit more from the high computational capabilities of an out-of-order core while memory-intensive workloads execute more efficiently on a non out-of-order core because their performance is dominated by waiting for memory.
While memory intensity alone can provide a good indicator for scheduling some memory-intensive workloads onto a non-out-of-order core, such practice can significantly slow down other memory-intensive workloads. Similarly, some compute-intensive workloads observe a significant slowdown on a non-out-of-order core while other compute-intensive workloads suffer only a reasonable slowdown when executing on such a core. This behavior illustrates that memory intensity (or compute intensity) alone is not a good indicator to guide application scheduling on different types of cores.
The performance behavior of workloads on non-out-of-order and out-of-order cores can be explained by the design characteristics of each core. Out-of-order cores are particularly suitable for workloads that require instruction level parallelism (ILP) to be extracted dynamically or have a large amount of simultaneously outstanding misses (memory level parallelism (MLP)). On the other hand, non out-of-order cores are suitable for workloads that have a large amount of inherent ILP (that is, ILP that need not be realized with out-of-order execution). This implies that performance on different cores can be correlated to the amount of MLP and ILP prevalent in a thread. For example, consider a memory-intensive workload that has a large amount of MLP. Executing such a memory-intensive workload on a non-out-of-order core can result in significant slowdown if the core does not expose the MLP. On the other hand, a compute-intensive workload with large amounts of inherent ILP may have only a modest slowdown on a non-out-of-order core and need not require the out-of-order core.
As such, in various embodiments, the slowdowns (or speedups) observed when moving threads between different core types can be correlated to the amount of MLP and ILP realized on a target core. Accordingly, the performance on a target core type can be estimated by predicting the MLP and ILP on that core.
According to one embodiment, a measure of each thread's performance is determined on the core type it is currently executing on, and an estimate of each thread's performance is determined for the core type it is not currently executing on; the measured and estimated values are then compared to decide whether threads should be switched between cores of different types.
As such, in one embodiment, the measured CPI of the respective cores (processes 403a,b) is expressed as:

CPInon-out-of-order=average CPIbase-non-out-of-order+average CPImem-non-out-of-order Eqn. 1a

CPIout-of-order=average CPIbase-out-of-order+average CPImem-out-of-order Eqn. 1b

where: 1) average CPIbase-non-out-of-order and average CPIbase-out-of-order correspond to the average number of cycles per instruction spent executing instructions on the respective core, exclusive of time spent waiting on memory accesses that miss in the core's cache(s); and, 2) average CPImem-non-out-of-order and average CPImem-out-of-order correspond to the average number of cycles per instruction spent waiting on such memory accesses on the respective core.
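As a point of illustration only, the decomposition of Eqns. 1a/1b might be sketched in software as follows, assuming hypothetical per-thread hardware counters (total cycles, retired instructions, and cycles stalled on off-core memory accesses) are available over the measurement interval:

```c
/* Hypothetical per-thread counter readings gathered over a measurement
 * interval on whichever core the thread is currently running on. */
struct thread_counters {
    unsigned long long cycles;           /* total cycles in the interval   */
    unsigned long long instructions;     /* instructions retired           */
    unsigned long long mem_stall_cycles; /* cycles stalled on cache misses */
};

struct cpi_breakdown {
    double cpi;      /* measured CPI (left hand side of Eqn. 1a or 1b) */
    double cpi_base; /* average CPIbase component                      */
    double cpi_mem;  /* average CPImem component                       */
};

/* CPI = CPIbase + CPImem, where CPImem is the portion attributed to
 * waiting on memory accesses that missed in the core's cache(s). */
struct cpi_breakdown measure_cpi(const struct thread_counters *c)
{
    struct cpi_breakdown b = {0.0, 0.0, 0.0};
    if (c->instructions == 0)
        return b;
    b.cpi      = (double)c->cycles / (double)c->instructions;
    b.cpi_mem  = (double)c->mem_stall_cycles / (double)c->instructions;
    b.cpi_base = b.cpi - b.cpi_mem;
    return b;
}
```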
Likewise, the estimated CPI of the respective cores (processes 404a,b) is expressed as:

CPI_ESTnon-out-of-order=estimate of the CPI of a thread currently executing on an out-of-order core if it were to be executed on a non out-of-order core=CPI_ESTbase-non-out-of-order+CPI_ESTmem-non-out-of-order Eqn. 2a

CPI_ESTout-of-order=estimate of the CPI of a thread currently executing on a non out-of-order core if it were to be executed on an out-of-order core=CPI_ESTbase-out-of-order+CPI_ESTmem-out-of-order Eqn. 2b
Respective equations and a discussion of each of the individual terms in Equations 2a and 2b are provided immediately below.
With respect to Equation 2a, CPI_ESTbase-non-out-of-order is the estimated base (non memory) component of the thread's CPI on the non out-of-order core and CPI_ESTmem-non-out-of-order is the estimated memory component of the thread's CPI on the non out-of-order core.

In an embodiment, CPI_ESTbase-non-out-of-order is determined as:

CPI_ESTbase-non-out-of-order=1/IPC_ESTbase-non-out-of-order Eqn. 3a
where IPC_ESTbase-non-out-of-order corresponds to the estimated average number of base component instructions executed per clock cycle on the non out-of-order core, which is expressed in Eqn. 9 of the Appendix and is restated here as Eqn. 3b:

IPC_ESTbase-non-out-of-order=Σi=1 to Wnon-out-of-order(i×P[IPC=i]) Eqn. 3b

which corresponds to the expected number of instructions issued per cycle on the non out-of-order core.
Here, various instruction execution pipelines are often capable of simultaneously issuing more than one instruction for a thread at a given time. Eqn. 3b essentially attempts to estimate how many instructions will be issued in parallel for the thread if the thread were to be executed on the non-out-of-order core. Of import here is that instructions that have dependencies on one another will not issue in parallel. That is, the instruction execution pipeline will prevent an instruction from issuing in the same cycle as an earlier instruction in the stream on which it has a dependency.
Here, P[IPC=i] in Eqn. 3b corresponds to the probability that i instructions will issue in parallel (so that i×P[IPC=i] is that case's contribution to the expected issue rate) and Wnon-out-of-order corresponds to the issue width of the non out-of-order core. The probability of issuing i instructions in parallel can be determined through observance of the “dependency distance” (D) of the thread as it is executing on the out-of-order core. Here, the dependency distance D is essentially a measure of the number of instructions that typically reside between an earlier instruction and a later instruction that depends on its result. The reader is referred to the Appendix, sec. 3.2.2 (“Predicting small core ILP on a big core”) for more details concerning this calculation.
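A minimal sketch of Eqns. 3a/3b follows, assuming the probability distribution P[IPC=i] has already been derived from the dependency distances observed on the out-of-order core (the derivation itself is described in the Appendix and is not repeated here):

```c
/* Eqn. 3b (sketch): expected issue rate on a non out-of-order core of issue
 * width "width", given P[IPC=i] for i = 1..width (probabilities summing to 1).
 * p_ipc[i] holds P[IPC=i]; index 0 is unused. */
double predict_nonooo_ipc(const double p_ipc[], int width)
{
    double ipc_est = 0.0;
    for (int i = 1; i <= width; i++)
        ipc_est += (double)i * p_ipc[i];
    return ipc_est;
}

/* Eqn. 3a (sketch): the base CPI estimate is the reciprocal of the
 * estimated number of instructions issued per cycle. */
double predict_nonooo_cpi_base(const double p_ipc[], int width)
{
    double ipc_est = predict_nonooo_ipc(p_ipc, width);
    return (ipc_est > 0.0) ? 1.0 / ipc_est : 0.0;
}
```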
In an embodiment, CPI_ESTmem-non-out-of-order of Eqn. 2a is determined as:

CPImem-out-of-order×(MLPout-of-order/MLP_ESTnon-out-of-order) Eqn. 3c

where CPImem-out-of-order is the average number of cycles per instruction spent waiting on memory accesses as measured for the thread while it executes on the out-of-order core, MLPout-of-order is the memory level parallelism (e.g., the average number of simultaneously outstanding misses) observed for the thread on the out-of-order core, and MLP_ESTnon-out-of-order is the estimated memory level parallelism for the thread if it were to be executed on the non out-of-order core.
In an embodiment, MLP_ESTnon-out-of-order in Eqn. 3c above is calculated as follows for a “stall-on-use” core:
MLP_ESTnon-out-of-order=MPIout-of-order×D Eqn. 3d
where MPIout-of-order is the number of memory access instructions that resulted in a cache miss per instruction as observed for the thread as it is executing on the out-of-order core (e.g., as calculated by dividing the number of memory access instructions that resulted in a cache miss by the total number of instructions for the thread over a set time period) and D is the dependency distance for the thread as it is executing on the out-of-order core (e.g., as calculated by tracking the dependency distance of each instruction having a dependency and taking the average thereof).
A stall-on-use core is a core that will stall the thread when an instruction attempts to use data that is not yet available (e.g., because an earlier memory access that produces the data has not yet completed). In an embodiment, MLP_ESTnon-out-of-order=1 for a “stall-on-miss” core. A stall-on-miss core will stall a thread as soon as a memory access instruction misses in the core's cache(s) (or the core's and processor's cache(s)), requiring a data fetch from outside the core or processor.
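Pulling Eqns. 2a and 3a-3d together, a hypothetical sketch of the non out-of-order core CPI estimate (building on the IPC prediction sketch above, and assuming the listed inputs were measured on the out-of-order core) might look as follows:

```c
/* Hypothetical measurements taken while the thread runs on the out-of-order core. */
struct ooo_profile {
    double cpi_mem;      /* measured CPImem on the out-of-order core */
    double mlp;          /* measured memory level parallelism        */
    double mpi;          /* cache misses per instruction             */
    double dep_distance; /* average dependency distance D            */
};

/* Sketch of Eqn. 2a using Eqns. 3a-3d: estimate the CPI the thread would
 * exhibit if migrated to the non out-of-order core. */
double estimate_nonooo_cpi(const struct ooo_profile *p,
                           const double p_ipc[], int nonooo_width,
                           int stall_on_use)
{
    /* Eqns. 3a/3b: base component from the predicted issue rate. */
    double cpi_base_est = predict_nonooo_cpi_base(p_ipc, nonooo_width);

    /* Eqn. 3d: MLP estimate; a stall-on-miss core exposes no MLP (MLP = 1). */
    double mlp_est = stall_on_use ? p->mpi * p->dep_distance : 1.0;

    /* Eqn. 3c: scale the measured memory component by the MLP ratio
     * (guard against a zero estimate when no misses were observed). */
    double cpi_mem_est = (mlp_est > 0.0)
                             ? p->cpi_mem * (p->mlp / mlp_est)
                             : 0.0;

    /* Eqn. 2a */
    return cpi_base_est + cpi_mem_est;
}
```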
With respect to Equation 2b above, CPI_ESTbase-out-of-order is the estimated base (non memory) component of the thread's CPI on the out-of-order core. In an embodiment, CPI_ESTbase-out-of-order is determined as:

CPI_ESTbase-out-of-order=1/Wout-of-order Eqn. 4a
where Wout-of-order is the width of the out-of-order core (that is, the number of instructions that the out-of-order core can issue in parallel). Here, an out-of-order core by design attempts to avoid stalls associated with data dependency issues by issuing instructions from deeper back in the instruction queue that do not have any dependency issues on current in flight instructions.
In an embodiment, CPI_ESTmem-out-of-order of Eqn. 2b is determined as:

CPImem-non-out-of-order×(MLPnon-out-of-order/MLP_ESTout-of-order) Eqn. 4b

where CPImem-non-out-of-order is the average number of cycles per instruction spent waiting on memory accesses as measured for the thread while it executes on the non out-of-order core, MLPnon-out-of-order is the memory level parallelism observed for the thread on the non out-of-order core, and MLP_ESTout-of-order is the estimated memory level parallelism for the thread if it were to be executed on the out-of-order core.
In an embodiment, MLP_ESTout-of-order is expressed as:
MPInon-out-of-order(ROBsize) Eqn. 4c
where MPInon-out-of-order is the number of memory access instructions that resulted in a cache miss per instruction as observed for the thread as it is executing on the non-out-of-order core (e.g., as calculated by dividing the number of memory access instructions that resulted in a cache miss by the total number of instructions for the thread over a set time period) and ROBsize is the size of the re-order buffer of the out-of-order core.
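Similarly, a hypothetical sketch of Eqns. 2b and 4a-4c (reading Eqn. 4c as the product of MPInon-out-of-order and the re-order buffer size, and assuming the listed inputs were measured on the non out-of-order core) might look as follows:

```c
/* Hypothetical measurements taken while the thread runs on the non out-of-order core. */
struct nonooo_profile {
    double cpi_mem; /* measured CPImem on the non out-of-order core */
    double mlp;     /* measured memory level parallelism            */
    double mpi;     /* cache misses per instruction                 */
};

/* Sketch of Eqn. 2b using Eqns. 4a-4c: estimate the CPI the thread would
 * exhibit if migrated to the out-of-order core. */
double estimate_ooo_cpi(const struct nonooo_profile *p,
                        int ooo_width, int rob_size)
{
    /* Eqn. 4a: assume the out-of-order core issues at its full width. */
    double cpi_base_est = 1.0 / (double)ooo_width;

    /* Eqn. 4c (as read here): MLP exposed by the out-of-order core scales
     * with the number of misses falling within its re-order buffer window. */
    double mlp_est = p->mpi * (double)rob_size;

    /* Eqn. 4b: scale the measured memory component by the MLP ratio
     * (guard against a zero estimate when no misses were observed). */
    double cpi_mem_est = (mlp_est > 0.0)
                             ? p->cpi_mem * (p->mlp / mlp_est)
                             : 0.0;

    /* Eqn. 2b */
    return cpi_base_est + cpi_mem_est;
}
```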
If the scheduling intelligence 530 identifies a pair of threads executing on different cores where migration of the non-out-of-order core thread to the out-of-order core would cause a noticeable speedup but migration of the out-of-order core thread to the non-out-of-order core would not cause a noticeable slowdown, the two threads are swapped (each is migrated to the other's core). In order to migrate the threads, their respective state, which consists of data and control information in the respective register space of the different cores (not shown), is switched between the cores. As such, switching circuitry 505 (e.g., a switching network) resides between the cores of different types to effect the switchover of state information for the two threads between the two cores. If threads are not switched internally within the processor, the respective state of the switching threads may be saved externally (e.g., to system memory) and then loaded back into their new respective cores.
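As one illustrative (not limiting) sketch of the pairwise migration decision described above, the scheduling intelligence might compare measured and estimated CPI values for a candidate pair of threads against speedup/slowdown thresholds; the threshold values below are hypothetical:

```c
/* Measured CPI on the thread's current core and estimated CPI on the other
 * core type (per Eqns. 1a/1b and 2a/2b). */
struct thread_perf {
    double cpi_measured;
    double cpi_estimated;
};

/* Hypothetical pairwise decision: swap the two threads when the thread on
 * the non out-of-order core is expected to speed up noticeably on the
 * out-of-order core while the thread on the out-of-order core is not
 * expected to slow down noticeably. Threshold values are illustrative. */
int should_swap(const struct thread_perf *on_nonooo,
                const struct thread_perf *on_ooo,
                double speedup_threshold,   /* e.g., 1.2 */
                double slowdown_threshold)  /* e.g., 1.1 */
{
    /* Expected speedup if the non out-of-order core thread migrates. */
    double speedup  = on_nonooo->cpi_measured / on_nonooo->cpi_estimated;
    /* Expected slowdown if the out-of-order core thread migrates. */
    double slowdown = on_ooo->cpi_estimated / on_ooo->cpi_measured;

    return speedup >= speedup_threshold && slowdown <= slowdown_threshold;
}
```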
In an embodiment, each core has scheduling intelligence circuitry to not only determine (or at least help determine) a measurement of the performance of the thread that it is executing but also to determine the estimated measurement of performance for the thread if it were to execute on a core of a different type.
Although not explicitly labeled, note that the processor 500 described above may also include other components of the kind discussed at the outset of this description.
The teachings herein are also supplemented with Appendix materials appearing at the end of this application. Notably, the Appendix refers to an out-of-order core as a “big” core and a non-out-of-order core as a “small” core.
As any of the logic processes taught by the discussion above may be performed with a controller, micro-controller or similar component, such processes may be implemented with program code such as machine-executable instructions that cause a machine that executes these instructions to perform certain functions. Processes taught by the discussion above may also be performed (in the alternative to the execution of program code or in combination with the execution of program code) by electronic circuitry designed to perform the processes (or a portion thereof).
It is believed that processes taught by the discussion above may also be described in source level program code in various object-oriented or non-object-oriented computer programming languages. An article of manufacture may be used to store program code. An article of manufacture that stores program code may be embodied as, but is not limited to, one or more memories (e.g., one or more flash memories, random access memories (static, dynamic or other)), optical disks, CD-ROMs, DVD ROMs, EPROMs, EEPROMs, magnetic or optical cards or other type of machine-readable media suitable for storing electronic instructions. Program code may also be downloaded from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a propagation medium (e.g., via a communication link (e.g., a network connection)).
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Kenzo Van Craeynest, Aamer Jaleel, Lieven Eeckhout, Paolo Narvaez, Joel Emer, “Scheduling Heterogeneous Multi-cores through Performance Impact Estimation (PIE),” Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), 2012, pp. 20-31.