This invention relates to profiling circuitry for profiling software execution on a processor.
It can be useful for software developers to be able to determine which parts of a program take the most time for a processor to execute. Instruction-level software profiling tools can allow a developer to determine this down to the level of individual processor instructions. To support such profiling, a processor may have associated profiling circuitry, comprising hardware logic for sampling a currently-executing instruction. By sampling at regular intervals as a program executes, and counting how often each instruction appears in the collected sample data, it can be statistically determined how much time the processor spends executing each instruction.
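By way of illustration only, the statistical principle described above can be sketched in software. The following Python model (whose function name and data layout are purely illustrative, and which is not part of the profiling circuitry itself) counts how often each sampled instruction address appears and converts the counts into estimated shares of execution time:

```python
from collections import Counter

def estimate_time_shares(sampled_pcs):
    """Estimate the fraction of execution time spent at each
    instruction address from a stream of sampled program counters.

    Assumes samples were taken at (approximately) regular intervals,
    so each sample represents an equal slice of execution time.
    """
    counts = Counter(sampled_pcs)
    total = sum(counts.values())
    return {pc: n / total for pc, n in counts.items()}

# Four samples: address 0x100 was observed twice, so it is estimated
# to account for half of the execution time.
shares = estimate_time_shares([0x100, 0x104, 0x100, 0x108])
```

An address appearing in half of the samples is thus estimated to account for half of the execution time, to within the statistical error inherent in sampling.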
However, modern processors typically execute program instructions out of program order and with instruction-level parallelism (i.e. superscalar), making it difficult to collect meaningful statistics. A processor that supports out-of-order instruction execution, e.g. according to the Tomasulo algorithm, typically has a reorder buffer for tracking the state of all inflight instructions, as instructions need to be committed in program order to provide precise exceptions. The modifications that an instruction makes to the architectural state maintained by the processor become visible to software when the instruction is committed.
Embodiments of the present invention seek to provide profiling circuitry that supports more accurate profiling on processors, including, but not limited to, processors that execute instructions out of program order.
From a first aspect, the invention provides profiling circuitry for a processor, wherein the profiling circuitry comprises:
From a further aspect, the invention provides a processing system comprising:
From a further aspect, the invention provides a method for instruction-level profiling comprising:
Thus it will be seen that, in accordance with embodiments of the invention, the instruction that was most recently committed (referred to herein as “last-committed instruction” or “LCI”) can be sampled, in addition to the one or more instructions that are next to be committed. This contrasts with naïve approaches in which only the next-committing instruction (referred to herein as “next-committing instruction” or “NCI”)—e.g. the instruction that is at the head of a reorder buffer at each sample point—is sampled, or in which only the last-committed instruction is sampled. By being able to sample both the next-committing instruction(s) and the last-committed instruction, profiling circuitry can enable more accurate profiles to be created of software executed on the processor, as explained in greater detail below. In particular, it can allow more accurate indications of which instructions are consuming the most processing time.
The processor may support execution of instructions out of program order. However, this is not essential and, in other embodiments, the processor is configured always to execute instructions in program order.
The processor may be a scalar or superscalar processor.
The information stored (and used) by the processor for committing inflight instructions in program order may be stored in commit-information circuitry, which may be a reorder buffer (e.g. in embodiments in which the processor supports execution of instructions out of program order), or a scoreboard (e.g. in embodiments in which the processor always executes instructions in program order), or other appropriate circuitry such as hazard detection logic (e.g. in embodiments in which the processor has a five-stage pipelined architecture).
The state-determining circuitry is preferably configured to determine when a reorder buffer or pipeline of the processor has been flushed (e.g. due to a branch misprediction or to handle an exception). It may be configured to determine this from a state of the reorder buffer or pipeline—e.g. from one or more flush bits associated with each reorder buffer entry. This approach is compatible with existing processors, since even processors that flush wrong-path instructions immediately upon discovering that a branch is mispredicted typically include a flush bit in the reorder buffer to implement instructions that always flush (e.g. fence instructions).
The state-determining circuitry may be configured to determine which of the following four states the processor is in: computing, stalled, drained or flushed. It may be configured to signal the state to the sampling circuitry and/or to include state data encoding the state in the sample data.
The first commit state may be or include that a reorder buffer or pipeline of the processor contains one or more instructions (e.g. with the processor actively computing or stalled). It may be or include that a reorder buffer or pipeline of the processor has drained (e.g. due to an instruction cache miss or an instruction Translation Lookaside Buffer (TLB) miss). The first commit state may be or include both of these situations—i.e. when a reorder buffer or pipeline contains one or more instructions or has drained. However, the first commit state preferably excludes the reorder buffer or pipeline being in a flushed state. Thus, the first commit state may comprise the computing state, the stalled state and the drained state.
The second commit state may be that the reorder buffer or pipeline has been flushed (e.g. due to a branch misprediction or an exception). Thus, the second commit state may be the flushed state. By identifying the last-committed instruction in such a situation, the sample data identifies the instruction that caused the misprediction (i.e. the branch instruction that was mispredicted), thus allowing a statistical profile to more accurately assign clock cycles to the instructions that are the cause of processing delays. This may be useful for accurate profiling not just on out-of-order processors, but also on in-order processors, as it can allow profiling software to take account of the fact that some in-order processors sometimes flush the pipeline in response to an instruction being committed.
The sampling circuitry may be configured to output successive sample data at output instants separated by output intervals. Respective sample data may thus be associated with respective output instants, which may correspond to different respective processor clock cycles. The output intervals may be regular intervals or may be irregular (e.g. pseudorandom) intervals. In some embodiments, the sampling circuitry may output sample data (e.g. to a sample register) at every processor clock cycle, although it may output the sample data less often than this. In such embodiments, software (e.g. profiling software executing on the processor or a further processor) or other circuitry (e.g. a performance management unit (PMU)) may read all or a fraction of the sample data; for example, it may fetch, at regular or irregular collection intervals, the latest sample data stored in the sample register or memory.
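By way of illustration, the relationship between a per-cycle output rate and a lower collection rate can be modelled as follows. This Python sketch (all names illustrative) treats the per-cycle sample-register values as a sequence from which a PMU or profiling software reads only every k-th value, discarding the rest:

```python
def collect_samples(sample_register_stream, collection_interval):
    """Model of a PMU reading only the most recent sample-register
    value at each collection instant.

    `sample_register_stream` is the per-cycle sequence of values that
    the sampling circuitry writes to the sample register; the PMU
    reads every `collection_interval`-th value.
    """
    return [value for cycle, value in enumerate(sample_register_stream)
            if cycle % collection_interval == 0]

# The register is updated every cycle, but only cycles 0 and 3 are read.
collected = collect_samples([10, 11, 12, 13, 14, 15], 3)
```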
In some embodiments, the sampling circuitry may be configured, when the processor is in the first commit state, to output sample data that does not identify the last-committed instruction (i.e. that only identifies the one or more or all instructions that are next to be committed by the processor). It may be configured, when the processor is in a second commit state, to output sample data that does not identify any next-committing instruction (i.e. that only identifies the instruction that was last committed by the processor).
However, in other embodiments, the sampling circuitry may be configured to output sample data, associated with a common output instant or processor clock cycle, that identifies one or more instructions that are next to be committed by the processor and that also identifies an instruction that was last committed by the processor. In this case, profiling software may use state data output by the state-determining circuitry to discard instructions that are not relevant and to attribute a sample count only to one or more NCIs or only to the LCI, depending on the state data.
The state-determining circuitry may be configured to output state data to the sample register or memory that is representative of the commit state of the processor. In this way, a performance monitoring unit (PMU) and/or profiling software may subsequently determine the commit state of the processor by examining the output state data, associated with any given output event. Profiling software may use the state data to determine whether to count only one next-committing instruction (e.g. only the oldest such instruction), or every next-committing instruction, or the last-committed instruction, identified in the sample data associated with the output event, when generating a statistical profile across a plurality of output events. However, in other embodiments, profiling software may only receive identification of either the NCI(s) or LCI (but not both), at each output event, depending on the processor commit state.
In some preferred embodiments, the sampling circuitry is configured—at least in some situations, such as when the processor is in a computing state (e.g. not stalled) and will commit a plurality of instructions at the next commit cycle—to identify a plurality of instructions that are next to be committed by the processor in the next commit cycle (i.e. in a common processor clock cycle). The sample data may identify all of said plurality of next-committing instructions. The sample data may identify every next-committing instruction for the common clock cycle.
This contrasts with a naïve approach in which at most a single instruction, from the head of a reorder buffer, is identified, with any other instructions that will be committed simultaneously (i.e. by a superscalar processor) not being counted. This can lead to a biased profile, resulting in inaccurate insights into which instructions are consuming the most clock cycles. By identifying multiple next-committing instructions, embodiments can enable profiling software to generate more accurate profiles using the output data, e.g. by assigning an equal fractional clock-cycle count to each next-committing instruction, at least when the processor is in a computing state.
Thus, in some embodiments, the sampling circuitry is configured, when the processor is in a computing state (or in a drained state, for some embodiments) and will commit a plurality of instructions in the next commit cycle, to output sample data to the sample register or memory that identifies said plurality of instructions that are to be committed by the processor in the next commit cycle. This enables profiling software to assign count values to each of these instructions. In particular, profiling software may assign positive sample counts to all of the NCIs when in the computing state. While it could also do so for the drained state, in preferred embodiments, profiling software attributes a sample count only to the single oldest (i.e. first-committing) instruction when there is a plurality of NCIs. When the processor is in a computing state or drained state and has only a single instruction to commit next, the sampling circuitry may output sample data identifying said next-committing instruction.
The sampling circuitry may further be configured, when the processor is in a stalled state, to output sample data to the sample register or memory that identifies a single instruction that is next to be committed by the processor. This enables profiling software to assign a count value just to this one instruction. At least when stalled, this may be the oldest instruction in the reorder buffer or pipeline—i.e. the instruction at the head of the reorder buffer or pipeline. (In some embodiments, in which the processor comprises a plurality of reorder buffer banks, the sample data may identify a plurality of instructions, e.g. one instruction from each bank, with the sample data additionally comprising data indicating which one of these instructions is the oldest.) In the drained state, it is possible the next-committing instruction might not yet be in the reorder buffer or pipeline.
In such embodiments, the sampling circuitry may further be configured, when the processor is in a flushed state, to output sample data to the sample register or memory that identifies an instruction that was last committed by the processor.
Some embodiments may advantageously implement the idea of sometimes identifying a plurality of NCIs without necessarily also being configured to identify the last-committed instruction when in a second state.
Thus, from another aspect, the invention provides profiling circuitry for a processor, wherein the profiling circuitry comprises sampling circuitry configured to identify a plurality of instructions that are to be committed by the processor in a common processor clock cycle, and to output sample data to a sample register or a memory that identifies all of the plurality of instructions.
From a further aspect, the invention provides a processing system comprising:
From a further aspect, the invention provides a method for instruction-level profiling comprising identifying a plurality of instructions that are to be committed by a processor in a common processor clock cycle, and writing sample data to a sample register or a memory that identifies all of the plurality of instructions.
Any features of embodiments embodying the preceding aspects may be features of embodiments of these aspects also, and vice versa. The sampling circuitry may be configured to identify such a plurality of next-committing instructions whenever the processor is in a computing state and will commit a plurality of instructions at the next commit cycle.
The sampling circuitry may be configured to write sample data to the sample register or memory (i.e. to generate a sample) in response to an internal event timer within the sampling circuitry, but in a preferred set of embodiments it is configured to do so in response to receiving a command from outside the sampling circuitry—e.g. from a PMU.
In some embodiments of any of the aspects disclosed herein, the sampling circuitry may output the sample data directly to a volatile or non-volatile memory (e.g. RAM) at every output instant. The processing system may comprise a memory (e.g. RAM), to which the sampling circuitry may write the sample data. However, in other embodiments, the profiling circuitry comprises a sample register (i.e. comprising a plurality of flip-flops), to which the sampling circuitry outputs the sample data (at least initially) at every output instant. The sample register may be sized for storing data identifying a plurality of instructions—e.g. for storing a plurality of instruction memory addresses. It may be sized for storing data identifying a plurality of next-committing instructions. It may be sized for storing data identifying at least one next-committing instruction and a last-committed instruction. It may be sized for storing data identifying at least as many instructions as the commit width of the processor. The sample register or memory may be implemented in any appropriate way, and may be split across multiple distinct regions or locations.
The profiling circuitry (or the processing system more widely) may comprise a performance monitoring unit (PMU), which may be arranged to collect some or all of the sample data from the sample register at regular or irregular sampling intervals. It may write the collected sample data to a volatile or non-volatile memory (e.g. RAM or hard-drive). The sampling intervals may correspond to a sampling rate that is lower than an output rate at which the profiling circuitry updates the sample register. Alternatively, the profiling circuitry may be controlled to output the sample data to the sample register at the same rate as the PMU collects the sample data. The sampling circuitry may be configured to inform the PMU that new sample data has been written to the sample register. The PMU may be configured to trigger an interrupt of the processor after collecting the sample data. The interrupt may invoke an interrupt handler in profiling software stored in a memory of the processing system.
More generally, the processing system may comprise a memory storing profiling software comprising instructions for processing at least some of the sample data to generate an instruction-level profile of software executed by the processor. The profiling software may be executed by the same processor (although it could be executed by a different processor of the processing system in some embodiments).
The profiling software may comprise instructions for analysing a software application executed by the processor. The profiling software may comprise instructions for determining count values for instructions of the software. It may comprise instructions to increment a count value for an instruction when that instruction is identified as a next-committing instruction and the processor is in the first commit state. It may comprise instructions to increment count values for one or more instructions by equal amounts (e.g. equal fractional amounts when there are a plurality of instructions) when the instructions are identified as next-committing instructions for a common clock cycle and the processor is in a computing state. However, it may comprise instructions to increment a count value of only an oldest of a plurality of next-committing instructions for a common clock cycle when the processor is in a stalled state. Similarly, it may comprise instructions to increment a count value of only an oldest of a plurality of next-committing instructions for a common clock cycle when the processor is in a drained state. It may comprise instructions to increment a count value for an instruction when that instruction is identified as the last-committed instruction and the processor is in the second commit state.
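The attribution policy described in this paragraph may be sketched, purely by way of example, as the following Python function. The state names, argument layout and count representation are illustrative assumptions rather than a definitive implementation of the profiling software:

```python
COMPUTING, STALLED, DRAINED, FLUSHED = "computing", "stalled", "drained", "flushed"

def attribute_sample(counts, state, nci_addrs=(), lci_addr=None):
    """Update per-instruction count values for one sample.

    `counts` maps instruction addresses to accumulated counts;
    `nci_addrs` lists next-committing instruction addresses, oldest
    first; `lci_addr` is the last-committed instruction address.
    """
    if state == COMPUTING:
        # Split the sample equally across all instructions that
        # commit together in the next commit cycle.
        for addr in nci_addrs:
            counts[addr] = counts.get(addr, 0.0) + 1.0 / len(nci_addrs)
    elif state in (STALLED, DRAINED):
        # Charge only the oldest next-committing instruction.
        counts[nci_addrs[0]] = counts.get(nci_addrs[0], 0.0) + 1.0
    elif state == FLUSHED:
        # Charge the last-committed instruction, which triggered the flush.
        counts[lci_addr] = counts.get(lci_addr, 0.0) + 1.0
    return counts
```

Under this sketch, a sample taken in the computing state with two NCIs adds half a count to each, whereas the same two NCIs in the stalled state add a full count only to the oldest.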
The sampling circuitry may be configured to output sample data that identifies which is an oldest of a plurality of next-committing instructions identified in the sample data. This may be used for accurately allocating clock cycles when the processor is in a stalled state. The processor may have a reorder buffer comprising a plurality of banks (e.g. if the processor is a Berkeley Out-of-Order Machine processor), and the sampling circuitry may be configured to access a pointer to the head bank and a pointer to the head column and to use these pointers to determine the oldest next-committing instruction.
The sampling circuitry may be configured to output sample data that identifies which is a youngest of a plurality of next-committing instructions identified in the sample data. This may be used for accurately allocating clock cycles when the reorder buffer or pipeline is in a flushed state. It may store the address of the youngest such instruction in a register (e.g. an “offending-instruction register”), which it may update on every processor clock cycle, and may copy the address from the offending-instruction register to the sample register or memory, e.g. in response to detecting there is no valid instruction at the head of the reorder buffer and/or that the processor is in the flushed state.
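Purely by way of illustration, the behaviour of such an offending-instruction register may be modelled as follows. The class and method names are hypothetical; a hardware implementation would of course use registers and combinational logic rather than software:

```python
class OffendingInstructionTracker:
    """Model of a register tracking the youngest next-committing
    instruction each cycle, so that its address remains available
    after a flush has emptied the reorder buffer."""

    def __init__(self):
        self.offending_addr = None

    def cycle(self, youngest_nci_addr, rob_head_valid):
        if rob_head_valid:
            # Keep following the youngest in-flight instruction.
            self.offending_addr = youngest_nci_addr
            return None
        # No valid instruction at the ROB head: a flush has occurred,
        # so report the address saved on the previous cycle.
        return self.offending_addr
```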
The sample data may identify instructions in any appropriate way. It may comprise an address of each identified instruction.
The sampling circuitry may be configured to exclude from the sample data any invalid next-committing instruction (e.g. any instruction identified as not valid in the reorder buffer). Alternatively, it may be configured to output sample data that identifies all next-committing instructions (i.e. valid and invalid) and that also identifies which of the next-committing instructions are valid (e.g. by setting a respective binary “validity” flag in the sample register, associated with each respective next-committing instruction).
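The second option, in which a validity flag accompanies each entry, may be modelled as follows. The tuple layout is an illustrative assumption about how profiling software might read the sample register:

```python
def valid_ncis(sample_entries):
    """Filter a sample's next-committing-instruction entries down to
    the valid ones, given (address, validity_flag) pairs as might be
    read from a sample register with per-entry validity bits."""
    return [addr for addr, valid in sample_entries if valid]

# One entry marked invalid in the reorder buffer is dropped.
addrs = valid_ncis([(0x100, True), (0x104, False), (0x108, True)])
```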
The sampling circuitry may be configured to output a stalled signal, for a processor clock cycle, (e.g. as part of the sample data) if the reorder buffer or pipeline contains one or more instructions, but no instructions are being committed in the clock cycle.
The sampling circuitry may be configured to output a flushed signal, for a processor clock cycle, (e.g. as part of the sample data) if an instruction has triggered a flush of the reorder buffer or pipeline.
The processor and profiling circuitry may, in some embodiments, be implemented as a semiconductor integrated circuit, e.g. on a silicon chip.
Features of any aspect or embodiment described herein may, wherever appropriate, be applied to any other aspect or embodiment described herein. Where reference is made to different embodiments or sets of embodiments, it should be understood that these are not necessarily distinct but may overlap.
Certain preferred embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Various example embodiments are described in detail below, and experimental results are provided which demonstrate the effectiveness of the disclosed approaches.
First, some background and context is provided, to help the reader better understand the embodiments and some of the advantages they provide over other approaches.
Software developers can use tools that automatically attribute processor execution time to source code constructs, such as instructions, basic blocks, and functions, in order to improve the efficiency of the software they write.
Software profilers generate performance profiles for software applications. A performance profile statistically attributes processor execution time to different application-level symbols. Depending on the use case, developers can select symbols at different granularities, including: functions, basic blocks, and individual instructions.
Gathering profiles without hardware support is inherently inaccurate. We refer to this herein as a “Software” approach. Software-level profilers (e.g. Linux perf, operating in a default configuration) interrupt the application and retrieve the address of the instruction that execution will resume from after the interrupt has been handled. Hence the current inflight instructions will drain before the interrupt handler is executed, which means that the sampled instruction can be tens or even hundreds of instructions away from the instruction(s) that the processor was committing at the time the sample was taken. This phenomenon is known as “skid” and can be addressed by adding hardware support for instruction sampling. Known hardware-based profiling circuitry includes Intel™ PEBS, AMD™ IBS, and Arm™ SPE.
Hardware-supported profiling enables sampling in-flight instructions without interrupting the application and hence eliminates skid by effectively removing the latency from sampling decision to sample collection. Hardware profilers rely on sampling, which involves collecting and logging instruction addresses at regular time intervals. The theory is that the number of sample counts assigned to a particular instruction over a profiling period indicates how much time the processor spent executing that instruction. However, modern processors implement speculative processing in which instructions are executed out-of-order, which can reduce the accuracy of known profiling approaches.
Although all hardware profilers use such sampling, they differ in their policies for selecting which instruction to attribute a sample point to, at each sample interval.
A first approach, used by Intel's Processor Event-Based Sampling (PEBS), is to return the address of the next instruction that commits after the sample is taken. We will refer to this herein as a “next-committing instruction (NCI)” heuristic.
A second approach, used by profilers that use debug interfaces, such as Arm CoreSight, is to systematically sample the last-committed instruction. We refer to this herein as a “last-committed instruction (LCI)” heuristic.
A third approach, used by AMD's Instruction-Based Sampling (IBS) and Arm's Statistical Profiling Extension (SPE), is first to tag an instruction at dispatch and then to retrieve the sample when the instruction commits. Unlike the commit-focused approaches, this enables gathering data about how this instruction flows through the processor back-end. We refer to this as a “Dispatch” heuristic.
However, none of these heuristics is accurate under all situations.
In order to analyse the effectiveness of different approaches, we present a “golden reference” profiler, which we refer to herein as the “Oracle” profiler. Although this profiler embodies aspects of the invention, and it is possible to implement it on a device, it is not expected to be used widely in practice, due to the huge amount of data it processes: rather than using a statistical sampling approach, the Oracle profiler samples every single instruction as it is committed by a processor. A fundamental principle behind the Oracle profiler is the recognition that an accurate profiler must perform time-proportional attribution, i.e. with every clock cycle being attributed to the particular instruction or instructions whose latency is exposed by the processor. The Oracle profiler focuses on the processor's commit stage, because this is where the latency cost of each instruction is resolved and becomes visible to software. More specifically, the best-case instruction latency in a processor that can commit w instructions per cycle is 1/w cycles, meaning that the processor has been able to hide all of the instruction latency except for 1/w cycles. If the processor is unable to fully hide an instruction's execution latency, the instruction will stall at the head of the reorder buffer (ROB) and thereby block forward progress; i.e. the time for which commit is blocked is the instruction's contribution to the application's execution time.
The Oracle profiler provides a benchmark for establishing the accuracy of naïve hardware performance profiling approaches. Section 4, below, provides details of an experimental setup and error metric that can be used for such benchmarking.
However, the Oracle profiler is not a practicable approach for everyday use, due to the very large volume of data that it outputs. We therefore present below a sampling-based profiler, which we call the “Time-Proportional Instruction Profiler (TIP)”, that likewise implements a principle of focusing on the commit stage of an associated processor, but which uses periodic sampling, rather than collecting comprehensive data as the Oracle approach does, in order to generate a manageable amount of statistical data for analysis.
TIP combines the time-attribution policies of Oracle with statistical sampling, thereby reducing the amount of profiling data by several orders of magnitude compared to Oracle (e.g. generating sample data at a rate of 192 KB/s versus 179 GB/s for the Oracle profiler, at a 4 KHz sampling frequency). The use of sampling has the potential to introduce statistical error; however, we have determined that the amount of error introduced in practice is often negligible, as shown by
The two graphs in
While low profile error is attractive, a primary benefit of accurate performance profiling comes from helping developers write more efficient software applications. To illustrate that TIP's accuracy matters in practice, we used TIP and NCI to analyze the SPEC CPU2017 benchmark Imagick. Although both TIP and NCI are accurate at the function-level (0.3% and 0.6% average error, respectively), the function-level profile does not clearly identify the performance problem; this is a challenge with function-level profiles as developers use functions to organize functionality rather than performance. At the instruction-level, TIP correctly attributed time to Control and Status Register (CSR) instructions that cause pipeline flushes whereas NCI misattributed execution time to the next-committing instruction (see Section 6 for details). Interestingly, Imagick does not need to execute the CSR instructions, and replacing them with nop instructions yielded a 1.93 times speed-up compared to the original, mostly due to the second-order effect that removing flushes improves the processor's ability to hide latencies.
Practical performance profilers rely on statistical sampling to create a profile, i.e. they randomly retrieve the address of a currently executing instruction. Embodiments disclosed herein rely on the realisation that, since sampling is random in time, in order to get accurate sampling, the probability of sampling an instruction—and time hence being attributed to it—should be proportional to the instruction's impact on overall execution time. We refer to this principle as time-proportional attribution. Consider for example a processor that executes a single instruction at a time: an instruction that takes two clock cycles to execute should be attributed twice as much time as a single-cycle instruction.
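The time-proportional principle can be illustrated with a simple Python calculation (all names illustrative): for a processor executing one instruction at a time, a uniformly random-in-time sample should land on each instruction with probability proportional to that instruction's latency.

```python
def expected_sample_shares(instr_latencies):
    """For a processor executing one instruction at a time, the
    probability of a uniformly random-in-time sample landing on an
    instruction equals that instruction's share of the total
    execution time."""
    total = sum(instr_latencies.values())
    return {instr: lat / total for instr, lat in instr_latencies.items()}

# A two-cycle instruction should attract twice the samples of a
# one-cycle instruction.
shares = expected_sample_shares({"A": 2, "B": 1})
```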
Understanding why sampling at the commit stage enables time-proportional attribution requires going into some detail on how an out-of-order processor operates. Out-of-order processors consist of an in-order front-end that fetches and decodes instructions, predicts branches, performs register renaming, and finally dispatches instructions to the reorder buffer (ROB) and to the issue queues of the appropriate execution unit. Then, instructions are executed as soon as their inputs are available (possibly out-of-order). Instructions are typically committed in program order to support precise exceptions, and the ROB is used to track instruction order.
Sampling at commit enables time-proportional attribution because this is where an instruction's execution becomes visible to software, and significantly is also where its latency impact on overall execution time becomes visible. Sampling at commit is a necessary but not sufficient condition for achieving time-proportional attribution because the profiler must also attribute time to the instruction that the processor spends time on. For example, the time spent resolving a mispredicted branch must be attributed to the branch and not some other instruction. We find that none of the naïve profiling approaches we considered do time-proportional attribution. In particular, Dispatch and Software do not sample at commit, while NCI and LCI misattribute time. Section 2.1 exemplifies why not sampling at commit is inaccurate, while Section 2.2 explains why the Oracle profiler (and hence the TIP profiler which is derived from it) does time-proportional attribution, and why NCI and LCI do not.
Dispatch sampling selects the instruction to be profiled at the dispatch stage and then tracks it through the processor back-end. While this provides interesting insight regarding how an individual instruction progresses through the pipeline, it is not time-proportional.
Software profiling is also not time-proportional due to a phenomenon referred to as skid. As with Dispatch, long-latency instructions lead to commit stalls that attract samples, but, unlike Dispatch, Software attributes time to instructions that are fetched around the time the sample is taken. The reason is that Software relies on interrupts. Upon an interrupt, the processor stores the application's current Program Counter (PC) and transfers control to the interrupt handler which then attributes the sample to the instruction address in the PC. Software hence tends to attribute latency to instructions that are even further away from the stalled instruction in the instruction stream than Dispatch.
In this section, we present Oracle, which is time-proportional by design, i.e. it attributes each clock cycle during program execution to the instruction(s) whose latency the processor exposed in that cycle. While NCI and LCI both sample at commit, they employ different instruction selection policies. More specifically, NCI samples the next-committing instruction, whereas LCI samples the last-committed instruction, and we will now explain why neither policy is time-proportional.
Oracle leverages the fundamental insight that the commit stage is in one of four possible states in each clock cycle, which we refer to herein as: Computing, Stalled, Flushed, or Drained. As shown in
We now explain Oracle in more detail for the four fundamental states and compare against NCI and LCI to explain in which cases NCI and LCI do or do not misattribute clock cycles.
In the Computing state, Oracle accounts 1/n of a cycle to each committed instruction, where n is the number of instructions committed in that cycle (i.e. n is between 1 and the processor's commit width).
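A minimal software sketch of this 1/n accounting follows (the helper and variable names are hypothetical and purely illustrative; the hardware does not execute such code, this only shows the arithmetic):

```python
from collections import defaultdict

def attribute_computing_cycle(cycle_counts, committed_addrs):
    """Attribute one Computing-state clock cycle across the n
    instructions that committed in that cycle (1 <= n <= commit width)."""
    n = len(committed_addrs)
    for addr in committed_addrs:
        cycle_counts[addr] += 1.0 / n  # each committed instruction gets 1/n of the cycle

# Example: a 4-wide core commits two instructions in one cycle,
# then a single instruction in the next cycle.
counts = defaultdict(float)
attribute_computing_cycle(counts, [0x1000, 0x1004])  # n = 2 -> 0.5 cycle each
attribute_computing_cycle(counts, [0x1000])          # n = 1 -> 1.0 cycle
```

This equal split is what makes the Computing state ILP-aware: a cycle in which several instructions commit is shared among them rather than assigned to one by a fixed policy.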
Not acknowledging instruction-level parallelism (ILP) within the commit stage renders the NCI and LCI profiles difficult to interpret. The key reason is that many applications execute similar instruction sequences over and over. Since NCI and LCI select instructions to sample with a fixed policy, they will be biased towards selecting certain instructions at the expense of others. It is hence difficult for developers to ascertain if a latency difference between instructions in straight-line code segments is due to a performance issue (e.g. some instructions stalling more than others) or attribution bias.
Pipeline flushes occur when the processor has speculatively fetched (and possibly executed) instructions that should not be committed.
While the above attribution policy is sufficient to handle other misspeculation cases such as load-store ordering (i.e. a younger load was executed before an older store to the same address), flushes due to exceptions are handled differently. More specifically, an exception fires when the excepting instruction reaches the head of the ROB which in turn results in the pipeline being flushed and control transferred to the OS exception handler. When the exception has been handled (e.g. the missing page has been installed in the page table), the excepting instruction is re-executed. Hence, Oracle attributes the cycles where the ROB is empty due to an exception to the instruction that caused the exception. Once the instructions of the exception handler are dispatched, the Oracle attributes cycles to these instructions (i.e. the Oracle does not differentiate between application and system code).
The ROB drains when the processor runs out of instructions to execute, for instance due to an instruction cache miss. This situation differs from pipeline flushes in that all instructions to be drained from the ROB are on the correct path and hence will be executed and committed.
We have so far discussed the four fundamental states of the commit stage as if they are independent states. However, instructions often accumulate cycles across multiple states. For example, within the example of
Serialized instructions require that (i) all prior instructions have fully executed before they are dispatched, and (ii) no other instructions are dispatched until they have committed. While the ROB drains, Oracle will account time to the preceding instructions according to the time they spend at the head of the ROB. When the last preceding instruction commits, the serialized instruction is dispatched and hence immediately becomes the oldest in-flight instruction. Oracle hence accounts time to this instruction as Stalled while it executes and as Computing in the cycle it commits. Once it has committed, the subsequent instruction is dispatched and Oracle will account it as Stalled while it executes.
Another example is a page miss on a load instruction. In this case, the load accesses the data TLB and L1 data cache in parallel. This results in a TLB miss which invokes the hardware page table walker. Eventually, the page table walker concludes that the requested page is not in memory which causes the exception bit to be set in the load's ROB-entry. If the load reaches the head of the ROB before the page table walk completes, the Oracle starts accounting time as stalled. When the page table walk completes, the load is marked as executed and the exception is triggered once it reaches the head of the ROB. The cycles from the exception to dispatching the first instruction in the OS exception handler are attributed to the load. Once the OS has handled the exception by installing the missing page in memory, the load is re-executed. The load will then incur more stall cycles as it waits at the ROB head for its page mapping to be installed in the TLB and its data to be fetched from memory.
We now build upon the cycle-level attribution insights of Oracle to describe a practical and accurate Time-Proportional Instruction Profiler (TIP). The profiler can be implemented as hardware circuitry associated with a processor, which may be largely of conventional design. The profiler circuitry and the processor may, in some embodiments, be integrated on a single semiconductor chip.
Some exemplary embodiments use a CPU 51 comprising a Berkeley Out-of-Order Machine (BOOM) core. This includes a ROB 54, containing b banks. Up to one instruction per bank can be committed in each clock cycle (i.e. b is the commit width). Instructions are allocated to banks in the order of the bank identifiers. The instruction in bank i is hence always older than the instruction in bank i+1 within a column, but the b oldest ROB-entries may be distributed across two columns, as shown in
When the ROB 54 is not empty, TIP 52 simply copies the addresses of the head ROB-entries into its address registers (see label “1” in
If the processor is neither committing nor stalling, the ROB is empty due to a flush or a drain. TIP's OIR Update unit 56 continuously tracks the last-committed and last-excepting instruction from the ROB 54, as shown in
If the ROB is empty, but not due to a flush, it must have drained (see
We have designed TIP to interface cleanly with Linux perf. When using hardware support for profiling, perf configures the PMU 53 to collect samples at a certain frequency (4 KHz is the default), and TIP issues an interrupt when the sampling procedure has completed and a valid sample has been written to the CSRs 58. This interrupt invokes perf's interrupt handler which simply copies the profiler's CSRs 58 into a memory buffer; the profile is written to non-volatile storage when the buffer is full. At the end of application execution, perf has written the raw samples to a file which then needs to be post-processed. To build the profile, the profiling software may use a data structure in which a zero-initialized counter is assigned to each unique instruction address in the profile. For each sample, it may then add 1/n of the value in the cycles register to each instruction's counter when the sample contains n instructions. It also tracks the total number of cycles to enable normalizing the profile.
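The post-processing step described above can be sketched as follows (a minimal illustration with hypothetical names; each raw sample is assumed to be a pair of the cycles register and the list of sampled instruction addresses, as copied from the CSRs by perf's interrupt handler):

```python
from collections import defaultdict

def build_profile(samples):
    """Post-process raw TIP samples into a normalized per-instruction profile.

    samples: iterable of (cycles, [instruction addresses]) pairs.
    """
    counters = defaultdict(float)  # zero-initialized counter per unique address
    total_cycles = 0.0
    for cycles, addrs in samples:
        n = len(addrs)
        for addr in addrs:
            counters[addr] += cycles / n  # 1/n of the cycles register per instruction
        total_cycles += cycles
    # Normalize so each address maps to its share of total execution time.
    return {addr: c / total_cycles for addr, c in counters.items()}, total_cycles

# Two samples at a 250-cycle interval; the first contains two instructions.
profile, total = build_profile([(250, [0x2000, 0x2004]), (250, [0x2000])])
```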
However, in other embodiments, some or all of these operations (e.g. copying the CSRs 58 into a memory buffer) could be implemented in hardware, by the TIP circuitry, rather than being implemented in software.
To help developers understand why some instructions take longer than others, the software portion of TIP may combine the information provided by the status flags with analysis of the application binary. It may label cycles where the application is committing one or more instructions as execution cycles, and cycles where the ROB has drained as front-end cycles. If the processor 51 is stalled, TIP may use the application binary to determine the instruction type and to indicate if the oldest instruction is an ALU-instruction, a load, or a store. Moreover, it may differentiate between flushes due to branch mispredicts and miscellaneous flushes based on TIP's status flags. (It may group the miscellaneous flushes as they typically only account for a small fraction of application execution time on average.)
While this categorization is suitable for some purposes, such as those described below, it will be appreciated that TIP can be adapted to support more fine-grained categories if necessary.
TIP is extremely lean as it can mostly rely on functionality that is already present in a conventional ROB or PMU. The storage overhead of TIP is the OIR register 55 (64-bit address and a 3-bit flag) and the CSRs 58 (i.e. cycle, flags, and b address CSRs); we merge all TIP flags into a single CSR 58. In an exemplary implementation, all CSRs are 64-bit since RISC-V's CSR instructions operate on the full architectural bit width, resulting in an overall storage overhead of fifty-seven bytes for our 4-wide BOOM core (nine bytes for the OIR and forty-eight bytes for the six CSRs). The logic complexity for collecting the samples is also minimal; the main overhead is two multiplexors, one to select the youngest ROB-entry in the OIR Update unit 56 and one to choose between the OIR 55 and the address in ROB-bank 0 in the Sample Selection unit 57. TIP's logic is not on the critical path of the BOOM core. If appropriate, the logic can be pipelined.
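As a worked check of the storage arithmetic above (sizes as stated in the text; variable names are illustrative):

```python
# Storage overhead of TIP on a 4-wide core (commit width b = 4).
# The 3-bit flag pushes the OIR register from 8 to a 9-byte storage cost.
oir_bytes = 9                # 64-bit address + 3-bit flag, rounded up to bytes
csr_bytes = (1 + 1 + 4) * 8  # cycle CSR + merged flags CSR + b address CSRs, 64 bits each
total_bytes = oir_bytes + csr_bytes  # overall overhead in bytes
```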
In some embodiments, TIP interrupts the core when a new sample is ready. Another possible approach is for TIP to write samples to a buffer in memory and then interrupt the core once the buffer is full. This requires more hardware support (i.e. inserting memory requests and managing the memory buffer), but reduces the number of interrupts. However, the interrupts become longer (as more data needs to be copied), so the total time spent copying samples is similar.
For each sample, perf reads the operating-system (OS) kernel structures to determine key metadata including core, process, and thread identifiers which account for forty bytes per sample in total. For our 4-wide BOOM core, the non-ILP-aware profilers (e.g. NCI) capture a single instruction address and the cycle counter (an additional sixteen bytes) whereas TIP captures four instruction addresses, the cycle counter, and the flags CSR (an additional forty-eight bytes). At perf's default 4 KHz sampling frequency, TIP hence generates data at 352 KB/s whereas the data rate of the non-ILP-aware profilers is 224 KB/s.
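The data rates above follow directly from the per-sample sizes (a worked check; the function and variable names are illustrative):

```python
def sample_data_rate(freq_hz, metadata_bytes, payload_bytes):
    """Bytes per second generated by a sampling profiler."""
    return freq_hz * (metadata_bytes + payload_bytes)

# perf metadata: 40 bytes/sample (core, process, and thread identifiers, etc.)
# Non-ILP-aware profiler (e.g. NCI): one 8-byte address + 8-byte cycle counter.
nci_rate = sample_data_rate(4000, 40, 16)  # -> 224,000 B/s = 224 KB/s
# TIP on a 4-wide core: four addresses, cycle counter, and flags CSR (6 x 8 bytes).
tip_rate = sample_data_rate(4000, 40, 48)  # -> 352,000 B/s = 352 KB/s
```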
To quantify the performance overhead of TIP, we compared PEBS' default sample size (i.e. fifty-six bytes per sample) to a configuration with TIP-sized samples on an Intel Core i7-4770. We mimicked TIP by including additional general-purpose registers from the PEBS record to reach TIP's eighty-eight byte sample size. We found that the increased data rate of TIP adds negligible overhead. More specifically, it increased application runtime by 1.1% compared to a configuration with profiling disabled; the performance overhead with PEBS' default sample size was 1.0%.
Although we have so far described TIP in the context of single-threaded applications, this is not a fundamental limitation. More specifically, perf adds the core, process, and thread identifiers to each sample; the core identifier maps to a logical core under Simultaneous Multithreading (SMT). Apart from this, TIP will attribute time to one or more instructions as in the single-threaded case. For example, if a physical core is committing instruction I1 on logical core C1 and instruction I2 on logical core C2 in the same cycle, TIP attributes half of the time to I1 and half to I2. Each physical core needs its own TIP unit.
We used the FireSim cycle-accurate FPGA-accelerated full-system simulator to evaluate the different performance profiling strategies. The simulated model used the BOOM 4-way superscalar out-of-order core, configured as in the table below, which ran a common buildroot 5.7.0 Linux kernel.
The BOOM core was synthesized to and run on the FPGAs in Amazon's EC2 F1 nodes. We accounted for the frequency difference between the FPGA-realization of the BOOM core and the FPGA's memory system using FireSim's token mechanism. We enabled the hardware profilers when the system boots and profiled until the system shuts down after the benchmark has terminated. However, we only included the samples that hit application code in our profiles, as (i) the time our benchmarks spend in OS code (e.g. syscalls) is limited (1.1% on average), and (ii) we do not want to include boot and shutdown time in the profiles.
We modified FireSim to trace out the instruction address and the valid, commit, exception, flush, and mispredicted flags of the head ROB-entry in each ROB bank every cycle; the trace includes the ROB's head and tail pointers which we need to model Dispatch. We feed this trace to a highly parallel framework on the CPU-side to enable on-the-fly processing with only minimal simulation slowdown. The profilers are hence modeled on the CPUs that operate in lock-step with the FPGA by processing the traces. This allows us to simulate and evaluate multiple profiler configurations out-of-band in a single simulation run; we run up to nineteen profiler configurations on eight CPUs per FPGA simulation run. For the results described herein, the total time spent on Amazon EC2 amounted to 5,459 FPGA hours and 30,778 CPU hours. We evaluated multiple profilers with a single simulation run because (i) it enables fairly comparing profilers as they sample in the exact same cycle, and (ii) it reduces the evaluation time (and cost) on Amazon EC2.
We ran twenty-seven SPEC CPU2017 and PARSEC 3.0 benchmarks that are compatible with our setup, the names of which are listed along the horizontal axis of
The benchmarks' execution characteristics are shown in
Practical sampling profilers incur inaccuracies compared to the (impracticable) exhaustive Oracle approach since they rely on statistical sampling and hence record a small percentage of instruction addresses, which are then attributed to symbols in the application binary; the symbols are individual instructions, basic blocks or functions, depending on profile granularity. There are two fundamental sources of error.
Unsystematic errors occur because sampling is random and the distribution of sampled symbols does not exactly match the distribution obtained with Oracle. Unsystematic errors can be reduced by increasing sampling rate, as we will quantify in the evaluation. Systematic errors, on the other hand, occur because the profiling strategy attributes samples to the wrong symbol. We focus on systematic error in the evaluation by quantifying to what extent the different profilers attribute samples to the correct symbol as determined by the Oracle. Because we sample the exact same cycle for all the practical profilers in a single simulation run, we can precisely quantify and compare a profiler's systematic error.
Each sample is taken as a representative for the entire time period since the last sample. By comparing the symbol the sample is attributed to by the practical profiler against the symbol identified by Oracle, we determined whether a sample is correctly or incorrectly attributed. By aggregating the cycles correctly attributed to symbols (i.e. ccorrect) and relating this to the total number of cycles it takes to execute the application (i.e. ctotal), we can compute the relative error e (i.e. e=(ctotal−ccorrect)/ctotal). Error is a lower-is-better metric varying between 100% and 0%, where 100% means that all samples were incorrectly attributed, while 0% means that the practical profiler attributes each sample to the same symbol as Oracle. Profile error can be computed at any granularity, i.e. instruction, basic block, or function level; incorrect attribution at lower granularity can be correct at higher granularity (e.g. misattributing a sample to an instruction within the function that contains the correct instruction). We aggregated errors across benchmarks using the arithmetic mean.
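The error metric can be sketched as follows (hypothetical function name; the example cycle counts are illustrative values, not measurements from the evaluation):

```python
def profile_error(correct_cycles, total_cycles):
    """Relative systematic error of a practical profiler against Oracle.

    correct_cycles: cycles whose samples the profiler attributed to the
    same symbol as Oracle; total_cycles: total application execution time.
    """
    return (total_cycles - correct_cycles) / total_cycles

# Each sample represents the whole interval since the previous sample,
# so a correctly attributed sample contributes its interval's cycles.
e = profile_error(correct_cycles=9_840_000, total_cycles=10_000_000)
# e = 0.0 means every sample agrees with Oracle; e = 1.0 means none do.
```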
We compared the following profilers:
We compared against Oracle which attributes every cycle to the symbol at the profiling granularity of interest, using the policy described in Section 2.2. As mentioned before, the error differences between the hardware profiling strategies (i.e. all profilers except Software) are due to systematic inaccuracies only as we sample in the exact same cycle. We assume periodic sampling at a typical sampling frequency of 4 KHz, unless mentioned otherwise. We explore the impact of periodic versus random sampling and the impact of sampling frequency in our sensitivity analyses.
Correctly attributing samples to functions does not necessarily mean that a performance analyst will be able to identify the most performance-critical basic blocks. We hence need to dive deeper and evaluate our profilers at the basic block level.
It is also interesting to note that the error is higher at the basic block level compared to the function level; and this is true for all profilers. The most striking example is Lbm: LCI's function-level error is merely 0.3% and then increases to 56.1% at the basic block level. The reason is that a single function accounts for 99.7% of Lbm's total runtime, which means that an incorrect attribution at the basic block level most likely still leads to a correct attribution at the function level. This reinforces our claim that fine-granularity profiles are critical as knowing that 99.7% of runtime is spent in a (non-trivial) function is too high-level to clearly identify optimization opportunities.
In order to effectively understand and mitigate bottlenecks, performance analysts need profiling information that is even more detailed than the basic block (and function) level, with performance stragglers identified at the instruction level.
There are two reasons why TIP is the most accurate profiler. First, we observe a significant decrease in profile error when comparing NCI versus TIP-ILP for the flush-intensive benchmarks (see
We performed various sensitivity analyses with respect to sampling rate, sampling method, and commit-ILP accounting. We focused on instruction-level profiling and considered the most accurate profilers only, namely TIP, TIP-ILP, and NCI.
The default sampling rate was set to 4 KHz. We focused on unsystematic error by evaluating how profiling error varies with sampling frequency from 100 Hz to 20 KHz, as shown in
The sampling method used so far assumes periodic sampling, i.e. we take a sample every 250 μs (sampling frequency of 4 KHz). Periodic sampling may lead to an unrepresentative profile if the sampling frequency aligns unfavorably with the application's time-varying execution behavior (cf. Shannon-Nyquist sampling theorem). Random sampling may alleviate this by selecting a random sample within each 250 μs sampling interval.
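The difference between the two sampling methods can be sketched as follows (hypothetical helper names; the fixed seed is only there to make the sketch reproducible):

```python
import random

def periodic_sample_times(period_us, num_intervals):
    """Periodic sampling: one sample at the start of each interval."""
    return [i * period_us for i in range(num_intervals)]

def random_sample_times(period_us, num_intervals, rng=random.Random(42)):
    """Random sampling: one sample at a uniformly random offset within
    each interval, which breaks any unfortunate alignment between the
    sampling frequency and periodic behavior in the application."""
    return [i * period_us + rng.uniform(0, period_us) for i in range(num_intervals)]

# 4 KHz sampling -> one sample per 250 us interval.
periodic = periodic_sample_times(250, 4)    # [0, 250, 500, 750]
randomized = random_sample_times(250, 4)    # one random point per interval
```

Both methods take the same number of samples per second, so the average sampling rate, and hence the data rate, is unchanged.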
TIP is more accurate than NCI because it correctly accounts for pipeline flushes and commit parallelism. Our results show that the biggest contribution comes from correctly attributing commit parallelism, i.e. compare the decrease in average instruction-level profile error from 9.3% (NCI) to 7.2% (TIP-ILP) due to correctly attributing pipeline flushing, versus the decrease in profile error from 7.2% (TIP-ILP) to 1.6% (TIP) due to attributing commit parallelism. We considered whether accounting for commit parallelism in NCI would yield a level of accuracy that is similar to TIP, and we hence made NCI commit parallelism-aware by simply attributing 1/n of the sample to the n next-committing instructions.
We used FireSim for our evaluation because the profilers considered in this work are platform-specific, and it is hence impossible to compare the different profilers without re-implementing them on a common platform. To evaluate our experimental setup, we conducted a validation experiment for the most accurate naïvely implemented profiler, namely NCI. Lacking an Oracle profiler on real hardware platforms, we had to compare the relative difference among existing profilers to gauge their accuracy. In particular, we compared Linux perf against PEBS on an Intel i7-4770 system, versus our implementations of the Software profiler and NCI in FireSim, respectively. Obviously, one cannot expect a perfect match because we are comparing across instruction-set architectures (x86-64 versus RISC-V) and hence across different benchmark binaries. Yet, we still verified that the relative difference (computed using our error metric) between the respective profilers indeed fell within the same broad ranges across our set of benchmarks, both at the instruction level and function level. At the instruction level, the difference between PEBS and perf on Intel amounted to 69% on average versus 57% on FireSim when comparing NCI versus Software. At the function level, the difference equalled 4% versus 7%, respectively.
We performed a case study on the SPEC CPU2017 benchmark Imagick to illustrate how TIP pinpoints the root cause of performance issues.
TIP, on the other hand, correctly reported that most of the time in ceil is spent on the frflags and fsflags instructions, and the purpose of these instructions is to mask any changes to the floating-point status register that may occur within the function from the calling code. These instructions are hence necessary if the calling code relies on ceil being side-effect free. Interestingly, Imagick never reads the floating-point status register which means that the masking performed within ceil is unnecessary. Moreover, the floor function suffers from exactly the same problem. We optimized Imagick's binary code by replacing frflags and fsflags in ceil and floor with nop instructions to remove the unnecessary status register operations, hence creating a new, optimized version.
Interestingly, the speedup is (much) higher than expected based on the fraction of time spent executing the frflags and fsflags instructions (see
TIP can help developers understand how time is distributed across instructions, by precisely attributing time to individual instructions. It can potentially also help support additional performance analysis such as vertical profiling (which combines hardware performance counters with software instrumentation to profile an application across deep software stacks); call-context profiling (which efficiently identifies the common orders functions are called in); and causal profiling (which is able to identify the criticality of program segments in parallel codes by artificially slowing down segments and measuring their impact).
TIP can be straightforwardly implemented by integrating some additional circuitry with an out-of-order core. It is therefore useful for practical implementation, in contrast to purely simulation- and modelling-based approaches such as FirePerf (which uses FireSim to non-intrusively gather extensive performance statistics), which generate too much data to be practicably implementable outside a simulator.
We have presented our Oracle profiler, as a novel golden reference for performance profiling, and used it to show that naïve profiler approaches fall short because they are not time-proportional, as they lack support for instruction-level parallelism, and systematically misattribute instruction latencies. We have described a Time-Proportional Instruction Profiler (TIP) which combines the attribution policies of Oracle with statistical sampling to enable practical implementation. Experimental data shows TIP is highly accurate (e.g. having an average instruction-level error of 1.6%), and can be used to improve software, as evidenced by having been used to identify a performance issue in the SPEC CPU2017 benchmark Imagick that, once addressed, yielded a 1.93-factor speed-up.
It will be appreciated by those skilled in the art that the invention has been illustrated by describing various specific embodiments thereof, but is not limited to these embodiments; many variations and modifications are possible, within the scope of the accompanying claims.
Number | Date | Country | Kind
2113678.3 | Sep 2021 | GB | national

Filing Document | Filing Date | Country | Kind
PCT/EP2022/076574 | 9/23/2022 | WO