The present disclosure pertains to apparatus and methods for enhancing processor performance, in particular, to apparatus and methods for reliably identifying an instruction pointer (IP) pointing to the actual instruction that causes hardware events such as data cache misses in the context of statistical sampling such as precise event based sampling (PEBS).
Computer systems may include one or more processors, each of which may further include one or more cores that execute instructions through instruction pipelines. To achieve high performance of instruction execution through instruction pipelines, a processor may include digital circuits that execute instructions in anticipation of the occurrence of certain conditions. For example, a branch predictor is a digital circuit that is commonly used to predict which way a branch in code (e.g., an if-then-else structure) may proceed before it is known for sure. If one branch is predicted as more likely to occur, a core of the processor may fetch and execute instructions for that branch before the branching condition actually occurs. The results of these speculatively executed instructions may be stored in a storage device such as cache memory. Later, if the branching condition indeed occurs, the pre-fetched and executed instructions may be retired, and the stored results may be used. However, if the branching condition does not occur, the stored results are discarded, and the instruction pipeline starts over with the correct branch, incurring a delay penalty.
The time wasted for branch misprediction may correspond to the number of stages that have been pre-fetched and executed. Since instruction pipelines in modern processors may include a significant number of stages, the time wasted for branch misprediction may include many clock cycles. Since branch misprediction, when it occurs frequently, may cause significant bottlenecks (or hotspots) to the performance of the processor, it is advantageous to monitor where the mispredictions occur and when they occur so that a user of the processor may debug and optimize the software performance accordingly.
To this end, a processor may currently be configured with a performance monitor unit (PMU) that monitors and records such mispredictions. The PMU may operate at the micro-architecture level and monitor for hardware events pertaining to processor stalls, branch prediction, data/code alignment, and “glass jaws” (i.e., potentially fatal defects). The collected information may be made available to the user through an operating system or an application for debugging and optimizing software performance. The user may need, from the PMU, information about where the hotspots are, that is, where the most inefficient spots are or where the processor spends the most time doing the least amount of work.
The identification of hotspots may be achieved by the PMU profiling the time spent and the work carried out. One profiling mechanism, called Precise Event Based Sampling (PEBS), ties hardware events to the source code or to the instruction pointer (IP) of the instruction that causes them. Current art requires extensive searches in post-processing of PEBS record fields to reconstruct the IP of the instruction that triggers a hardware event and the type of the event. Unfortunately, this post-processing is not reliable.
There is a need to more accurately capture the IP of the instruction for the hardware event that the processor is configured to statistically sample so as to improve the effectiveness of PEBS at pinpointing hotspots or areas of contention.
Although branch misprediction is one of the events that embodiments of the present invention may address, embodiments of the present invention are not limited to branch misprediction events. Embodiments of the present invention may be similarly applicable to other types of hardware events.
Embodiments of the present invention may include a performance monitor unit (PMU) that is configured to capture the actual IP of the instruction that causes the PEBS event (also called the “eventing IP”) rather than the IP after the eventing IP. Embodiments of the present invention may further include a storage having stored thereon a PEBS record that may include a first field for storing the eventing IP and a second field for storing a data address at which a load and/or store operation associated with the instruction accesses data. Compared to the current approach, the present invention has the advantage of eliminating the need to reconstruct the eventing IP later, thus improving the success rate of data sampling.
Embodiments of the present invention may include a processor with one or more cores that each includes an execution engine unit for executing instructions, a controller, and a storage having stored thereon a statistical sampling record, in which, in response to occurrence of a hardware event caused by executing an instruction, the controller may be configured to: (1) determine an instruction pointer (IP) pointing to the instruction that actually caused the hardware event; and (2) write the IP as an Eventing IP in a field of the statistical sampling record. The controller may be further configured to determine a data address at which a load/store operation associated with the instruction accesses data, and write the data address to a data address field of the statistical sampling record.
Embodiments of the present invention may include a controller embedded in a core of a processor that includes an execution engine unit for executing instructions. The controller may access a storage device having stored thereon a statistical sampling record that includes a field for, in response to occurrence of a hardware event caused by executing an instruction, storing, as an Eventing IP, an instruction pointer (IP) pointing to the instruction that actually caused the hardware event.
Embodiments of the present invention may include a method for managing a statistical sampling record of a processor. The method may include, in response to occurrence of a hardware event caused by executing an instruction, determining, as an Eventing IP, an instruction pointer (IP) pointing to the instruction that actually caused the hardware event, and writing the Eventing IP in a field of the statistical sampling record.
The shared resources 104 may include a memory storage (such as cache memory or registers) that is directly accessible by the execution engine unit 114 and the PMU 102, and also by applications. In one embodiment, the shared resources 104 may be dedicated to the core 100. In another embodiment, the shared resources 104 may be shared by a number of cores within the many-core processor. The shared resources may be further partitioned into a number of segments including general purpose counters 106, instruction pointers 108, and debug store 112. The shared resources may also include performance monitoring interrupt (PMI) 110 signals. Each of the general purpose counters 106 may be used to measure a specific hardware event. In one embodiment, the general purpose counters 106 may correspond to fifty or more hardware events including “precise branch instruction retired by type” and “mispredicted near retired calls.” The instruction pointer 108 (also known as the program counter) may include registers that indicate the instruction address, i.e., where the execution engine unit 114 is in the execution of the instruction sequence. The PMI 110 may provide interrupts, at the user's request, in response to a counter overflow. Thus, the user may elect either to execute a program to store a statistical sampling record, such as a PEBS record, or to generate a PMI 110 to halt the processor in the event of the counter overflow. The debug store 112 may store data relating to program debug. A portion of the debug store 112 may be configured to store information relating to the statistical sampling record, such as a PEBS record.
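By way of illustration only, the counter-overflow behavior described above may be modeled in software as follows. This is a minimal sketch and not a description of the actual hardware: the class name `SamplingCounter`, the function `sample_events`, the 48-bit counter width, and the action labels are all hypothetical. The counter is preloaded so that it overflows after a chosen number of events, and each overflow either produces a sampling record or a PMI, according to the user's election.

```python
class SamplingCounter:
    """Hypothetical software model of one general purpose counter used for sampling."""

    def __init__(self, sample_period, width_bits=48):
        # width_bits is an assumed counter width, chosen only for illustration.
        self.limit = 1 << width_bits
        # Preload the counter so that it overflows after sample_period events.
        self.reset_value = self.limit - sample_period
        self.value = self.reset_value

    def count_event(self):
        """Count one hardware event; return True if the counter overflowed."""
        self.value += 1
        if self.value >= self.limit:
            self.value = self.reset_value  # re-arm for the next sample
            return True
        return False


def sample_events(event_ips, sample_period, use_pmi=False):
    """Return the action taken at each overflow, paired with the IP seen then."""
    counter = SamplingCounter(sample_period)
    actions = []
    for ip in event_ips:
        if counter.count_event():
            # The user elects either a sampling record or an interrupt.
            actions.append(("PMI" if use_pmi else "PEBS_RECORD", ip))
    return actions
```

In this sketch, a smaller `sample_period` yields more samples at the cost of more overhead, which is the usual trade-off in statistical sampling.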
The PEBS buffer management area 204 may be configured to store a field 214 for storing the PEBS buffer base, a field 216 for storing the PEBS buffer index, a field 218 for storing the PEBS absolute maximum address, a field 220 for storing the PEBS interrupt threshold value, a field 222 for storing the PEBS counter reset, and a field 224 as reserved. The PEBS buffer base may be directed at the address of the first byte of PEBS buffers 226, which may be part of the DS area and include a plurality of PEBS records 328.1, 328.2, . . . , 328.n. The PEBS index, which may be referenced by a last branch record register, may be directed at the address of the first byte of the next PEBS record to be written to. The PEBS index may be initialized at the PEBS buffer base. The PEBS absolute maximum address may be directed at the next byte past the end of the PEBS buffer. The PEBS interrupt threshold may be used to generate a PEBS interrupt. The PEBS interrupt threshold may point to an offset that is a multiple of the PEBS record size from the PEBS buffer base and may be several records short of the PEBS absolute maximum. The PEBS counter resets may include full width counter values to which PEBS counters are reset after architectural state information about the core has been sampled following a PEBS counter overflow caused by a hardware event. In one embodiment, multiple PEBS records may be stored in the PEBS buffer area. The PEBS interrupt may be used to halt the processor when the PEBS buffer is about to fill up so that new records are not dropped. In another embodiment, a PEBS interrupt may be generated after writing each PEBS record so that the written PEBS record may be read out.
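The relationship among the buffer base, index, absolute maximum, and interrupt threshold may be easier to follow in a small software model. The following is a sketch under stated assumptions rather than the hardware layout: the class `PebsBufferManagement`, the 64-byte `RECORD_SIZE`, the dictionary standing in for memory, and the rule that an interrupt is signaled once the index reaches the threshold are all hypothetical choices made for illustration.

```python
RECORD_SIZE = 64  # assumed record size in bytes, for illustration only


class PebsBufferManagement:
    """Hypothetical model of the buffer management fields described above."""

    def __init__(self, buffer_base, num_records, records_before_end):
        self.buffer_base = buffer_base
        # Absolute maximum: the next byte past the end of the buffer.
        self.absolute_maximum = buffer_base + num_records * RECORD_SIZE
        # Threshold: a multiple of the record size from the base, placed
        # a few records short of the absolute maximum.
        self.interrupt_threshold = (
            self.absolute_maximum - records_before_end * RECORD_SIZE
        )
        # The index starts at the base and names the next record to write.
        self.index = buffer_base

    def write_record(self, memory, record):
        """Store one record; return True if an interrupt should be signaled."""
        if self.index + RECORD_SIZE > self.absolute_maximum:
            raise BufferError("PEBS buffer is full; record would be dropped")
        memory[self.index] = record
        self.index += RECORD_SIZE
        # Assumed policy: signal once the index reaches the threshold.
        return self.index >= self.interrupt_threshold
```

Setting `records_before_end` equal to `num_records - 1` in this model makes every write signal an interrupt, which corresponds to the embodiment in which each record is read out as soon as it is written.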
Embodiments of the present invention may include a plurality of PEBS records stored in the PEBS buffers 226 and managed through fields contained in the PEBS buffer management area 204. Each of the PEBS records may correspond to a specific hardware event.
Additionally, embodiments of the present invention may offer the capability of obtaining a data address to profile the data memory address referenced by the instruction or ucode that caused the hardware event. In one embodiment, the PEBS data record 300 may include a register 304 for storing the direct data address, which may provide additional information about the sampled instruction and help programmers improve data structure layout and memory page handling, eliminate remote node references, and identify cache-line condition conflicts. Instructions that have load or store operations may access memory at a particular address. Provision of this address in the PEBS record may allow a user to determine which instructions (identified by the Eventing IP field) are accessing a particular line of the memory. Thus, the PMU may monitor time-consuming cache miss events, such as last level cache misses, and determine which particular data address appears in the PEBS records repeatedly. Based on this information, a user may determine that there is contention for that address and rework the program to resolve the contention.
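As a hedged illustration of how a user might post-process such records to look for contention, the sketch below groups sampled data addresses by cache line and lists the Eventing IPs that touched each frequently sampled line. The function name `hot_cache_lines`, the representation of a record as an `(eventing_ip, data_address)` pair, the 64-byte line size, and the `min_samples` cutoff are assumptions made for this example only.

```python
from collections import Counter, defaultdict

CACHE_LINE_BYTES = 64  # assumed cache line size, for illustration only


def hot_cache_lines(records, min_samples=2):
    """Group sampled data addresses by cache line.

    records is an iterable of (eventing_ip, data_address) pairs. Returns a
    list of (line_address, sample_count, sorted_eventing_ips) tuples for
    lines sampled at least min_samples times, most frequent first.
    """
    line_counts = Counter()
    line_ips = defaultdict(set)
    for eventing_ip, data_address in records:
        # Round the address down to the start of its cache line.
        line = (data_address // CACHE_LINE_BYTES) * CACHE_LINE_BYTES
        line_counts[line] += 1
        line_ips[line].add(eventing_ip)
    return [
        (line, count, sorted(line_ips[line]))
        for line, count in line_counts.most_common()
        if count >= min_samples
    ]
```

A line that appears repeatedly and is reached from several distinct Eventing IPs is a candidate for the kind of contention described above, although the sampled counts alone do not prove it.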
In one embodiment, the controller 116 as shown in
Thus, the controller 116 may be configured to first determine whether a hardware event has occurred based on a flag indicating event overflow. The event may include a branch misprediction, although the type of event is not limited to branch misprediction. If it is determined that a hardware event has occurred, embodiments of the present invention may further include steps to determine the Eventing IP (i.e., the IP of the instruction that actually caused the event) and the data address that is associated with the Eventing IP. To determine the Eventing IP, the controller 116 may first determine whether a macro branch has already occurred (or been “taken”) based on the branch prediction. If the macro branch has not been taken, the controller 116 may further determine whether the instruction caused a fault or a special condition that may require further cleanup. If the instruction caused a fault (the IP does not move), the Eventing IP may be assigned the faulting IP of the instruction that caused the fault (Fault_IP). However, if the instruction did not cause a fault and executed successfully (the IP moved), the Eventing IP is assigned the value of the next IP minus the IP Delta, so that the Eventing IP points at the current instruction just executed (or retired). Alternatively, if the macro branch has been taken based on the branch prediction, the Eventing IP of the current instruction may not be related to the next IP. In this case, the controller 116 may be configured to read from a “From IP” register, which indicates the address from which the macro branch occurred. The Eventing IP may be assigned the “From IP” address, i.e., the address prior to the macro branch. After the Eventing IP is determined, the controller 116 may be configured to write the determined Eventing IP to the Eventing IP field of the PEBS record for the event so that the Eventing IP may be accessible to a user for debugging and optimizing applications.
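The decision sequence described above for selecting the Eventing IP may be summarized by the following sketch. It is an illustrative software rendering only: the function name `determine_eventing_ip` and its parameter names are hypothetical, and `ip_delta` is assumed to be the byte distance between the retired instruction and the next IP.

```python
def determine_eventing_ip(
    macro_branch_taken, faulted, fault_ip, next_ip, ip_delta, from_ip
):
    """Select the IP of the instruction that actually caused the event."""
    if macro_branch_taken:
        # After a taken macro branch the next IP is unrelated to the
        # instruction that just executed, so use the branch source address.
        return from_ip
    if faulted:
        # The IP did not move; the faulting instruction is the eventing one.
        return fault_ip
    # The IP moved past the retired instruction; step back by the delta
    # (assumed here to be the distance back to that instruction).
    return next_ip - ip_delta
```

For instance, under these assumptions an instruction retired at address 0x1000 with a next IP of 0x1004 and a delta of 4 would be reported as 0x1000, whereas a taken branch from 0x2000 to some distant target would be reported as 0x2000 regardless of the next IP.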
The controller 116 may be further configured to retrieve the data address in the event that the current instruction invoked load or store operations. To this end, if the current instruction invokes load or store operations, the controller 116 may be configured to retrieve the data addresses at which the load or store operations access data. Further, the controller 116 may be configured to write the data address to the Data Address field of the PEBS record to make it available for the user to debug and optimize programs.
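A minimal sketch of populating the two record fields might look as follows. The function `build_pebs_record`, the dictionary representation of the record, and the choice to record only the first accessed address when an instruction makes several accesses are assumptions for illustration; the description above does not specify how multiple addresses would be handled.

```python
def build_pebs_record(eventing_ip, memory_access_addresses):
    """Build a record holding the Eventing IP and, if present, a data address.

    memory_access_addresses lists the addresses touched by the load or store
    operations of the instruction; it is empty for instructions that do not
    access memory.
    """
    record = {"eventing_ip": eventing_ip, "data_address": None}
    if memory_access_addresses:
        # Assumed policy: keep only the first accessed address.
        record["data_address"] = memory_access_addresses[0]
    return record
```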
Embodiments may be implemented in many different system types. Referring now to
Still referring to
Furthermore, chipset 590 includes an interface 592 to couple chipset 590 with a high performance graphics engine 538, by a P-P interconnect 539. In turn, chipset 590 may be coupled to a first bus 516 via an interface 596. As shown in
Note that while shown in the embodiment of
Embodiments may be implemented in code and may be stored on a storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, solid state drives (SSDs), compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of the present invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US11/67822 | 12/29/2011 | WO | 00 | 6/27/2013 |