1. Field of the Invention
Embodiments relate to software techniques for optimizing the performance of a computing platform.
2. Background
A performance analyzer is a tool for profiling, the process of generating a statistical analysis that measures resource usage during the execution of a program. The results of profiling enable the user to optimize the portions of the program where the most CPU cycles are consumed. The program may be a user application or a system program such as an operating system (OS) program. One example of a performance analyzer is the Intel VTune®, a product of Intel Corporation located in Santa Clara, Calif.
One important step in profiling is to identify those functions and subroutines that consume significant numbers of CPU cycles. A performance analyzer typically reveals the “hot” code paths, that is, the sets of functions and subroutines most actively invoked. In a large application, the time spent by a compiler searching for optimization opportunities may grow exponentially with the number of modules it is asked to consider. Thus, optimization efficiency improves if the user can identify the most critical modules and functions in their application. Optimization techniques may be applied to these identified modules and functions to achieve better data prefetching, parallelization, and reordering of instructions. The optimization may reduce the number of stalled cycles and increase the program execution speed.
A conventional performance analyzer is processor event-driven. That is, the analyzer collects information only when a processor event occurs. A processor event refers to an event generated by the central processing unit (CPU) that causes an interruption of instruction execution of the processor. Processor events (or equivalently, CPU events) include a cache miss, a branch misprediction, and any event that causes a stalled cycle in the execution pipeline. However, a user is currently unable to consider events generated by platform components that share the same platform with the CPU. These platform component events may be correlated with instruction execution and may provide useful information for performance optimization.
Embodiments are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean at least one.
Main memory 13 may include a system area for storing system level instructions and data (e.g., operating system (OS) and system configuration data) which are not normally accessible by a user. Main memory 13 may also include a user area for storing user programs and applications (e.g., application 131). Although shown as one memory component, main memory 13 may comprise a plurality of memory devices including read-only memory (ROM), random access memory (RAM), flash memory, and any machine-readable medium.
In one embodiment, a performance analyzer 135 is stored in the system area of main memory 13. Performance analyzer 135 allows a user of platform 10 to monitor instruction execution by CPU 11 when a pre-determined event occurs. An event may be a processor event generated by CPU 11. For example, a processor event may be a cache miss when an instruction or data to be used by CPU 11 is not found in cache 116. A processor event may be a branch misprediction when a conditional statement predicted to be true does not actually become true. An event may alternatively be a virtual event generated by any one of the platform components. For example, a virtual event may be a V_sync generated by GPU 12 at the end of displaying a frame, or a bus throughput generated by Ethernet 16 each time a predetermined number of packets are delivered. A virtual event may be an event triggered by a signal generated by a platform component (e.g., V_sync) or an event defined by a user (e.g., number of packets delivered). Performance analyzer 135 provides a user interface for a user to select one or more of the processor events, and to define and select one or more of the virtual events to be monitored, recorded, analyzed, and reported.
When an event occurs, the event triggers an interrupt in CPU 11. The instruction currently executed by CPU 11 is temporarily suspended. The suspended instruction is referred to as the “interrupted instruction.” The CPU 11 may consult an interrupt vector table 138 to locate an interrupt service routine (ISR) for handling the interrupt. Interrupt vector table 138 may reside in the system area of main memory 13. The base address of interrupt vector table 138 may be stored in an internal register of CPU 11 to be readily accessible by the CPU at all times. Interrupt vector table 138 stores a plurality of interrupt vectors, each of which serves as an identifier to an ISR. The ISR saves the status of the interrupted CPU 11 and performs pre-defined operations to service the interrupt. Each ISR may service one or more processor events or virtual events. For example, virtual events generated by the same platform components may have the same interrupt vector and be serviced by the same ISR.
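The lookup-and-dispatch path described in the preceding paragraph can be sketched as a small simulation. This is only an illustration under stated assumptions, not the embodiment's implementation: the names `InterruptVectorTable`, `register`, `dispatch`, and `gpu_isr`, the vector number 11, and the instruction address are all hypothetical, and a Python dictionary stands in for the table held in main memory.

```python
from typing import Callable, Dict, List, Tuple

# An ISR receives (interrupt vector, address of the interrupted instruction).
Isr = Callable[[int, int], None]


class InterruptVectorTable:
    """Hypothetical stand-in for interrupt vector table 138."""

    def __init__(self) -> None:
        self._isrs: Dict[int, Isr] = {}

    def register(self, vector: int, isr: Isr) -> None:
        # Each vector identifies exactly one ISR; one ISR may serve many events.
        self._isrs[vector] = isr

    def dispatch(self, vector: int, interrupted_ip: int) -> bool:
        """Run the ISR for `vector`; return False if no ISR is registered."""
        isr = self._isrs.get(vector)
        if isr is None:
            return False
        isr(vector, interrupted_ip)
        return True


# Records of the interrupted state, as an ISR might save them.
saved_contexts: List[Tuple[int, int]] = []


def gpu_isr(vector: int, interrupted_ip: int) -> None:
    # Assumed behavior: save the interrupted state, then service the event.
    saved_contexts.append((vector, interrupted_ip))


table = InterruptVectorTable()
table.register(11, gpu_isr)           # vector number chosen for illustration
handled = table.dispatch(11, 0x4010)  # address is an arbitrary example
```

Because the table is keyed by vector rather than by event, several virtual events from the same platform component can share one entry and therefore one ISR, as the paragraph notes.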
Referring to
In one embodiment, performance analyzer 135 includes a Virtual Event Provider Manager (VEPM) 24 and a plurality of Virtual Event Provider Drivers (VEPDs) 25, both implemented as software stored in the system area of main memory 13. Each of the platform components may be associated with one VEPD 25. VEPD 25 supplies a definition for every virtual event supported by the associated platform component. A definition of a virtual event may include an event name, a description, and an interrupt vector that will be generated by the VEPD 25 when the virtual event occurs. For example, a graphics display device driver (i.e., the VEPD 25 of GPU 12) may store a definition (event_name: V_Sync, description: vertical sync signals occurring during a frame display, interrupt vector: PCI_Interrupt#11) for V_sync events. Each VEPD 25 may also supply a local index, also known as an event_id, for each of its supported virtual events. The local index may be an integer that uniquely identifies a virtual event within a VEPD 25.
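The event definition and local index described above map naturally onto a small record type. The following is a minimal sketch, assuming a VEPD can be modeled as an in-memory registry; the class and method names (`VirtualEventDefinition`, `VirtualEventProviderDriver`, `add_event`, `get_event`) and the choice to number local indices from zero are assumptions made for illustration, while the field values are taken from the V_Sync example in the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class VirtualEventDefinition:
    """One virtual event definition: name, description, interrupt vector."""
    event_name: str
    description: str
    interrupt_vector: str


class VirtualEventProviderDriver:
    """Hypothetical model of a VEPD holding definitions keyed by local index."""

    def __init__(self, component: str) -> None:
        self.component = component
        self._events: Dict[int, VirtualEventDefinition] = {}

    def add_event(self, definition: VirtualEventDefinition) -> int:
        # Assumption: the local index (event_id) is assigned sequentially from 0.
        event_id = len(self._events)
        self._events[event_id] = definition
        return event_id

    def get_event(self, event_id: int) -> VirtualEventDefinition:
        return self._events[event_id]


gpu_vepd = VirtualEventProviderDriver("GPU 12")
v_sync_id = gpu_vepd.add_event(
    VirtualEventDefinition(
        event_name="V_Sync",
        description="vertical sync signals occurring during a frame display",
        interrupt_vector="PCI_Interrupt#11",
    )
)
```

The index is unique only within one driver, so a manager that talks to several drivers would need to pair it with the driver's identity.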
VEPM 24 also interfaces with a user who may select the virtual events to be analyzed. At 330, VEPM 24 populates a user interface with all of the supported virtual events. These virtual events may include user-defined events as well as hardware events generated by platform components 35. These virtual events may be presented alongside processor events for user selection. At 340, the user selects one or more virtual events to be analyzed by performance analyzer 135. One or more of these virtual events may be pre-defined by the user. At the same time, the user may also select one or more processor events to be analyzed by performance analyzer 135.
The user may also specify configurable items of the virtual events through the user interface. For example, sampling parameters may be specified by the user. As sampling buffers 26 may not have enough space to store information for every occurrence of a selected virtual event, only a fraction of the occurrences are sampled and stored. The user may specify a sampling period during which performance analyzer 135 will run and a sampling rate to define how often an occurrence of a virtual event will be stored. At 350, VEPM 24 configures each VEPD 25 with these user-specified configuration values. For example, the user may specify an “after_value” which defines the rate of sampling. An “after_value” of 10 means one virtual event is sampled out of every ten occurrences of the same virtual event. Thus, an “after_value” of 10 corresponds to a sampling rate of 0.1. After the user specifies the after_value for a virtual event, VEPM 24 configures the VEPD 25 associated with the platform component 35 generating the virtual event with the command VEPD::setEventAfter value(event_id, after_value). In one embodiment, the event_id in the command may be the local index of the virtual event supported by the VEPD 25 that receives the command. After receiving the command, at 360, VEPD 25 configures the associated platform component 35 with the specified configuration value. In this way, VEPM 24 and VEPDs 25 provide a mechanism for forwarding configuration values to platform components 35, allowing a user to configure these platform components.
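The effect of an after_value can be illustrated with a simple occurrence counter. This is a sketch under assumptions rather than the described embodiment: the class name `AfterValueSampler` and method `on_event` are hypothetical, and the choice to store the last occurrence of each group (rather than the first) is an assumption the text does not settle.

```python
class AfterValueSampler:
    """Stores one occurrence out of every `after_value` occurrences."""

    def __init__(self, after_value: int) -> None:
        if after_value < 1:
            raise ValueError("after_value must be at least 1")
        self.after_value = after_value
        self._count = 0

    def on_event(self) -> bool:
        """Return True when this occurrence should be stored in the buffer."""
        self._count += 1
        if self._count == self.after_value:
            # Assumption: the last occurrence of each group is the one sampled.
            self._count = 0
            return True
        return False


sampler = AfterValueSampler(after_value=10)
occurrences = 100
stored = sum(sampler.on_event() for _ in range(occurrences))
sampling_rate = stored / occurrences  # 10 stored out of 100 gives 0.1
```

With an after_value of 10, the 100 simulated occurrences yield 10 stored samples, matching the sampling rate of 0.1 stated above.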
At 370, VEPM 24 stores the interrupt vectors of the selected virtual events into interrupt vector table 138 (
At block 440, the virtual event interrupt signals CPU 11 with an interrupt vector, which can be located in interrupt vector table 138 of
In one embodiment, the analysis reported to a user may include the percentage of occurrences of a particular event in each subroutine of application 131. For example, if V_sync is the selected virtual event and application 131 includes subroutines sub_a, sub_b, and sub_c, the report may show that the percentages of the V_sync occurrences in sub_a, sub_b, and sub_c are 97%, 2%, and 1%, respectively. Thus, the user may recognize that sub_a is a hotspot with respect to V_sync. The user may obtain more detailed information correlating the instructions of sub_a with V_sync by selecting sub_a (e.g., a sub_a icon) on the user interface. If sub_a further includes subroutines sub_a1, sub_a2, and sub_a3, the report may show that the percentages of the V_sync occurrences in sub_a1, sub_a2, and sub_a3 are 5%, 90%, and 5%, respectively. The user may continue this process down the subroutine hierarchy until the bottom of the hierarchy is reached.
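The drill-down report described above amounts to grouping sampled events by call path and normalizing the counts. The following is a minimal sketch, assuming each sample is recorded as a tuple of subroutine names from the top of the hierarchy downward; the function name `event_percentages` and the sample data are hypothetical, with the counts constructed purely so that the output reproduces the percentages quoted in the text.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

CallPath = Tuple[str, ...]


def event_percentages(
    samples: Sequence[CallPath], parent: CallPath = ()
) -> Dict[str, float]:
    """Percentage of sampled events per subroutine directly below `parent`."""
    depth = len(parent)
    counts = Counter(
        path[depth]
        for path in samples
        if len(path) > depth and path[:depth] == parent
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * count / total for name, count in counts.items()}


# Synthetic V_sync samples, invented to match the example percentages.
samples = (
    [("sub_a", "sub_a1")] * 97
    + [("sub_a", "sub_a2")] * 1746
    + [("sub_a", "sub_a3")] * 97
    + [("sub_b",)] * 40
    + [("sub_c",)] * 20
)

top_level = event_percentages(samples)               # sub_a, sub_b, sub_c
inside_sub_a = event_percentages(samples, ("sub_a",))  # sub_a1, sub_a2, sub_a3
```

Calling the function again with a longer `parent` path corresponds to the user selecting a subroutine to descend one more level in the hierarchy.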
With the wealth of information revealed by performance analyzer 135, the user is better equipped with knowledge to fine-tune the performance of the program. The user may be able to recognize a correlation between the program instructions and the occurrences of events generated by any platform components. The user may recognize hotspots in the program and realize why cycles are being spent there. The exact cause of inefficiency may also be identified.
In the foregoing specification, specific embodiments have been described. It will, however, be evident that various modifications and changes can be made thereto without departing from the broader spirit and scope of the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CN05/02416 | 12/30/2005 | WO | 00 | 4/27/2006 |