1. Field of the Invention
The present invention is related to computer systems in which performance is measured using hardware measurement circuits, and in particular to techniques for maintaining performance monitoring measurements across program execution cycles.
2. Description of Related Art
In computer systems, performance can be improved by monitoring the computer system while it executes various programs. For example, the number of instructions executed or the total time elapsed while performing a task is a benchmark indication of the efficiency of the computer system at performing the task. By observing characteristics of program execution, in particular characteristics of “hot spots”, i.e., portions of a program that are executed most frequently, the program can be optimized, either off-line or on-the-fly, using the results of the performance measurements.
However, when a task is off-loaded, i.e., when the present execution of a program is terminated to be resumed at a later time and the program is unloaded from memory, the state of the performance monitoring hardware is typically lost, making it difficult to monitor the performance of tasks that are executed intermittently. In some cases, the performance monitoring state may not be accessible, so that the state cannot be stored when the task is off-loaded and retrieved when the task is reloaded.
A particular performance monitoring technique, as disclosed in U.S. patent application Ser. No. 12/828,697 filed on Jul. 1, 2010 entitled “HARDWARE ASSIST FOR OPTIMIZING CODE DURING PROCESSING”, having common inventors with the present U.S. patent application, and which is incorporated herein by reference, identifies execution paths, i.e., sequences of program instructions, in which all of the branch instructions resolve to particular directions, so that the most frequently executed paths, corresponding to the hot spots described above, are given the most effort and resources for program optimization. Rather than collecting the entire state of the branch history for each execution path in order to identify which path is currently being taken by a program, a simplified technique uses branch prediction data to assume a particular execution path is taken if all predictions are correct. Branch prediction state information is also typically not retained when a task is off-loaded, and may not be accessible for storage and retrieval.
The invention is embodied in a method, a computer system, a processor core, and a computer program product, in which performance monitoring information is not retained when a task is off-loaded. Instead, when a task is loaded for execution, performance monitoring analysis is postponed until sufficient performance monitoring has been performed to ensure accuracy of the results.
The performance monitoring output or analysis may be delayed for a predetermined time period or a predetermined number of instruction cycles, and the delay may be triggered by a computer program such as a hypervisor, which indicates that the task has been loaded and that the delay should be started. After the delay has expired, the performance monitoring results may be analyzed.
The performance monitoring may be a program execution branch analysis that determines frequently executed execution paths by using successful branch predictions to provide an indication that a particular execution path is being taken. The application of the technique may be postponed until the branch history information for the new task execution session has been updated and the effects of state information retained from a previous session, or generated as an initialized state (e.g., a reset state), have been attenuated.
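By way of illustration only, the delayed analysis described above can be sketched in software as follows. The sketch is not the disclosed hardware; the names `AnalysisDelay`, `task_loaded`, `tick`, and `analysis_allowed`, and the delay length of 1000 ticks, are hypothetical and are chosen only for this example. One tick may stand for a clock cycle or for an executed instruction.

```python
class AnalysisDelay:
    """Postpones analysis until a fixed number of ticks has elapsed since task load."""

    def __init__(self, delay_ticks):
        self.delay_ticks = delay_ticks  # hypothetical predetermined delay length
        self.remaining = delay_ticks    # results are not analyzed until this reaches 0

    def task_loaded(self):
        # Called by a (hypothetical) hypervisor hook when a task is loaded:
        # the delay starts over for the new execution session.
        self.remaining = self.delay_ticks

    def tick(self, count=1):
        # Advance the delay by a number of clock cycles or executed instructions.
        self.remaining = max(0, self.remaining - count)

    def analysis_allowed(self):
        # Analysis is permitted only after the delay has expired.
        return self.remaining == 0


delay = AnalysisDelay(delay_ticks=1000)
delay.task_loaded()
delay.tick(999)
early = delay.analysis_allowed()  # still inside the delay window
delay.tick(1)
late = delay.analysis_allowed()   # delay has expired
```

In this sketch, a second call to `task_loaded()` before expiry simply reloads the full delay, so results from a short execution session are never analyzed.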
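The attenuation of stale branch state can be illustrated with a small simulation. The sketch below assumes a 2-bit saturating counter per branch, which is a common predictor organization but is an assumption of this example and is not stated in the present disclosure; the function names are likewise hypothetical.

```python
def predict(counter):
    # Counter values 2 and 3 predict "taken"; 0 and 1 predict "not taken".
    return counter >= 2


def update(counter, taken):
    # Saturating increment on taken, saturating decrement on not taken, within [0, 3].
    return min(3, counter + 1) if taken else max(0, counter - 1)


def prediction_correctness(stale_counter, outcomes):
    """Return, for each branch outcome, whether the prediction was correct,
    starting from counter state left behind by a previous session."""
    counter = stale_counter
    correct = []
    for taken in outcomes:
        correct.append(predict(counter) == taken)
        counter = update(counter, taken)
    return correct


# Stale state from another task: "strongly taken" (3).
# In the newly loaded task, the branch is never taken.
history = prediction_correctness(stale_counter=3, outcomes=[False] * 6)
# history is [False, False, True, True, True, True]
```

Under these assumptions the first two predictions reflect the previous session rather than the current task, which is why path identification based on correct predictions is postponed until the stale state has been overwritten.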
The foregoing and other objectives, features, and advantages of the invention will be apparent from the following, more particular, description of the preferred embodiment of the invention, as illustrated in the accompanying drawings.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives, and advantages thereof, will best be understood by reference to the following detailed description of the invention when read in conjunction with the accompanying Figures, wherein like reference numerals indicate like components, and:
The present invention encompasses techniques for program performance monitoring in computer systems in which program operation may be interrupted by context and/or task switching. Rather than saving the state of performance monitoring hardware, which may not be possible in some hardware designs, when program execution is resumed, a delay is commenced to postpone analysis of the performance monitoring results until sufficient performance monitoring has been performed for the current execution cycle, in order to ensure accuracy of the results. In a particular embodiment of the present invention, the performance monitoring collects trace segments from branch history information in order to locate program hotspots for optimization, or other reasons for which the trace segment information is desirable. The trace segment information is not gathered until the branch history information has been sufficiently updated for each new execution cycle, preventing information from previous execution cycles of other programs from generating invalid segment analysis results.
In the illustrated processor core 20, a performance monitoring unit 40 gathers information about operation of processor core 20, including performance measurements, which in the illustrative embodiment are trace segment analysis results gathered by a trace segment detector 37. Trace segment detector 37 uses branch prediction and branch prediction accuracy information provided by a branch history table 39, which receives information from a branch prediction unit 36. Branch prediction unit 36 may be provided only for performance monitoring, or may also be used for speculative execution or speculative pre-fetching by processor core 20.
As execution of a program proceeds, branch prediction unit 36 updates branch history table 39 with a list of branch instructions that have been encountered, an indication of the most likely branch result for each of the branch instructions, and optionally a confidence level of the branch prediction. Trace segment detector 37 uses the information in branch history table 39 to distinguish segments of programs, and to provide useful information such as the number of times a particular segment has been executed. Since, with a few exceptions, branch instructions completely delineate patterns of program flow in which all instructions in a given segment are executed when the segment is entered, it is only necessary to collect the branch information in order to completely describe the segments of a program. In the present invention, a mechanism prevents trace segment detector 37 from constructing segments, i.e., from analyzing the information in branch history table 39, until sufficient information has been updated for the current execution slice and/or program task session.
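As a simplified software sketch of the gating described above, and not the disclosed circuit, the following model updates a branch history table on every branch but builds segments only after a warm-up count has elapsed. Several simplifications are assumptions of this example: the table stores only the last observed direction as the "most likely" result, a segment is taken to be a run of consecutive correctly predicted branches, and the names `BranchHistoryTable`, `TraceSegmentDetector`, and `warmup_branches` are hypothetical.

```python
from collections import Counter


class BranchHistoryTable:
    def __init__(self):
        self.predicted = {}  # branch address -> last observed direction

    def predict_and_update(self, addr, taken):
        # Returns True when the stored prediction matched the actual outcome,
        # then records the outcome as the new prediction.
        hit = self.predicted.get(addr) == taken
        self.predicted[addr] = taken
        return hit


class TraceSegmentDetector:
    def __init__(self, table, warmup_branches):
        self.table = table
        self.warmup_left = warmup_branches  # hypothetical gating threshold
        self.current = []                   # segment currently being built
        self.counts = Counter()             # segment -> number of times observed

    def observe(self, addr, taken):
        hit = self.table.predict_and_update(addr, taken)
        if self.warmup_left > 0:
            # The table is still being refreshed for this session:
            # keep updating it, but construct no segments.
            self.warmup_left -= 1
            return
        if hit:
            self.current.append((addr, taken))
        else:
            self._close_segment()

    def _close_segment(self):
        if self.current:
            self.counts[tuple(self.current)] += 1
        self.current = []

    def finish(self):
        self._close_segment()


# A loop branch at address 0x10: taken three times, then not taken, repeated.
detector = TraceSegmentDetector(BranchHistoryTable(), warmup_branches=4)
for taken in [True, True, True, False] * 4:
    detector.observe(0x10, taken)
detector.finish()
# One distinct segment, ((0x10, True), (0x10, True)), observed 3 times.
```

The first four branch outcomes only refresh the table; segment counts begin with the fifth, so no segment is derived from predictions made during the warm-up.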
In the particular embodiment illustrated, timer 38 is started, or re-started, each time a “1” is written to a control register (or a bit in a control register, which is understood to be a one-bit control register). By providing a readback of a “1” at the control register that is independent of the true state of timer 38, the starting of timer 38 by a hypervisor (or other operating system or meta-operating system) that controls the task or context switching is automatically arranged, as long as the control register is part of the machine state saved at the context switch. Since, when the task is re-started, a value of “1” will always be written back to the control register, timer 38 will be started each time the context is switched. If the context is switched before timer 38 has expired, timer 38 will be restarted, which ensures that performance monitoring data will only be analyzed for execution intervals that are of sufficient duration. The timer period can be a programmable value or, as mentioned above, the delay may be based on another count, for example, a count of the number of times a particular instruction is executed, where the address of the particular instruction may be specified by a register that has been previously written by a program. Alternatively, the timer count may be incremented or decremented each time a branch instruction (or other type of instruction) is executed.
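The control register behavior described above can be modeled in software as follows. This is an illustrative sketch only: the class and function names and the period of 100 ticks are hypothetical, and the save and restore that would normally occur at switch-out and switch-in, respectively, are collapsed into a single `context_switch` function for brevity.

```python
class DelayTimer:
    def __init__(self, period):
        self.period = period  # hypothetical programmable delay, in ticks
        self.remaining = 0    # 0 means expired or never started

    def start(self):
        # Starting an already-running timer reloads the full period.
        self.remaining = self.period

    def tick(self, n=1):
        self.remaining = max(0, self.remaining - n)

    def expired(self):
        return self.remaining == 0


class ControlRegister:
    def __init__(self, timer):
        self.timer = timer

    def read(self):
        # Always reads back 1, independent of the true timer state.
        return 1

    def write(self, value):
        # Writing 1 starts (or restarts) the timer.
        if value == 1:
            self.timer.start()


def context_switch(reg):
    # Generic machine-state save and restore. Because the saved image of the
    # register is always 1, restoring it restarts the timer with no
    # special-case code in the hypervisor.
    saved = reg.read()
    reg.write(saved)


timer = DelayTimer(period=100)
reg = ControlRegister(timer)

context_switch(reg)           # task loaded: timer starts
timer.tick(60)
context_switch(reg)           # switched again before expiry: timer restarts
timer.tick(60)
still_waiting = not timer.expired()  # 40 ticks remain after the restart
timer.tick(40)
done = timer.expired()
```

Because the restart happens as a side effect of the ordinary register restore, an execution interval shorter than the period never reaches the expired state, and its data is never analyzed.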
The result of the above processing is a collection of segments in which only one instance of a branch instruction indication appears for each branch instruction reached, and that does not grow unless branch instructions are observed taking non-predicted directions. Further, a count is generally maintained that is incremented at each entry to a segment. Since branch prediction information is continually updated, if execution centers around one particular execution path, the count for that execution path will be much greater than the others, and can be targeted for optimization. The present invention ensures that stale branch prediction data is not used in forming the segments by using delay or other postponement of the segment formation. If the segment formation was not postponed, the segments formed in the method illustrated in
As noted above, portions of the present invention may be embodied in a computer program product, which may include firmware or an image in system memory or another memory/cache, or which may be stored on a fixed or re-writable medium such as an optical disc having computer-readable code stored thereon. Any combination of one or more computer readable medium(s) may store a program in accordance with an embodiment of the invention. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the context of the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
While the invention has been particularly shown and described with reference to the preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form and details may be made therein without departing from the spirit and scope of the invention.
Number | Name | Date | Kind |
---|---|---|---|
5381533 | Peleg et al. | Jan 1995 | A |
5875324 | Tran et al. | Feb 1999 | A |
5970439 | Levine et al. | Oct 1999 | A |
6170083 | Adl-Tabatabai | Jan 2001 | B1 |
6253338 | Smolders | Jun 2001 | B1 |
6351844 | Bala | Feb 2002 | B1 |
6513133 | Campbell | Jan 2003 | B1 |
6539500 | Kahle et al. | Mar 2003 | B1 |
6647491 | Hsu et al. | Nov 2003 | B2 |
6920549 | Ukai | Jul 2005 | B1 |
7000095 | Jeppesen et al. | Feb 2006 | B2 |
7076640 | Kadambi | Jul 2006 | B2 |
7103877 | Arnold et al. | Sep 2006 | B1 |
7243350 | Lindwer | Jul 2007 | B2 |
7315795 | Homma | Jan 2008 | B2 |
7490229 | Tremblay et al. | Feb 2009 | B2 |
7496908 | DeWitt et al. | Feb 2009 | B2 |
7603545 | Sunayama et al. | Oct 2009 | B2 |
7657893 | Armstrong et al. | Feb 2010 | B2 |
7694281 | Wang et al. | Apr 2010 | B2 |
7765387 | Sunayama et al. | Jul 2010 | B2 |
7949854 | Thaik et al. | May 2011 | B1 |
8042007 | Chan et al. | Oct 2011 | B1 |
8261244 | Pietrek | Sep 2012 | B2 |
8281304 | Kimura | Oct 2012 | B2 |
8407518 | Nelson et al. | Mar 2013 | B2 |
8612730 | Hall et al. | Dec 2013 | B2 |
20020066081 | Duesterwald et al. | May 2002 | A1 |
20050081107 | DeWitt et al. | Apr 2005 | A1 |
20050132363 | Tewari et al. | Jun 2005 | A1 |
20050210454 | DeWitt et al. | Sep 2005 | A1 |
20060005180 | Nefian et al. | Jan 2006 | A1 |
20060080531 | Sinha et al. | Apr 2006 | A1 |
20060168432 | Caprioli et al. | Jul 2006 | A1 |
20070294592 | Ashfield et al. | Dec 2007 | A1 |
20080086597 | Davis et al. | Apr 2008 | A1 |
20080171598 | Deng | Jul 2008 | A1 |
20080222632 | Ueno et al. | Sep 2008 | A1 |
20090037709 | Ishii | Feb 2009 | A1 |
20090254919 | Jayaraman et al. | Oct 2009 | A1 |
20100017791 | Finkler | Jan 2010 | A1 |
20100306764 | Khanna | Dec 2010 | A1 |
20110107071 | Jacob (Yaakov) | May 2011 | A1 |
20120005462 | Hall | Jan 2012 | A1 |
20120005463 | Mestan et al. | Jan 2012 | A1 |
20120254670 | Frazier et al. | Oct 2012 | A1 |
20120323806 | Abrams et al. | Dec 2012 | A1 |
20130055033 | Frazier et al. | Feb 2013 | A1 |
Entry |
---|
Rotenberg, et al., “Trace cache: a low latency approach to high bandwidth instruction fetching”, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2-4, 1996, pp. 24-34, xii+359, IEEE Comput. Soc. Press, Paris, FR. |
Jacobson, et al., “Trace Preconstruction”, Proceedings of 27th International Symposium in Computer Architecture, Jun. 14, 2000, pp. 37-46, ACM vi+328, Vancouver, BC, CA. |
Merten, et al., “A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization”, Proceedings of the 26th Annual International Symposium on Computer Architecture, May 2-4, 1999, pp. 136-148, IEEE Comp. Soc. Press, Atlanta, GA. |
Patel, et al., “Improving trace cache effectiveness with branch promotion and trace packing”, Proceedings of the 1998 25th Annual International Symposium on Computer Architecture, Jun. 27-Jul. 1, 1998, pp. 262-271, IEEE Computer Soc. Press, Barcelona, ES. |
Yeh, et al., “Increasing the instruction fetch rate via multiple branch prediction and a branch address cache”, ICS 1993 Proceedings of the 7th International Conference on Supercomputing, Jul. 1993, pp. 67-76, Tokyo, JP. |
Liu, “Predict Instruction Flow Based on Sequential Segments”, IBM Technical Disclosure Bulletin, Apr. 1991, pp. 66-69, vol. 33, No. 11. |
Mars, et al., “MATS: Multicore Adaptive Trace Selection”, IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 3rd Workshop on Software Tools for MultiCore Systems, Apr. 6, 2008, 6 pgs., Boston, MA. |
Merten, et al., “A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots”, Proceedings of the 27th Annual International Symposium on Computer Architecture, May 2000, pp. 59-70, vol. 28, Issue 2, Vancouver, BC, Canada. |
U.S. Appl. No. 12/828,697, filed Jul. 1, 2010, Hall, et al. |
Shi, et al., “Analyzing the Effects of Trace Cache Configurations on the Prediction of Indirect Branches”, Journal of Instruction-Level Parallelism, Feb. 2006, 24 pages (pp. 1-24 in pdf), Raleigh, NC. |
Zagha, et al., “Performance Analysis Using the MIPS R10000 Performance Counters”, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, Nov. 1996, 20 pages (pp. 1-20 in pdf), Pittsburgh, PA. |
Nataraj, et al., “The Ghost in the Machine: Observing the Effects of Kernel Operation on Parallel Application Performance”, International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2007, 12 pages (pp. 1-12 in pdf), Reno, NV. |
Anonymous, “Fast Identification of Previously Retrieved Callstacks”, ip.com document IPCOM000200962D, Nov. 2010, 3 pages (pp. 1-3 in pdf). |
Intel 64 and IA-32 Architectures Software Developers Manual, vol. 3A: System Programming Guide, Part 1, Mar. 2010, 812 pages (pp. 1-812 in pdf), US. |
Intel Itanium 2 Processor Reference Manual for Software Development and Optimization, May 2004, 196 pages (pp. 1-196 in pdf), US. |
“Intel 64 and IA-32 Architectures Optimization Reference Manual”, Jun. 2011, 774 pages (pp. 1-774 in pdf), US. |
Bala, et al., “Dynamo: A Transparent Dynamic Optimization System”, In Proceedings of Programming Language Design and Implementation (PLDI), 2000, pp. 1-12, US. |
Bond, et al., “Probabilistic Calling Context”, In Proceedings of Object Oriented Programming Systems Languages and Applications (OOPSLA) 2007, 15 pages (pp. 1-15 in pdf), US. |
Odaira, et al., “Efficient Runtime Tracking of Allocation Sites in Java”, In Proceedings of Virtual Execution Environments (VEE), 2010, 12 pages (pp. 1-12 in pdf), US. |
Lu, et al., “Design and Implementation of a Lightweight Dynamic Optimization System”, Journal of Instruction Level Parallelism, Apr. 2004, 24 pages (pp. 1-24 in pdf), US. |
Office Action in U.S. Appl. No. 12/828,697 mailed on Feb. 7, 2013, 27 pages (pp. 1-27 in pdf). |
Number | Date | Country | |
---|---|---|---|
20120254837 A1 | Oct 2012 | US |