Embodiments described herein relate to computing systems and more particularly, to performing real-time performance tracking utilizing dynamic compilers.
Managing power consumption in computing systems, integrated circuits (ICs), processors, and system-on-chips (SoCs) is increasingly important. In addition to power consumption, performance is another factor to be considered when utilizing computers and other types of processor-based electronic systems. Generally speaking, higher performance results in a higher amount of power consumed. Conversely, limiting the amount of power consumed limits the potential performance of a computer or other type of processor-based electronic system.
Programs and applications that execute on computing systems are typically generated from source code files written by a programmer. In some environments, source code is compiled into an intermediate type of code. One example of an intermediate type of code is “bytecode.” In some cases, the intermediate code is interpreted at runtime. In other cases, an additional compilation step is performed on the intermediate code. For example, dynamic compilers may perform just-in-time compilation to compile bytecode into native code during execution of the software application.
When a software application is being executed, a computing system may monitor the performance of the system hardware. Some computing systems include performance counters in the system hardware to track low-level events such as instructions executed per second. While tracking such events is useful in some cases, in other cases it would be desirable to be able to monitor and respond to higher-level events, such as application-level transactions.
Systems, apparatuses, and methods for performing real-time performance tracking utilizing dynamic compilation are contemplated.
In various embodiments, a performance target for a computing system is determined. In one embodiment, the performance target is specified in a service level agreement (SLA). In other embodiments, the performance target is specified by a user or otherwise. In one embodiment, the performance target specifies a percentage of the maximum performance for a given computing system. In another embodiment, the performance target is specified as a performance level, such as high, medium, or low. In a further embodiment, the performance target is specified according to various metrics such as transactions per second, round-trip latency, frames per second, request-response time, etc.
In one embodiment, a dynamic compiler is configured to analyze code of a software application and identify sequences of instructions deemed to correspond to higher-level transactions. In various embodiments, the dynamic compiler is a runtime compiler configured to receive and compile an intermediate type of code such as bytecode. The dynamic compiler inserts additional instructions in the code to track the high-level transactions. In one embodiment, the additional instructions convey an indication corresponding to the occurrence of the transaction (or “event”). For example, values indicative of the high-level application events are written to registers of the processor(s) of the computing system. A power optimization unit in the computing system utilizes the indication of events to determine if the computing system is meeting a specified performance target. If the computing system is exceeding the specified performance target, then the power optimization unit reduces the operating parameters (e.g., power performance state (P-state)) of one or more components of the computing system in order to reduce power consumption. If the computing system is not meeting the specified performance target, then the power optimization unit increases operating parameters of one or more components of the computing system in order to increase performance so that the specified performance target is met.
These and other features and advantages will become apparent to those of ordinary skill in the art in view of the following detailed descriptions of the approaches presented herein.
The above and further advantages of the methods and mechanisms may be better understood by referring to the following description in conjunction with the accompanying drawings, in which:
In the following description, numerous specific details are set forth to provide a thorough understanding of the methods and mechanisms presented herein. However, one having ordinary skill in the art should recognize that the various embodiments may be practiced without these specific details. In some instances, well-known structures, components, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the approaches described herein. It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.
Input/output memory management unit (IOMMU) 135 is also coupled to northbridge 120 in the embodiment shown. IOMMU 135 functions as a south bridge device in computing system 100. A number of different types of peripheral buses (e.g., peripheral component interconnect (PCI) bus, PCI-Extended (PCI-X), PCIE (PCI Express) bus, gigabit Ethernet (GBE) bus, universal serial bus (USB)) are coupled to IOMMU 135. Various types of peripheral devices 150A-N are coupled to some or all of the peripheral buses. Such peripheral devices include (but are not limited to) keyboards, mice, printers, scanners, joysticks or other types of game controllers, media recording devices, external storage devices, network interface cards, and so forth. At least some of the peripheral devices 150A-N that are coupled to IOMMU 135 via a corresponding peripheral bus may assert memory access requests using direct memory access (DMA). These requests (which include read and write requests) are conveyed to northbridge 120 via IOMMU 135.
In some embodiments, IC 105 includes a graphics processing unit (GPU) 140 that is coupled to display 145 of computing system 100. In some embodiments, GPU 140 is an integrated circuit that is separate and distinct from IC 105. Display 145 is a flat-panel liquid crystal display (LCD), a plasma display, a light-emitting diode (LED) display, or any other suitable display type. GPU 140 performs various video processing functions and provides the processed information to display 145 for output as visual information.
In some embodiments, memory controller 130 is integrated into northbridge 120. In some embodiments, memory controller 130 is separate from northbridge 120. Memory controller 130 receives memory requests conveyed from northbridge 120. Data accessed from memory 160 responsive to a read request is conveyed by memory controller 130 to the requesting agent via northbridge 120. Responsive to a write request, memory controller 130 receives both the request and the data to be written from the requesting agent via northbridge 120. If multiple memory access requests are pending at a given time, memory controller 130 arbitrates between these requests.
In one embodiment, power optimization unit 125 is integrated into northbridge 120. In other embodiments, power optimization unit 125 is separate from northbridge 120 and/or power optimization unit 125 is implemented as multiple, separate components in multiple locations of IC 105. Power optimization unit 125 includes one or more counters for tracking one or more high-level application metrics for software applications executing on IC 105.
In some embodiments, memory 160 includes a plurality of memory modules. Each of the memory modules includes one or more memory devices (e.g., memory chips) mounted thereon. In some embodiments, memory 160 includes one or more memory devices mounted on a motherboard or other carrier upon which IC 105 is also mounted. In some embodiments, at least a portion of memory 160 is implemented on the die of IC 105 itself. Embodiments having a combination of the aforementioned embodiments are also possible and contemplated. Memory 160 is used to implement a random access memory (RAM) for use with IC 105 during operation. The RAM implemented is static RAM (SRAM) or dynamic RAM (DRAM). Types of DRAM that are used to implement memory 160 include (but are not limited to) double data rate (DDR) DRAM, DDR2 DRAM, DDR3 DRAM, and so forth.
Although not explicitly shown in
In the embodiment shown, IC 105 includes a phase-locked loop (PLL) unit 155 coupled to receive a system clock signal. PLL unit 155 includes a number of PLLs configured to generate and distribute corresponding clock signals to each of processor cores 110 and to other components of IC 105. In this embodiment, the clock signals received by each of processor cores 110 are independent of one another. Furthermore, PLL unit 155 in this embodiment is configured to individually control and alter the frequency of each of the clock signals provided to respective ones of processor cores 110 independently of one another. The frequency of the clock signal received by any given one of processor cores 110 is increased or decreased in accordance with performance demands imposed thereupon. The various frequencies at which clock signals are output from PLL unit 155 may correspond to different operating points for each of processor cores 110. Accordingly, a change of operating point for a particular one of processor cores 110 is put into effect by changing the frequency of its respectively received clock signal.
In the case where changing the respective operating points of one or more processor cores 110 includes the changing of one or more respective clock frequencies, power optimization unit 125 changes the state of digital signals provided to PLL unit 155. Responsive to the change in these signals, PLL unit 155 changes the clock frequency of the affected processing node(s). Additionally, power optimization unit 125 also causes PLL unit 155 to inhibit a respective clock signal from being provided to a corresponding one of processor cores 110.
In the embodiment shown, IC 105 also includes voltage regulator 165. In other embodiments, voltage regulator 165 is implemented separately from IC 105. Voltage regulator 165 provides a supply voltage to each of processor cores 110 and to other components of IC 105. In some embodiments, voltage regulator 165 provides a supply voltage that is variable according to a particular operating point (e.g., increased for greater performance, decreased for greater power savings). In some embodiments, each of processor cores 110 shares a voltage plane. Thus, each processing core 110 in such an embodiment operates at the same voltage as the other ones of processor cores 110. In another embodiment, voltage planes are not shared, and thus the supply voltage received by each processing core 110 is set and adjusted independently of the respective supply voltages received by other ones of processor cores 110. Thus, operating point adjustments that include adjustments of a supply voltage are selectively applied to each processing core 110 independently of the others in embodiments having non-shared voltage planes. In the case where changing the operating point includes changing an operating voltage for one or more processor cores 110, power optimization unit 125 changes the state of digital signals provided to voltage regulator 165. Responsive to the change in the signals, voltage regulator 165 adjusts the supply voltage provided to the affected ones of processor cores 110. In instances in which power is to be removed from (i.e., gated) one of processor cores 110, power optimization unit 125 sets the state of corresponding ones of the signals to cause voltage regulator 165 to provide no power to the affected processing core 110.
In one embodiment, a dynamic compiler (not shown) is configured to analyze the instructions of a software application executing on computing system 100. The dynamic compiler detects instructions which are indicative of a high-level application metric during execution of the software application. The dynamic compiler then modifies the user software application by adding one or more additional instructions to track the high-level application metric. In one embodiment, the additional instruction(s) are used to increment a counter. In another embodiment, the additional instruction(s) are used to generate timing information associated with one or more events. The additional instruction(s) are then executed to track the high-level application metric. Power optimization unit 125 then determines if the software application is meeting a performance target based on a value of the high-level application metric. In some cases, power optimization unit 125 is configured to simultaneously monitor a plurality of high-level application metrics to determine if the software application is meeting a performance target.
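The instrumentation step described above can be sketched as follows. This is an illustrative model only, not the implementation of any embodiment: the instruction encoding, the `INC_COUNTER` pseudo-instruction, and the way transaction heads are identified are all assumptions made for the sketch.

```python
# Illustrative sketch: a dynamic-compiler pass that inserts a
# counter-increment instruction at the head of each code region deemed
# to correspond to a high-level transaction. The tuple-based IR and the
# INC_COUNTER pseudo-instruction are hypothetical.

def instrument(code, transaction_heads):
    """Return a copy of `code` with an INC_COUNTER pseudo-instruction
    inserted before each instruction index listed in `transaction_heads`."""
    out = []
    for idx, instr in enumerate(code):
        if idx in transaction_heads:
            out.append(("INC_COUNTER", 0))  # counter 0 tracks this metric
        out.append(instr)
    return out

# A toy loop body; index 0 is assumed to be the transaction entry point.
code = [("LOAD", "x"), ("ADD", 1), ("STORE", "x"), ("BRANCH", 0)]
instrumented = instrument(code, transaction_heads={0})
```

Each execution of the instrumented region then increments the hardware counter that the power optimization unit monitors.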
In various embodiments, computing system 100 is a computer, laptop, mobile device, server, web server, cloud computing server, storage system, or other types of computing systems or devices. It is noted that the number of components of computing system 100 varies from embodiment to embodiment. There can be more or fewer of each component/subcomponent than the number shown in
Turning now to
When native code 230 executes, the instruction inserted by dynamic compiler 225 is used to increment one or more counters 240 of host hardware 235. Power optimization unit 245 of host hardware 235 is configured to monitor counters 240 and determine if the execution of a software application is meeting a specified performance target based on the value of counters 240. Power optimization unit 245 is configured to adjust the parameters of host hardware 235 to increase or decrease performance based on a comparison between the values of counters 240 and the performance target. The one or more parameters include a number of active processor cores, processor voltage, processor frequency, northbridge power state, memory frequency, and/or other parameters.
Increasing the performance of the system hardware includes one or more of increasing the number of active processor cores, increasing the voltage and/or frequency supplied to the processor core(s), increasing the memory frequency, increasing the northbridge power state, and/or one or more other actions.
Referring now to
Power optimization unit 310 is configured to program PLL unit 330 and regulator 335 to generate clock signals and supply voltages for components 340A-N which will enable software executing on components 340A-N to meet performance target 325. In one embodiment, performance target 325 is specified by a user. For example, performance target 325 is extracted from a service level agreement (SLA). Alternatively, a user selects performance target 325 from a plurality of possible performance targets generated and presented to the user in a graphical user interface by a host system or apparatus.
Power optimization unit 310 includes control unit 315 and counters 320A-N for determining how to program PLL unit 330 and regulator 335. In one embodiment, a software application executing on components 340A-N of host hardware 300 is configured to write to or increment counters 320A-N as various high-level application events occur. In some embodiments, a dynamic compiler (e.g., dynamic compiler 225 of
Control unit 315 is configured to monitor counters 320A-N and determine if performance target 325 is being met. In one embodiment, control unit 315 attempts to reach performance target 325 while minimizing power consumption of components 340A-N. Control unit 315 monitors whichever of counters 320A-N are active for a given embodiment. In some embodiments, only a single one of counters 320A-N is utilized. For example, in one embodiment, only transactions per second is tracked using a single counter 320 for a given software application. In other embodiments, control unit 315 simultaneously monitors multiple counters 320A-N to determine if performance target 325 is being met. It is noted that depending on the embodiment, counters 320A-N are implemented using registers, counters, or any other suitable storage elements.
In one embodiment, control unit 315 performs a direct comparison of one or more counters 320A-N to the performance target 325. For example, performance target 325 specifies a given number of transactions per second and a given counter 320 tracks the number of transactions per second being performed on host hardware 300 for a given software application. In another embodiment, control unit 315 performs a translation or conversion of the values of one or more of counters 320A-N to determine if performance target 325 is being met. For example, performance target 325 specifies a level (e.g., high, medium, low) or a percentage (e.g., 50%, 70%) of maximum performance, and control unit 315 converts one or more of counters 320A-N to a value which can be compared to performance target 325. In some embodiments, control unit 315 combines multiple values from counters 320A-N utilizing different weighting factors to create a single value which can be compared to performance target 325. In some embodiments, a calibration procedure is performed on host hardware 300 with power optimization unit 310 tracking various metrics during different periods of time when host hardware 300 is operated at a highest operating point and a lowest operating point. Power optimization unit 310 and control unit 315 then utilize interpolation to determine values and metrics associated with other operating points in between the highest and lowest operating points. In some cases, the calibration procedure is performed at intermediate operating points rather than just the highest and lowest operating points.
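The weighted combination of counter values described above can be sketched as follows. The specific metrics, weights, and target value are illustrative assumptions; the patent does not prescribe particular weighting factors.

```python
# Illustrative sketch: combine multiple counter values into a single
# value comparable against one performance target. Metrics and weights
# are hypothetical examples.

def combined_metric(counter_values, weights):
    """Weighted sum of per-metric counter values."""
    assert len(counter_values) == len(weights)
    return sum(v * w for v, w in zip(counter_values, weights))

# e.g., transactions/sec weighted more heavily than frames/sec
value = combined_metric([900.0, 30.0], [0.8, 0.2])
meeting_target = value >= 700.0
```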
Turning now to
In one embodiment, a dynamic compiler is configured to identify an outermost loop (e.g., the sequence B→ . . . →N→B) in program code (e.g., intermediate code) during run-time compilation. In some embodiments, the compiler is configured to generate a control flow graph(s) based on analysis of the program code and identify loops based on the control flow graph. In various embodiments, such an outermost loop is deemed to correspond to a high-level transaction. In various embodiments, loops other than the outermost loop are also identified and deemed to correspond to transactions. As one example, the dynamic compiler is configured to identify a particular portion of program code as corresponding to a particular type of database transaction. In various embodiments, such a portion is identified based on indications within the code itself (e.g., instructions that identify a particular module, function, procedure, library, method, etc.). By identifying an outermost loop within the portion, the compiler designates that loop as representing the database transaction. As may be appreciated, the actual number of low-level operations or instructions that make up such a transaction can vary significantly and can be relatively large.
One metric that is used to capture performance of a software application is transaction throughput. Given a selected performance target, a control unit (e.g., power optimization unit 310 of
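Transaction throughput of this kind can be computed from two successive counter snapshots, as in the following sketch (the sampling interval and counter values are illustrative):

```python
# Illustrative sketch: derive transactions per second from two
# snapshots of a hardware counter taken dt_seconds apart.

def transactions_per_second(count_now, count_prev, dt_seconds):
    """Throughput over the sampling interval."""
    return (count_now - count_prev) / dt_seconds

tps = transactions_per_second(1500, 500, 2.0)  # -> 500.0
```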
As noted above, a dynamic compiler (e.g., dynamic compiler 225 of
Referring now to
In various embodiments, a dynamic compiler (e.g., dynamic compiler 225 of
Having determined a transaction of interest, the dynamic compiler analyzes the software application (program code) to identify and/or build a dominator tree of the software application being executed (block 510). Generally speaking, when building and analyzing control flow graphs, a dominator tree is a tree in the control flow graph where each node dominates its child nodes. A first node is said to dominate a second node if every path from an entry node to the second node must pass through the first node. Having built a control flow graph for the program code, the dynamic compiler detects back edges (i.e., branches within the code that branch back to an earlier point in the program code) of the software application (block 515). Such back edges are indicative of loops in the software application that represent a repeated operation or transaction. Next, the dynamic compiler finds the outermost loop of the software application (block 520). These outermost loops represent higher-level transactions within the program code. In some embodiments, each iteration of this loop is designated a dynamic compiler transaction (DCT), or “transaction”. Having identified such a transaction, the dynamic compiler modifies the program code (block 525) by inserting instructions in the code that are configured to monitor the identified transaction(s).
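The dominator and back-edge analysis of blocks 510-520 can be sketched with a standard iterative dominator computation over a control flow graph: an edge u→v is a back edge when v dominates u. The dictionary-based CFG encoding below is an assumption of the sketch; real dynamic compilers operate on IR-specific structures.

```python
# Illustrative sketch of blocks 510-520: compute dominator sets over a
# control flow graph, then classify back edges (u -> v where v
# dominates u), which identify loop headers.

def dominators(cfg, entry):
    """cfg: {node: [successor nodes]}. Returns {node: set of dominators}."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for u, succs in cfg.items():
        for v in succs:
            preds[v].add(u)
    dom = {n: set(nodes) for n in nodes}  # start with "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for n in nodes - {entry}:
            new = {n} | (set.intersection(*(dom[p] for p in preds[n]))
                         if preds[n] else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

def back_edges(cfg, entry):
    dom = dominators(cfg, entry)
    return [(u, v) for u, succs in cfg.items() for v in succs
            if v in dom[u]]  # target dominates source -> loop back edge

# A -> B -> C -> B forms a loop; C -> D is the loop exit.
cfg = {"A": ["B"], "B": ["C"], "C": ["B", "D"], "D": []}
print(back_edges(cfg, "A"))  # the edge ("C", "B") marks loop header B
```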
When determining which loops/transactions to monitor, the dynamic compiler is configured to consider additional factors. For example, in some embodiments the compiler is configured to identify loops or transactions that are “hotter” than others. Generally speaking, a loop or transaction is considered hotter than another if it is repeated more often. If a loop is determined to be relatively hot (conditional block 530, “yes” leg), then the dynamic compiler modifies the program code to track that loop during execution. In some embodiments, a loop is deemed hot if the number of iterations exceeds a threshold number of iterations. As described above, such tracking involves conveying an indication that the loop or transaction has occurred, or has occurred a given number of times. In response, the system alters performance parameters depending on whether a performance target is being met. For example, a performance target is N transactions per a given interval. The received indication indicates the monitored transaction has occurred (N−M) times during the given interval. As such, the performance target is not being met and the system alters performance parameters to increase performance. Altering performance parameters to increase performance includes one or more of increasing an operating frequency, increasing allocation of various resources, or otherwise.
In various embodiments, the “count” of transactions that is indicated is modified when determining whether a performance target is being met. In some embodiments, the system is configured to convert the software performance target ‘St’ to a dynamic compiler transaction target (DCTt) (block 535). Such a conversion is performed by the compiler or by another entity in the system. In some embodiments, this DCTt is assigned as the performance target level (block 545). For example, it is determined that the maximum performance level of the system for a given transaction type is DCTmax and it is determined that the performance target represents a given percent, r, of this maximum performance level. Therefore, DCTt=r×DCTmax. Additionally, the dynamic compiler inserts a transaction increment instruction(s) in the program code (e.g., the first node of the loop) (block 555).
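The conversion of block 535 can be illustrated with example numbers (the DCTmax and ratio values below are hypothetical, chosen only to show the arithmetic):

```python
# Worked example of block 535: the dynamic compiler transaction target
# is the fraction r of the measured maximum transaction rate.
DCT_MAX = 1000           # transactions/sec at the highest operating point
r = 0.70                 # performance target: 70% of maximum
DCT_TARGET = r * DCT_MAX # -> 700.0 transactions/sec
```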
On the other hand, if an identified loop is not deemed hot (conditional block 530, “no” leg), then the dynamic compiler moves one level deeper (block 540) to a next inner level loop to examine a different loop. If the newly identified loop has a back edge (conditional block 550, “yes” leg), then the method returns to block 525 to profile the code. If the loop does not have a back edge (conditional block 550, “no” leg), then the dynamic compiler concludes that no hot transaction has been found (block 560). After block 560, method 500 ends.
Turning now to
Next, the computing system analyzes a software application to identify a first high-level application event which matches the first performance metric (block 615). For example, if the first performance metric is a specified number of transactions per second, then the computing system analyzes the software application to detect a corresponding transaction. In one embodiment, the computing system utilizes a dynamic compiler to analyze intermediate program code of the software application in order to identify the first high-level application event/transaction which matches or otherwise corresponds to the first performance metric.
Subsequent to identifying the transaction, the computing system inserts one or more instructions into the software application to track the first high-level application event (block 620). In one embodiment, the computing system inserts an instruction(s) to increment a count responsive to detecting an occurrence of the first high-level application event. Next, the computing system monitors the first high-level application event to determine if the performance target is being met (block 625). In some embodiments, the computing system converts a count of the first high-level application event and counts of any number of other high-level application events into a value that corresponds to a performance target. For example, the performance target is specified as a percentage of the maximum software performance of the system. In this example, the count of the first high-level application event is converted into a percentage by dividing the count by the maximum attainable event frequency when the system is operating at peak performance. In other embodiments, other techniques for translating the count of the first high-level application event into a value that corresponds to a performance target are possible and are contemplated. Alternatively, in another embodiment, the performance target is converted into a value that corresponds to the count of the first high-level application event.
If the performance target is being met (conditional block 630, “yes” leg), then the computing system reduces one or more parameters of the computing system to reduce performance and to reduce power consumption (block 635). Otherwise, if the performance target is not being met (conditional block 630, “no” leg), then the computing system increases one or more parameters of the computing system to increase performance and to increase power consumption (block 640). It is noted that if a frequency (or other value) of the first high-level application event is within a given range of the performance target in conditional block 630, then the computing system maintains the current state of the system hardware, rather than increasing or reducing the one or more parameters. After blocks 635 and 640, method 600 returns to block 625 with the computing system continuing to monitor the first high-level application event to determine if the performance target is being met. It is noted that the computing system tracks and monitors multiple high-level application events in other embodiments to determine if the performance target is being met.
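The decision of conditional block 630, including the dead band in which the current hardware state is maintained, can be sketched as follows. The tolerance value is a hypothetical assumption; embodiments may use any suitable range.

```python
# Illustrative sketch of blocks 630-640 with a dead band: hold the
# current operating point when the measured event rate is within a
# tolerance of the target. The 5% tolerance is hypothetical.

def adjust(measured, target, tolerance=0.05):
    """Return 'increase', 'reduce', or 'hold' for operating parameters."""
    if measured < target * (1 - tolerance):
        return "increase"   # below target: raise P-state / frequency
    if measured > target * (1 + tolerance):
        return "reduce"     # above target: lower P-state to save power
    return "hold"           # within range: maintain current state

assert adjust(600, 700) == "increase"
assert adjust(800, 700) == "reduce"
assert adjust(710, 700) == "hold"
```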
Referring now to
Next, the computing system is operated at a minimum hardware configuration for a given period of time (block 720). The minimum configuration corresponds to a lowest power state of the computing system where the computing system is still operable to execute applications. Similar to the above, the minimum configuration is with respect to a given set of resources of the system. While the computing system is being operated at its minimum configuration, the computing system monitors one or more high-level application metrics (block 725). Then, the computing system records the value(s) of the one or more high-level application metrics after the given period of time (block 730).
Next, the computing system receives a performance target (block 735). In one embodiment, the performance target is specified using one or more high-level application metrics. In one embodiment, the performance target is specified in a license agreement or service level agreement. Then, the computing system calculates a system configuration that will meet the performance target based on the recorded values of the high-level application metrics for the maximum and minimum hardware configurations (block 740). For example, in one embodiment, the computing system utilizes linear interpolation to calculate which system configuration will meet the performance target. For example, if 100 transactions were executed per second at the maximum configuration, if 40 transactions were executed per second at the minimum configuration, and the performance target specified 70 transactions per second, then the system configuration is set to a midpoint configuration to meet the performance target of 70 transactions per second. It is noted that in other embodiments, method 700 operates the computing system at multiple different configurations rather than just the maximum and minimum configuration. After block 740, method 700 ends.
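The linear-interpolation step of block 740 can be sketched with the example numbers above (100 transactions per second at the maximum configuration, 40 at the minimum, and a target of 70):

```python
# Illustrative sketch of block 740: linearly interpolate between the
# calibrated minimum and maximum configurations to find the fraction of
# the configuration range expected to meet the performance target.

def interpolate_config(perf_min, perf_max, target):
    """Return how far (0.0..1.0) from the minimum configuration toward
    the maximum configuration the system should be set."""
    return (target - perf_min) / (perf_max - perf_min)

fraction = interpolate_config(40, 100, 70)  # -> 0.5, the midpoint
```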
In various embodiments, program instructions of a software application are used to implement the methods and/or mechanisms previously described. The program instructions describe the behavior of hardware in a high-level programming language, such as C. Alternatively, a hardware design language (HDL) is used, such as Verilog. The program instructions are stored on a non-transitory computer readable storage medium. Numerous types of storage media are available. The storage medium is accessible by a computing system during use to provide the program instructions and accompanying data to the computing system for program execution. The computing system includes at least one or more memories and one or more processors configured to execute program instructions.
It should be emphasized that the above-described embodiments are only non-limiting examples of implementations. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.