Embodiments of the present invention relate to computer systems and, more particularly, to the effective use of the resources of such systems.
Computer systems execute various software programs using different hardware resources of the system, including a processor, memory and other such components. A processor itself includes various resources including one or more execution cores, cache memories, hardware registers, and the like. Certain processors also include hardware performance counters that are used to count events or actions occurring during program execution. For example, certain processors include counters for counting memory accesses, cache misses, instructions executed and the like. Additionally, performance monitors may also exist in software to monitor execution of one or more software programs.
Together, such counters and monitors can be used according to different usage models. As an example, they may be used during compilation and other optimization activities to improve code execution based upon profile information obtained during program execution. The collection of profile information for use in feedback-directed dynamic optimization has grown tremendously in importance in recent years, as significant amounts of new software are being written in managed languages. Traditional feedback-directed optimization techniques rely on instrumenting a program to collect profiles, which requires compiling the program with inserted hooks to collect the data, running it with high overhead, and then recompiling with the profile information to obtain a production binary. Moreover, instrumentation code cannot collect information about a behavior that it cannot directly observe, such as hardware memory cache behavior. In another usage model, upon occurrence of an event in a counter or monitor during program execution, one or more helper threads may be called. Such helper threads are software routines that are called by a calling program to improve execution, such as to prefetch data from memory or to perform another activity that improves program execution.
Oftentimes, these resources are used inefficiently; furthermore, their use in the different usage models can conflict. A need thus exists for improved manners of obtaining and using monitors and performance information in these different usage models.
Referring now to
Monitor 40 may include various programmable logic, software and/or firmware to track activities in performance counters 45 and channels 50a-50d. Channels 50a-50d may be register-based storage media, in one embodiment. A channel is an architectural state that includes a specification and occurrence information for a scenario, as will be discussed below. In various embodiments, a core may include one or more channels. There may be one or more channels per software thread, and channels may be virtualized per software thread. Channels 50a-50d may be programmed by monitor 40 for various usage models, including performance-guided optimization (PGO), or in connection with improving program performance via the use of helper threads or the like.
While shown as including four such channels in the embodiment of
Still referring to
Referring now to
A scenario defines a composite condition. In other words, a scenario defines one or more performance events or conditions that may occur during execution of instructions in a processor. These events or conditions, which may be a single event or a set of events or conditions, may be architectural events, microarchitectural events or a combination thereof, in various embodiments. Scenarios thus define what can be detected and stored in hardware, and presented to software. A scenario includes a triggering condition, such as the occurrence of multiple conditions during program execution. While these multiple conditions may vary, in some embodiments the conditions may relate to low progress indicators and/or other microarchitectural or structural details of actions occurring in execution resources 22, for example. The scenario may also define processor state data available for collection, reflecting the state of the processor at the time of the trigger. In various embodiments, scenarios may be hard-coded into a processor. In these embodiments, scenarios that are supported by a specific processor may be discovered via an identification instruction (e.g., the CPUID instruction in an x86 instruction set architecture (ISA), hereafter an “x86 ISA”).
A service routine is a per-scenario function that is executed when a yield event occurs. As shown in
Still referring to
Software programs hardware with a scenario, which causes the hardware to detect predefined events and collect predefined information. The software may thus configure the hardware initially, and then start, pause, resume, and stop collections. In some embodiments, a separate software routine, i.e., a service routine, may perform data collection. Sampling collection mechanisms may include initializing a channel, collecting a profile sample and/or reading an event count, and modifying a previously programmed channel to pause, resume, stop, or modify a scenario's current parameters.
Returning now to
As shown in
In various embodiments, profiling software 80 programs a light-weight, user-level control yield mechanism in processor 10 to monitor specific hardware events (i.e., scenarios). When a scenario triggers (i.e., yields), the processor calls a service routine, which itself may be within profiling software 80. The service routine may collect information about the hardware's state and buffer it for later delivery to, for example, DPGO system 90. The service routine may also act on the information directly before returning to the planned stream of execution. The light-weight control yield, i.e., an asynchronous transfer, may cause a transfer from the planned stream of execution in a software thread to a service routine function defined by a channel and back to the planned stream of execution without operating system (OS) involvement. In other words, this user-level interrupt bypasses the OS entirely, enabling finer-grained communication and synchronization transparently to the OS. Thus, an interrupt caused upon triggering of a scenario (e.g., a yield) is handled internally by user-level software. Accordingly, there is no external interrupt to the OS from the user-level software, and the yield mechanism is performed in a single privilege level. For example, OS activities may be implemented in a first privilege level (e.g., ring 0) while user-level activities may be implemented in a second privilege level (e.g., ring 3). Using embodiments of the light-weight yield mechanism, upon a yield event, control may pass from one ring 3 program directly to another function in the same ring 3 program, avoiding the need for drivers or other mechanisms to cause an OS-visible interrupt.
Referring now to
Still referring to
As shown in Table 1, first the YBB is set, and then a register (i.e., ECX) may be set up and an instruction to read the current channel (i.e., EREAD) may be executed to determine whether the current channel is available. Specifically, if the valid bit of the current channel equals zero, the current channel is available; accordingly, the routine of Table 1 is exited and the identifier of the available channel is returned. Note that by setting a match bit to zero, processor state information is not written during the EREAD instruction in the routine of Table 1.
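For purposes of illustration only, one form such a channel-search sequence might take is sketched below. This sketch is not the actual sequence of Table 1; the register roles, the bit positions of the valid bit, and the operand form of the EWYB instruction are assumptions, as the precise encodings are implementation specific.

        EWYB    1                    ; set the yield block bit (YBB)
        xor     ecx, ecx             ; channel ID = 0, match bit = 0
find_loop:
        EREAD                        ; read state of channel ECX (assumed
                                     ; returned in EAX; match bit is zero,
                                     ; so no processor state is written)
        test    eax, 1               ; valid bit (assumed bit 0) clear?
        jz      found                ; yes: channel ECX is available
        inc     ecx                  ; otherwise try the next channel
        cmp     ecx, NUM_CHANNELS    ; NUM_CHANNELS: implementation-specific
        jb      find_loop            ; channel count (four in this example)
        ; no available channel: handle the error as appropriate
found:
        mov     eax, ecx             ; return the available channel ID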
Referring back to
Still referring to
In some embodiments, a channel may be programmed using a single instruction, such as the EMONITOR instruction. Three choices may be involved in programming a channel: selecting a scenario, selecting a sample-after value, and selecting between profiling and counting. First, a scenario may be selected that monitors a hardware event of interest. During operation, when this hardware event occurs, it may be counted if the channel is configured to count.
If the channel is to be used for profiling, a sample-after value is selected. The sample-after value describes the number of hardware events (defined by the scenario) that are to occur before an underflow bit is set. A yield is not taken until the underflow bit is already set and another triggering condition occurs. If a non-sampled profile is desired, in which the yield event is to be taken on every instance of the triggering condition, the underflow bit is pre-set to one so that a sample is taken upon the first instance and every subsequent instance of the triggering condition. If instead a sampled profile is desired, the underflow bit can be set to zero, and the counter can be set to the sample-after value. The sample-after value thus determines when a scenario's counter will underflow and, if the channel is configured to profile, when the channel will yield. For example, if a sample-after value of 100 is programmed, 100+2+X (where X is a small number dependent on a hardware implementation) hardware events will occur before the channel yields (that is, 100 events cause the counter to reach 0, an additional event sets the underflow bit, and one more event causes the yield to occur).
Finally, programming may select between counting events and profiling based on the event. Counting events can be used to characterize the behavior of the processor. Profiling based on a hardware event can be used to determine what code the processor was executing when the yield occurred. In some embodiments, counting may be a lower-overhead operation than profiling. If counting is selected, the action bits can be set to 0 (e.g., such that yields will not occur) and the sample-after value set to the maximum value (e.g., 0x7FFFFFFF). If profiling is selected, the action bits can be set to 1 (e.g., causing a yield). Upon programming a channel, the valid bit may be set to indicate that the channel has been programmed (block 150). In some implementations, the valid bit may be set during programming (e.g., via a single instruction that programs the channel and sets the valid bit). Finally, the yield bit set prior to programming may be cleared (block 160). While described with this particular implementation in the embodiment of
The following pseudo-code sequence illustrates how to program a channel in accordance with one embodiment. As shown in Table 2, first multiple registers may be loaded with desired channel information. Then a single instruction, namely the EMONITOR instruction in the x86 ISA, may program the selected channel with that information. As shown in Table 2, the EAX, EBX, ECX, and EDX registers may first be set up before calling a programming instruction such as the EMONITOR instruction.
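By way of illustration, a sketch of what such a programming sequence might look like follows. This is not the actual sequence of Table 2; the assignment of channel fields to the EAX, EBX, ECX, and EDX registers is an assumption.

        mov     eax, SCENARIO_ID | ACTION_BITS  ; scenario selection plus
                                                ; action bits (1 = profile)
        mov     ebx, service_routine            ; service routine address
        mov     ecx, CHANNEL_ID                 ; channel to be programmed
        mov     edx, SAMPLE_AFTER               ; sample-after value
        EMONITOR                                ; program the channel and
                                                ; set its valid bit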
Referring now to
If instead at diamond 230 it is determined that processor state matches one or more scenarios, control passes to block 240. There, a yield event request (YER) indicator for the channel or channels corresponding to the matching scenario(s) may be set (block 240). The YER indicator may thus indicate that the associated scenario programmed into a channel has met its composite condition.
Accordingly, the processor may generate a yield event for the highest priority channel having its YER indicator set (block 250). When a channel is programmed to profile, it will yield when its scenario triggers. This yield event transfers control to a service routine having its address programmed in the selected channel. Accordingly, next the service routine may be executed (block 260). Implementations of executing a service routine will be discussed further below. Note that, prior to calling the service routine, i.e., during a yield, the processor may push various values onto a user stack, where at least some of the values are to be accessed by the service routine(s). Specifically, in some embodiments the processor may push the current instruction pointer (EIP) onto the stack. Also, the processor may push control and status information such as a modified version of a condition code or conditional flags register (e.g., an EFLAGS register in an x86 environment) onto the stack. Still further the processor may push the channel ID of the yielding channel onto the stack.
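For illustration, a sketch of a service routine prologue that reads these pushed values follows; the push order is taken from the description above (EIP, then EFLAGS, then channel ID), while the exact stack layout is an assumption.

service_routine:
        mov     eax, [esp]          ; channel ID of the yielding channel
        mov     ebx, [esp+4]        ; modified EFLAGS image
        mov     edx, [esp+8]        ; EIP of the interrupted instruction
        ; ... collect data keyed by EIP and channel ID, then return to
        ; the planned stream of execution (e.g., via ERET) ...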
Upon completion of the service routine, it may be determined whether additional YER indicators are set (diamond 270). If not, method 200 may return to block 210, discussed above. If instead additional YER indicators are set, control may pass from diamond 270 back to block 250, discussed above.
In different embodiments, service routines may take many different forms. Some service routines may be used to collect profile data, while other service routines may be used to improve program performance, e.g., via prefetching data. In any event, a service routine may execute certain high-level functions. Referring now to
Still referring to
When collecting data, a decision is made between collecting channel state data only and collecting both channel and processor state data. The pseudo-code sequence shown in Table 3 illustrates an embodiment of collecting data. Of course, other implementations are possible.
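By way of illustration, the following sketch shows one form such a collection might take. It is not the actual sequence of Table 3; the use of a match bit in ECX to request processor state, and the passing of a save-area address in EDI, are assumptions.

        mov     ecx, CHANNEL_ID     ; channel that yielded
        or      ecx, MATCH_BIT      ; request processor state data as well
        mov     edi, sample_buffer  ; assumed destination for the data
        EREAD                       ; collect channel and processor state
        ; omitting MATCH_BIT would collect channel state data only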
With reference still to
Finally with reference to
In some implementations once a yield has occurred, it is possible to determine if other yields are pending. For example, while executing the service routine for the channel that yielded, the state of the other channels can be read (e.g., via an EREAD instruction). If another channel's YER bit is set, that channel's scenario has triggered and a call to its service routine is pending. Data can be collected and the channel can be reprogrammed. The yield can remain pending if the channel's YER bit is not cleared.
Using this mechanism, it is possible to reduce service routine overhead by avoiding some transitions into service routines. But due to DCM, software cannot make assumptions about which channels it owns. A channel's service routine address can be used as a unique identifier if each channel is programmed with a different service routine. Such an address is unique within a specific software thread (assuming that channels are virtualized on a per software thread basis); assuming further that each software thread lives in the context of a single process, the service routine address is guaranteed to be unique.
Therefore, to handle multiple yields in a single service routine, each channel may be programmed with a unique service routine address. Then, before handling a pending yield, the channel's service routine address may be matched against the service routines previously programmed. Even if multiple channels share the same service routine code, the uniqueness of each service routine address can still be enforced by having the first instruction at each (or all but one) service routine entry point be a jump or a call to the common service routine, as sketched below.
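For illustration, such per-channel entry stubs might be arranged as follows; the stub and routine names are hypothetical.

stub_channel_a:                     ; unique address programmed into one channel
        jmp     common_service_routine
stub_channel_b:                     ; unique address programmed into another
        jmp     common_service_routine
common_service_routine:
        ; shared service routine code executed for either channel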
As described above, when a channel is programmed to count hardware events, it will not yield (since its action bits are cleared). Instead, software threads can periodically or at appropriate moments (e.g., entry/exit of a method) read the channel state to obtain its current hardware event count. Before a software thread reads a hardware event count, it must find the channel programmed with the appropriate scenario. Due to DCM, active scenarios may migrate to other channels. If a unique service routine address is programmed into each channel, the service routine address returned, e.g., via the EREAD instruction, can be used to uniquely identify the correct channel. The pseudo-code sequence shown in Table 5 may be used to find the channel currently programmed with a specific scenario and to save the current hardware event count.
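By way of illustration, the following sketch shows one form such a search might take. It is not the actual sequence of Table 5; the assumptions that EREAD returns the programmed service routine address in EBX and the current event count in EDX carry over from the earlier sketches.

        xor     ecx, ecx            ; start at channel 0, match bit = 0
scan:
        EREAD                       ; read state of channel ECX
        cmp     ebx, my_routine     ; programmed with our unique routine?
        je      have_channel
        inc     ecx
        cmp     ecx, NUM_CHANNELS
        jb      scan
        ; scenario not found in any channel: handle as appropriate
have_channel:
        mov     [event_count], edx  ; save the current hardware event count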
If the event count is negative, the counter has underflowed and the channel may be re-programmed. The pseudo-code sequence of Table 6 illustrates one embodiment of hardware event count accumulation and channel reprogramming (if necessary).
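For illustration, such accumulation and reprogramming might look as follows. This is not the actual sequence of Table 6; the down-counting behavior (events observed equal the programmed value minus the current count) and the variable names are assumptions.

        mov     edx, [event_count]    ; count saved by the search above
        mov     eax, 0x7FFFFFFF       ; value programmed for counting mode
        sub     eax, edx              ; events observed since programming
        add     [total_events], eax   ; accumulate into a running total
        test    edx, edx
        jns     done                  ; non-negative: no underflow occurred
        ; counter underflowed: re-program the channel as in the Table 2
        ; sketch (ECX is assumed to still hold the channel ID)
        mov     eax, SCENARIO_ID      ; action bits 0: count, do not yield
        mov     ebx, my_routine       ; keep the unique identifying address
        mov     edx, 0x7FFFFFFF       ; reset the maximum sample-after value
        EMONITOR
done: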
The above code assumes the channel will be read before multiple underflows occur. If multiple underflows are a possibility, the action bits can be set to 1 and a service routine can be used to handle each underflow when it occurs.
Sometimes, pausing data collection may be desired. Pausing a profiling collection can be done in two different ways. To pause a collection completely, the action bits may be cleared in the appropriate channel. When the action bits are clear, the channel will continue to count but will not yield. To resume the collection, the appropriate channel's action bits may be set to 1. In order not to distort sampling intervals, the count value may be saved upon a pause and restored when use of the channel is resumed. If the YER bit of a channel is set while the channel is paused, a yield will not occur. Another mechanism to pause a profiling collection is to skip data collection in the service routine; in other words, an instruction to read the data is not invoked during a service routine while the collection is paused. The first mechanism, clearing the action bits, may result in less overhead than the second, as service routines are not executed. To stop collection completely, in some embodiments a single instruction to clear the valid bit in a channel may stop a profiling and/or counting collection. Once a channel's valid bit is cleared, that channel is free to be used by any other software.
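For illustration, pausing and resuming by toggling the action bits might be performed as follows; the position of the action bits within the EAX image, and the persistence of the other channel fields across re-programming, are assumptions.

        ; pause: locate the channel (as in the Table 5 sketch) and read
        ; its current state via EREAD, then re-program it
        and     eax, ~ACTION_BITS   ; clear action bits: count, do not yield
        EMONITOR
        ; ... collection is paused; the channel continues to count ...
        ; resume: set the action bits again
        or      eax, ACTION_BITS    ; restore action bits: yields resume
        EMONITOR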
If a service routine does a large amount of work, the service routine itself may be profiled. To profile a service routine, the YBB may be cleared during the execution of the service routine to allow the hardware to count and/or yield when a scenario triggers while the service routine executes. Two mechanisms can be used to clear the YBB. First, an instruction designed to write the YBB, e.g., the EWYB instruction in the x86 ISA, may be used to clear the YBB directly. Second, a different instruction, e.g., the ERET instruction in the x86 ISA, implicitly clears the YBB when it is invoked. The pseudo-code sequence of Table 7 illustrates how to clear the YBB before exiting a service routine in accordance with one embodiment.
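For illustration, the two mechanisms might appear in a service routine as follows; this is not the actual sequence of Table 7, and the operand form of EWYB is an assumption.

        EWYB    0                   ; mechanism 1: clear the YBB directly, so
                                    ; scenarios may count and/or yield during
                                    ; the remainder of the service routine
        ; ... profiled portion of the service routine ...
        ERET                        ; mechanism 2: ERET implicitly clears the
                                    ; YBB (if still set) upon return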
To profile a service routine, the channel may be reprogrammed to use a different scenario and/or a small sample-after value to ensure the channel yields within the execution of the profiled part of the service routine. Alternately, a second channel may be programmed with a small sample-after value as soon as the first channel yields. As soon as the YBB is cleared within the first channel's service routine, both channels would be active.
Many profile collection usage models allow scenarios to be multiplexed and/or the sample-after value used by a specific scenario to be modified at runtime. Other runtime modifications of channel state are also possible. To change a channel's state, the following sequence of operations may be implemented, in one embodiment: (1) set the YBB (in a multiple channel hardware implementation); (2) find the channel; (3) re-program the channel; and (4) clear the YBB (if set).
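For illustration, the four operations might be sequenced as follows, re-using the assumed register roles of the earlier sketches.

        EWYB    1                   ; (1) set the YBB to block yields
        ; (2) find the channel, e.g., by its unique service routine
        ;     address as in the Table 5 sketch (channel ID left in ECX,
        ;     service routine address assumed retained in EBX)
        mov     eax, NEW_SCENARIO_ID | ACTION_BITS   ; (3) re-program with a
        mov     edx, NEW_SAMPLE_AFTER                ;     new scenario and/or
        EMONITOR                                     ;     sample-after value
        EWYB    0                   ; (4) clear the YBB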
In addition, channels can be saved, re-programmed, and later restored to their original state. Thus the channel to be reprogrammed may have its state saved using, e.g., the EREAD instruction. After reprogramming and during execution, the software thread may be monitored during a specific code block or period of time. Upon completion of the monitoring, the YBB may be set, the reprogrammed channel found and the state restored, e.g., via the EMONITOR instruction using the values originally saved.
In many embodiments, two different types of scenarios exist: trap-like scenarios and fault-like scenarios. Trap-like scenarios execute their service routine after the instruction triggering the scenario has retired. Fault-like scenarios instead execute their service routines as soon as the scenario triggers, and then the instruction triggering the scenario is re-executed. Accordingly, in a fault-like scenario, the architectural register state before the scenario triggers is available for access during the service routine.
For example, the instruction mov eax, [eax] overwrites the original value of EAX during its execution. If a trap-like scenario triggers during execution of this instruction, the scenario's service routine will not be able to determine the value of EAX at the time the scenario triggered. But if a fault-like scenario triggered during this instruction, its service routine can determine the value of EAX at the time the scenario triggered.
If the trigger relates to a cache miss, for example, the address of the data that missed in the cache (i.e., the effective address) may be determined by using the architectural register state in effect before the instruction executed. Upon such a determination, a prefetch may be inserted to optimize the application to prefetch the data, avoiding the cache miss. In some embodiments, software to calculate the effective address in the case of a fault-like scenario may be optimized, as only the memory address is needed by the service routine, and hence there is no need to decode the entire instruction. Thus, rather than using a full instruction decoder, an address decoder may exploit regularity in the instruction set to construct the memory address and data size.
In one embodiment, a fast initial path in the address decoder looks in a table to determine an instruction's memory reference mode; various instructions of an instruction set share similar memory reference modes. For example, sets of instructions may request the same length of information, or may push or pop data off a stack, or the like. Accordingly, efficient linear address decoding may be provided based on instruction type. The table entry may further include information regarding data to be obtained from the instruction for use in decoding the address. The decoder then dispatches to a selected code fragment to construct the address for the faulting instruction. The table may be organized to ensure that common dispatch paths share cache lines, improving the efficiency of sequential decodes. Accordingly, in various embodiments an instruction may be efficiently decoded to obtain linear address information while ignoring an opcode portion of the instruction. The decoding may be performed rapidly in the context of the service routine itself (i.e., dynamically, in real time), significantly reducing the cost of performing the data collection and avoiding the expense of capturing and saving a significant amount of data for later full decoding, which is itself an expensive process. In some embodiments, the address information obtained may be used to insert a prefetch into the code or to place the data at a different location in memory to reduce the number of cache misses. Alternately, the address information may be provided to the application.
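For illustration, the fast path of such an address decoder might be organized as follows; the table names, labels, and saved-state layout are hypothetical.

        movzx   eax, byte [faulting_eip]    ; first opcode byte of the
                                            ; faulting instruction
        jmp     [dispatch_table + eax*4]    ; dispatch to the code fragment
                                            ; for that memory reference mode
decode_stack_op:                            ; fragment for push/pop forms:
        mov     eax, [saved_esp]            ; the address is simply the
        jmp     have_address                ; saved stack pointer
decode_base_disp:                           ; fragment for base+displacement
        movzx   edx, byte [faulting_eip+1]  ; forms: read the ModR/M byte and
        ; ... combine the saved base register and displacement ...
have_address:                               ; EAX = reconstructed linear
                                            ; address; the data size comes
                                            ; from the table entry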
Implementations may be used in architectures running managed run-time applications and server applications, as examples. Referring now to
First processor 470 and second processor 480 may be coupled to a chipset 490 via P-P interfaces 452 and 454, respectively. As shown in
In turn, chipset 490 may be coupled to a first bus 416 via an interface 496. In one embodiment, first bus 416 may be a Peripheral Component Interconnect (PCI) bus, as defined by the PCI Local Bus Specification, Production Version, Revision 2.1, dated June 1995 or a bus such as the PCI Express bus or another third generation input/output (I/O) interconnect bus, although the scope of the present invention is not so limited. As shown in
Collecting profiling information with the mechanisms described above allows for low-overhead, on-line profiling and dynamic compilation. Embodiments of the light-weight control yield mechanism and its application to user-level interrupts may thus bypass the OS entirely, enabling finer-grained communication and synchronization, in a way that is transparent to the OS. Thus in various embodiments, no OS support is needed to collect and use profile information, avoiding the OS for programming and taking interrupts. Accordingly, the yield mechanisms need no device drivers, no new OS application programming interfaces (APIs), and no new instructions in context switch code. Profile data obtained using embodiments of the present invention may be used for dynamic optimizations, such as re-laying out code and data and inserting prefetches.
Embodiments may be implemented in code and may be stored on a storage medium having stored thereon instructions which can be used to program a system to perform the instructions. The storage medium may be any of various media, such as disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.