1. Field of the Invention
The present invention is directed in general to the field of data processing systems. In one aspect, the present invention relates to performance optimization within a data processing system. In yet another aspect, the present invention relates to a data processing system and method for dynamically prioritizing instruction thread execution to optimize processing of threads in a multiprocessor system.
2. Description of the Related Art
In multi-processor computer systems in which different system resources (such as CPUs, memory, I/O bandwidth, disk storage, etc.) are each used to operate on multiple instruction threads, significant challenges are presented for efficiently executing instruction threads so that the system resources are optimally used to run all workloads. These challenges only increase as the number and complexity of cores in a multiprocessor computer grow. Conventional processor approaches have attempted to address workload optimization at the various design phases (e.g., from high level abstract models to VHDL models) by simulating the processor operations for both function and performance, and then using the simulation results to design the scheduler or workload manager OS components that allocate system resources to workloads. However, because schedulers and workload managers are software components, the optimizations achieved by these components tend to address high-level performance issues that can readily be monitored by software. As a result, low-level performance issues, such as hardware allocation of shared resources among multiple threads, are not addressed by conventional software-only techniques of performance optimization. Another problem with such conventional system solutions is that there is very often no single a priori correct decision for how best to allocate system resources to individual instruction thread requests, such as steering a request from a core to another system resource, or deciding which request gets to memory first. When the “best” system resource allocation algorithm is selected for the majority of workloads, the result is tradeoffs that give priority to certain operations or requests at the expense of others. Such tradeoffs can affect all workloads being run on the system, and in some cases end up decreasing the efficiency of execution when the wrong priority is assumed for a given instruction stream.
Accordingly, there is a need for a system and method for determining how to prioritize instruction threads in a multiprocessor system so that workload operations on the system are optimized. In addition, there is a need for an instruction stream prioritization scheme which can be dynamically changed during system operation. Further limitations and disadvantages of conventional solutions will become apparent to one of skill in the art after reviewing the remainder of the present application with reference to the drawings and detailed description which follow.
A dynamic instruction prioritization system and methodology are provided for a multiprocessor system wherein instructions in a given thread or stream are referenced with a priority value so that the priority values for different threads can be used to efficiently allocate system resources for executing the instructions. By evaluating the performance for each instruction thread, the priority of an instruction stream can be dynamically moved up or down during the execution of a workload based on operating system or application priorities. Using a plurality of thread priority registers that are distributed at different locations throughout the multiprocessor system (e.g., L1 cache, L2 cache, L3 cache, memory controller, interconnect fabric, I/O controller, etc.), the priority value for an individual thread can be distributed throughout the multiprocessor system, or can be directed to particular resources in the system and not others in order to target thread behavior in particular functions. In this way, the thread priority may be retrieved from a thread priority register at each (selected) hardware unit as an instruction stream is executed so that decisions are efficiently made concerning data flow, order of execution, prefetch priority decisions and other complex tradeoffs. With the thread priority registers, the thread priority may be saved with the state of a thread whenever the thread is preempted by a higher priority request. By propagating the thread priority registers, the thread priority can be used not only at a core level in a multi-core chip, but also at a system level.
Selected embodiments of the present invention may be understood, and their numerous objects, features and advantages obtained, when the following detailed description is considered in conjunction with the following drawings, in which:
A method, system and program are disclosed for dynamically assigning and distributing priority values for instructions in a computer system based on one or more predetermined thread performance tests, and using the assigned instruction priorities to determine how resources are used in the system. To determine a priority level for a given thread, control software (e.g., the operating system or hypervisor) uses performance monitor events for the thread to evaluate or test the thread's performance and to prioritize the thread by applying a predetermined policy based on the evaluation. The test results may be used to optimize the workload allocation of system resources by dynamically assigning thread priority values to individual threads using any desired policy, such as achieving thread execution balance relative to thresholds and to performance of other threads, reducing thread response time, lowering power consumption, etc. In various embodiments, the assigned priority values for each thread are stored in thread priority registers located in one or more hardware locations in the processor system. This is done upon dispatch of a thread when the control software executes a store to a first thread priority register based on OS-level priorities for the process initiating the thread. After the priority value for a particular thread is stored to the first thread priority register, the priority value is distributed or copied to the other thread priority registers in the system. After that point, each hardware unit checks, as part of instruction execution for that thread, the thread-specific priority register for that hardware unit to determine the priority of the thread. As a result, any load or store or other fabric instruction generated by the instruction checks the local thread priority register for the instruction's priority value. Thus, as an instruction or command flows through the system, units that respond to those commands can retrieve the priority from the local thread priority register and decide on which commands to execute first.
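To make this flow concrete, the following minimal sketch (in C) models the dispatch-time store and the subsequent replication in software; the array layout, unit count and all names are illustrative assumptions, since the disclosure describes hardware registers rather than code.

    #include <stdint.h>

    #define NUM_THREADS 2
    #define NUM_UNITS   6   /* e.g., L1, L2, L3, memory controller, fabric, I/O */

    /* One priority entry per thread id, replicated at each hardware unit. */
    static uint8_t thread_priority[NUM_UNITS][NUM_THREADS];

    /* Upon dispatch of a thread, control software stores the OS-level
       priority to a first thread priority register; the value is then
       distributed (copied) to the remaining units. The priority would
       likewise be saved and restored with the thread state on preemption. */
    void dispatch_thread(int tid, uint8_t os_priority) {
        thread_priority[0][tid] = os_priority;
        for (int u = 1; u < NUM_UNITS; u++)
            thread_priority[u][tid] = thread_priority[0][tid];
    }

    /* Each hardware unit consults its local copy during instruction
       execution to decide, e.g., which of two commands to execute first. */
    uint8_t local_priority(int unit, int tid) {
        return thread_priority[unit][tid];
    }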
Various illustrative embodiments of the present invention will now be described in detail with reference to the accompanying figures. It will be understood that the flowchart illustrations and/or block diagrams described herein can be implemented in whole or in part by dedicated hardware circuits, firmware and/or computer program instructions which are provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions (which execute via the processor of the computer or other programmable data processing apparatus) implement the functions/acts specified in the flowchart and/or block diagram block or blocks. In addition, while various details are set forth in the following description, it will be appreciated that the present invention may be practiced without these specific details, and that numerous implementation-specific decisions may be made to the invention described herein to achieve the device designer's specific goals, such as compliance with technology or design-related constraints, which will vary from one implementation to another. While such a development effort might be complex and time-consuming, it would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure. For example, selected aspects are shown in block diagram form, rather than in detail, in order to avoid limiting or obscuring the present invention. In addition, some portions of the detailed descriptions provided herein are presented in terms of algorithms or operations on data within a computer memory. Such descriptions and representations are used by those skilled in the art to describe and convey the substance of their work to others skilled in the art.
Referring now to
As further depicted in
The processing units communicate with other components of system 100 via a system interconnect or fabric bus 50. Fabric bus 50 is connected to one or more service processors 60, a system memory device 61, a memory controller 62, a shared or L3 system cache 66, and/or various peripheral devices 69. A processor bridge 70 can optionally be used to interconnect additional processor groups. Though not shown, it will be understood that the data processing system 100 may also include firmware which stores the system's basic input/output logic, and seeks out and loads an operating system from one of the peripherals whenever the computer system is first turned on (booted).
As depicted in
As disclosed herein, the locally-stored thread priority values may be used by the system resource to choose between competing requests from different threads. To this end, each system resource may also include an arbiter circuit which takes the requests and, incorporating the priorities in the thread priority register, chooses one of the requests to access the system resource. Thus, each L1 cache includes an L1 arbiter (e.g., 17a, 17b, 47a, 47b), each L2 cache includes an L2 arbiter (e.g., 13, 43), the L3 cache includes an L3 arbiter 67, the interconnect bus includes an interconnect arbiter 51, and the memory controller includes an MC arbiter 63. With this structure, the thread priority register 1 is replicated around the system 100 in the various hardware resources. Each thread priority register 18a, 18b, 48a, 48b, 14, 44, 51, 64, 68 shows at least two threads with ids {0, 1} which have corresponding priority levels of {A, B}.
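A minimal sketch of the arbitration decision, continuing the software model above and assuming an invented request_t type (the actual arbiters are hardware circuits), might look like:

    #include <stdint.h>

    typedef struct { int thread_id; /* plus address, type, etc. */ } request_t;

    /* Given competing requests at a resource, grant the one whose thread id
       maps to the highest value in the local thread priority register;
       local_tpr is that unit's copy (e.g., register 14 at an L2 cache). */
    int arbitrate(const request_t *reqs, int n, const uint8_t *local_tpr) {
        int winner = 0;
        for (int i = 1; i < n; i++)
            if (local_tpr[reqs[i].thread_id] > local_tpr[reqs[winner].thread_id])
                winner = i;
        return winner;   /* ties fall to the earlier request; this tie-break
                            policy is an assumption, not from the disclosure */
    }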
The system memory device 61 (random access memory or RAM) stores program instructions and operand data used by the processing units, in a volatile (temporary) state, including the operating system 61A and application programs 61B. In addition, the thread priority adjustment module 61C may be stored in the system memory in any desired form, such as an operating system module, hypervisor component, etc., and is used to control the initial priority in the thread priority register of a first processor core (e.g., 16a), which may be lazily propagated through the system 100. Priority does not always have to be precise, and propagation can take as many cycles as necessary. Another network could propagate the thread priority register 44 from another processor core (e.g., 46b) or any other element. Also, priorities can be directed to particular registers in the system and not others in order to target thread behavior in particular functions. Although illustrated as a facility within system memory, those skilled in the art will appreciate that the thread priority adjustment module 61C may alternatively be implemented within another component of data processing system 100. The thread priority adjustment module 61C is implemented as executable instructions, code and/or control logic, including programmable registers, operative to check performance monitor information for threads running on the system 100 and to assign priority values to each thread using predetermined policies, the assigned values being distributed and stored across the system 100 using thread priority registers 1, as described more fully below.
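Continuing the sketch above, lazy propagation and the targeting of particular registers might be modeled as follows (again, the structure and names are assumptions):

    /* Lazy propagation: rather than an atomic broadcast, the updated value
       can ripple outward one unit per cycle, since priority need not be
       precise at every instant. */
    void propagate_step(int tid) {
        for (int u = NUM_UNITS - 1; u > 0; u--)
            thread_priority[u][tid] = thread_priority[u - 1][tid];
    }

    /* Targeted update: a priority can be directed to one unit's register
       only (e.g., just the L2 cache) to influence thread behavior in a
       particular function without changing it elsewhere. */
    void set_priority_at(int unit, int tid, uint8_t prio) {
        thread_priority[unit][tid] = prio;
    }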
Those skilled in the art will appreciate that data processing system 100 can include many additional or fewer components, such as I/O adapters, interconnect bridges, non-volatile storage, ports for connection to networks or attached devices, etc. Because such components are not necessary for an understanding of the present invention, they are not illustrated in
Referring now to
While any desired circuit design may be used to implement the functional logic for the thread priority register 204,
The disclosed thread priority register may be located at individual hardware units and used to store priority tag values for instructions in a particular thread that are used by system resources to help make the right system allocation decisions. As an example embodiment, a plurality of thread priority registers are allocated in hardware for every thread that can execute in the system, such that registers are located at a plurality of hardware locations. Upon dispatch of a thread, priority control logic (e.g., in the hypervisor or OS) executes a store to the thread priority registers based on OS-level priorities for the process initiating the thread, and as a result, every instruction from a thread that is fetched has an associated priority value that is locally stored in a thread priority register. With thread priority registers distributed throughout the system in or near any of the system resource locations where instructions from the thread are executed, an instruction or command can flow through the system with a specific priority, and individual hardware resource units can respond to the instruction/commands by using the assigned priority values to decide which instruction/commands to execute first. Specific examples of hardware unit tradeoffs that could be made include decisions concerning data flow, order of execution, and prefetch priority.
In selected embodiments, separate thread priority registers may be located near any system resource that can be granted access by multiple requesters. Examples of possible locations in the processor system for separate thread priority registers are set forth below in Table 1, which lists candidate locations along with corresponding example actions being requested at each location.
To illustrate how the thread priority registers may be located and used in different hardware resources,
In operation, the arbiter module 488 tracks and manages the allocation and availability of at least the resources (e.g., execution units, rename and architected registers, cache lines, etc.) within processing core 400 by using a locally-stored thread priority register (TPR) 481 which records the priority values assigned to instructions in each instruction thread being executed by the processing core 400. By storing the assigned thread priority tag values in the TPR 481, any load or store or other fabric instruction generated by an instruction also inherits that priority tag value, since it will have the same thread id as its parent. Alternatively, when the thread id already exists as part of instruction execution, operations in the system simply check the thread-specific priority register (or distributed copies of it) to determine the priority of a thread. In the depicted thread priority register 481, two threads are shown with thread ids {0, 1} and corresponding priority levels of {A, B}. Using the priority values assigned to each thread and stored in the TPR 481, the arbiter module 488 allocates resources to instruction threads so that the execution units, registers and cache required for execution are allocated to the prioritized instructions. As the arbiter module 488 allocates resources needed by particular instructions buffered within instruction buffer 482 by reference to thread priority register 481, dispatcher 484 within ISU 450 dispatches the instructions from instruction buffer 482 to execution units 460-468, possibly out of program order, based upon instruction type. Thus, condition-register-modifying instructions and branch instructions are dispatched to condition register unit (CRU) 460 and branch execution unit (BEU) 462, respectively; fixed-point and load/store instructions are dispatched to fixed-point unit(s) (FXUs) 464 and load-store unit(s) (LSUs) 466, respectively; and floating-point instructions are dispatched to floating-point unit(s) (FPUs) 468. After possible queuing and buffering, the dispatched instructions are executed opportunistically by execution units 460-468.
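A short sketch of this inheritance, with an invented fabric_op_t type (in hardware, the mechanism is simply that the generated operation carries its parent's thread id):

    #include <stdint.h>

    typedef struct { int thread_id; uint64_t addr; } fabric_op_t;

    /* A load, store or other fabric operation generated during execution
       carries its parent instruction's thread id, so any unit it reaches can
       recover the priority from its local register copy; no per-instruction
       priority tag bits are needed. */
    fabric_op_t gen_fabric_op(int parent_tid, uint64_t addr) {
        fabric_op_t op = { parent_tid, addr };
        return op;   /* later: local_priority(unit, op.thread_id) */
    }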
During execution within one of execution units 460-468, an instruction may receive input operands, if any, from one or more architected and/or rename registers within a register file 470-474 coupled to the execution unit. Data results of instruction execution (i.e., destination operands), if any, are similarly written to register files 470-474 by execution units 460-468. For example, FXU 464 receives input operands from and stores destination operands to general-purpose register file (GPRF) 472, FPU 468 receives input operands from and stores destination operands to floating-point register file (FPRF) 474, and LSU 466 receives input operands from GPRF 472 and causes data to be transferred between L1 D-cache 418 and both GPRF 472 and FPRF 474. In transferring data to the L1 D-cache 418, a shared data memory management unit (DMMU) 480 may be used to manage virtual to physical address translation. When executing condition-register-modifying or condition-register-dependent instructions, CRU 460 and BEU 462 access control register file (CRF) 470, which contains a condition register, link register, count register and rename registers for each. BEU 462 accesses the values of the condition, link and count registers to resolve conditional branches and obtain a path address, which BEU 462 supplies to instruction sequencing unit 450 to initiate instruction fetching along the indicated path. After an execution unit finishes execution of an instruction, the execution unit notifies ISU 450, which schedules completion of instructions in program order. Arbiter module 488 also updates TPR 481 to reflect the release of the resources allocated to the completed instructions.
To provide an additional illustration of how a thread priority register may be used at a particular hardware resource to choose between competing requests being made of the resource,
While any desired circuit design may be used to implement the functional logic for the L2 cache arbiter 505,
As an instruction stream executes, a thread priority adjustment control may be implemented in the OS, in the hypervisor or in an application to dynamically adjust the priority of individual threads. Since the OS already has mechanisms to keep track of priorities and to allow the application or user to adjust them, these same priorities can be used to bias the thread priority. Alternatively, the thread priority adjustment control can monitor the performance status of individual threads and, upon determining that a change in priority is warranted, can adjust the priority value(s) stored in the thread priority register up or down, thereby affecting the performance of the particular thread. An example of a thread priority adjustment control module 61C is depicted in
To assist with the dynamic prioritization of the threads, a hardware (HW) monitor (e.g., HW monitor 486 in
By providing the performance parameters to the thread priority adjustment control, any of a variety of predetermined policies may be applied to revise the thread priorities based on system conditions. For example, when prompted, the OS/hypervisor code implementing the thread priority adjustment control checks performance status information for a thread and compares this information to thresholds or to performance status information for other threads. Based on this comparison, the OS/hypervisor code resets priorities in the thread priority registers. Set forth below in Table 2 is a listing of various performance tests that can be run on individual threads, along with a corresponding policy for adjusting the thread's priority.
The contemplated tests or comparisons listed in Table 2 are used to achieve thread execution balance relative to thresholds and to the performance of other threads. However, in other embodiments the goal may be reducing thread response time, lowering power consumption, etc.
Using the thread priority adjustment control, the priority for a particular thread id may be set by having the thread priority adjustment control execute code to check the performance status information provided by the hardware monitor(s). For purposes of illustration, example pseudocode is shown below which the OS/Hypervisor could use to check the performance status information for threads and to assign priorities by setting the thread priority register values.
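The following is a representative sketch of such pseudocode, assuming two threads with ids 0 and 1; the threshold names and the direction of each adjustment are illustrative policy choices rather than details taken from the disclosure:

    /* CPI(), L2_CACHE_MISSES() and BRANCH_PREDICTABILITY() return performance
       status for a thread id; SET_PRIORITY() writes a priority value to the
       thread priority register(s) named by its last argument. */
    if (CPI(0) > CPI_THRESHOLD && CPI(0) > CPI(1))
        SET_PRIORITY(0, PRIORITY_HIGH, ALL_UNITS);     /* boost the stalled thread everywhere */
    if (L2_CACHE_MISSES(1) > L2_MISS_THRESHOLD)
        SET_PRIORITY(1, PRIORITY_LOW, L2_CACHE_UNIT);  /* target only L2 behavior, e.g., prefetch */
    if (BRANCH_PREDICTABILITY(0) < BP_THRESHOLD &&
            BRANCH_PREDICTABILITY(0) < BRANCH_PREDICTABILITY(1))
        SET_PRIORITY(0, PRIORITY_LOW, CORE_UNIT);      /* yield core resources when poorly predicted */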
In the example pseudocode, the CPIs, cache misses, and branch predictabilities of the threads are compared to thresholds and to each other to determine priorities. This pseudocode also shows the targeting of particular functions based on the comparison results, where CPI( ), L2_CACHE_MISSES( ) and BRANCH_PREDICTABILITY( ) are functions that return the performance status information, and SET_PRIORITY( ) is a function that sets the particular register priority values using the parameters input to the function.
To illustrate selected embodiments of the present invention,
To further illustrate selected embodiments of the present invention,
In accordance with various embodiments disclosed herein, instructions from different instruction threads may be prioritized in a data processing system under software control using the methodologies and/or apparatuses described herein, which may be implemented in a data processing system with computer program code comprising computer executable instructions. In whatever form implemented, a first priority value is assigned to a first instruction thread and a second priority value is assigned to a second instruction thread. These priority values are then stored in a first thread priority register and then replicated to a plurality of thread priority registers located in the data processing system, such as in the L1 cache memory, L2 cache memory, L3 cache memory, memory controller, execution unit, interconnect bus, or interconnect controller. In selected embodiments, the priority values may be replicated by allocating a plurality of thread priority registers in hardware for every thread that can execute in the data processing system, and then lazily propagating priority values from the first thread priority register through the plurality of thread priority registers. In each thread priority register, a first priority value is stored for instructions from a first instruction thread and a second priority value is stored for instructions from a second instruction thread. When a request from a first instruction in the first instruction thread is presented to access the first hardware resource, the first hardware resource is allocated based on the first priority value retrieved from the local thread priority register. For example, if the first hardware resource is presented with competing requests from instructions in the first and second instruction threads, the first hardware resource is allocated by comparing first priority value to the second priority value so that the instruction thread with the higher priority is given access to the hardware resource. Examples of hardware allocation results include, but are not limited to, selecting a core load or prefetch request from the first instruction thread to be performed before performing a request from another instruction thread when the first instruction thread has a higher priority value. By replicating the priority values in a plurality of thread priority registers located in a corresponding plurality of hardware resources in the data processing system, the instruction prioritization benefits can be extended to other resources in the data processing system. In addition, performance status information for an instruction thread may be monitored and used to adjust a priority value for that thread, such as by applying a policy to achieve thread execution balance between the first instruction thread and at least one additional instruction thread. For example, the performance status information may be monitored by measuring a cycles per instruction parameter, a cache miss parameter, a branch predictability parameter, a core stall parameter, a prefetch hit parameter, a load/store frequency parameter, an FXU instruction parameter, an FPU instruction parameter, an application indicator parameter or a core utilization parameter.
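As one illustrative rendering of that status information, a hardware monitor might expose a per-thread record such as the following; the struct and its field names are assumptions based on the parameters listed above.

    #include <stdint.h>

    /* Per-thread performance status as it might be reported by a hardware
       monitor; each field corresponds to a parameter named above. */
    typedef struct {
        double   cycles_per_instruction;
        uint64_t cache_misses;
        double   branch_predictability;
        uint64_t core_stall_cycles;
        uint64_t prefetch_hits;
        double   load_store_frequency;
        uint64_t fxu_instructions;
        uint64_t fpu_instructions;
        int      application_indicator;
        double   core_utilization;
    } perf_status_t;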
As will be appreciated by one skilled in the art, the present invention may be embodied in whole or in part as a method, system, or computer program product. In particular, the use of multiple thread priority registers to store and distribute thread priority values works well for lightly threaded core architectures by avoiding the need to add extra tag bits to each instruction for priority values, as well as the processing overhead at each hardware unit of extracting the priority values from the instruction. Thus, in the case of heavier designs (like POWER6/7, Intel or AMD), relatively few threads are implemented per core, and as a consequence it may be less costly to maintain multiple thread priority registers or tables than to add extra tag bits to instructions, which would require wider system/fabric busses. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product on a computer-usable storage medium having computer-usable program code embodied in the medium. For example, the functions of adjusting the thread priority levels by applying policies to detected performance conditions at the hardware resources may be implemented in software that is centrally stored in system memory or executed as part of the operating system or hypervisor.
The foregoing description has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto. The above specification and example implementations provide a complete description of the manufacture and use of the composition of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.