This disclosure relates generally to data processors, and more specifically to power management for multi-threaded data processors.
Modern microprocessors for computer systems include multiple central processing unit (CPU) cores and run programs under operating systems such as Windows, Linux, the Macintosh operating system, and the like. An operating system designed for multi-core microprocessors typically distributes processing tasks by assigning different threads or processes to different CPU cores. Thus a large number of threads and processes can concurrently co-exist in multi-core microprocessors.
However, the threads and processes need to synchronize, and sometimes communicate, with each other to perform the overall task of the application. When a CPU core reaches a synchronization or communication point, known as a barrier, it waits until one or more other threads reach a corresponding barrier. While a CPU core is waiting at a barrier, it performs no useful work.
If all concurrent threads and processes reached their barriers at the same time, then no thread would be required to wait for another and all threads could proceed with the next operation. This ideal case is rarely encountered; typically some threads wait for other threads at barriers, and program execution is imbalanced. There are several reasons for the imbalance, including differences in computational power among CPU cores, imbalances in the software design of the threads, variations in the runtime environments of the CPU cores, hardware variations, and an inherent imbalance between the starting states of the CPU cores. The result of this performance imbalance is to limit the speed of execution of the application program while some threads idle at barriers waiting for other threads.
In the following description, the use of the same reference numerals in different drawings indicates similar or identical items. Unless otherwise noted, the word “coupled” and its associated verb forms include both direct connection and indirect electrical connection by means known in the art, and unless otherwise noted any description of direct connection implies alternate embodiments using suitable forms of indirect electrical connection as well.
A data processing system as described herein is a multi-threaded, multi-processor system that distributes power among processor resources such as APUs or CPU cores by observing whether processing resources are waiting at a barrier, and if so re-allocating power credits from those processing resources to other, still-active processing resources. The still-active processing resources can then complete their tasks in a shorter period of time, improving performance. As used herein, a power credit is a unit of power, equal to a fraction of a total power budget, that may be allocated to a resource such as a CPU core for a period of time.
In one form, such a data processing system includes a plurality of processor cores, each operable at a selected one of a plurality of performance states; a thread manager for assigning program threads to respective ones of the plurality of processor cores and synchronizing the program threads using barriers; and a power distributor, coupled to the thread manager and to the processor cores, for assigning a performance state to each of the plurality of processor cores within an overall power budget and, in response to detecting that a program thread assigned to a first processor core is at a barrier, decreasing the performance state of the first processor core and increasing the performance state of a second processor core that is not at a barrier while remaining within the overall power budget.
In another form, a data processing system includes a cluster manager and a set of node managers, one corresponding to each of a plurality of processor nodes. Each node includes a plurality of processor cores, each operable at a selected one of a plurality of performance states. The cluster manager assigns a node power budget to each node. Each node manager includes a thread manager and a power distributor. The thread manager assigns program threads to respective ones of the plurality of processor cores and synchronizes the program threads using barriers. The power distributor is coupled to the thread manager and to the processor cores. It assigns a performance state to each of the plurality of processor cores within the corresponding node power budget and, in response to detecting that a program thread assigned to a first processor core is at a barrier, decreases the performance state of the first processor core and increases the performance state of a second processor core that is not at a barrier while remaining within the node power budget.
Application layer 110 is responsive to any of a set of application programs 112 that interface to lower system layers through an application programming interface (API) 114. API 114 includes application and runtime libraries such as the Message Passing Interface (MPI) developed by the MPI Forum, the Open Multi-Processing (OpenMP) interface developed by the OpenMP Architecture Review Board, the Pthreads standard for creating and manipulating threads (IEEE Std 1003.1c-1995), Threading Building Blocks (TBB) developed by Intel Corporation, the Open Computing Language (OpenCL) developed by the Khronos Group, and the like.
Runtime system 120 generally includes a cluster manager 130 and a set of node managers 140. Cluster manager 130 provides overall system coordination and is responsible for maintaining the details of the processes running on all nodes in the cluster. Cluster manager 130 includes a process manager 134 that assigns processes to each of the nodes, and a cluster-level power distributor 132 that coordinates with process manager 134 to distribute power credits to each node. A node manager is assigned to each node in the cluster, such that an instance of the node manager runs on each node. Each node manager, such as representative node manager 150, includes a thread manager 154 that manages thread distribution within the node, and a node-level power distributor 152 that is responsible for determining the power budget for its node based on the number of CPU cores within the node. Cluster manager 130 and node managers 140 communicate initially to exchange power budget information, and thereafter exchange information at every budget change, e.g., when a thread reaches a barrier, as will be described further below.
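Purely as an illustration of this hierarchy (the disclosure does not specify any particular data structures or algorithms), an even credit split at both the cluster and node levels might look like the following sketch, in which all names are hypothetical:

```cpp
#include <cstddef>
#include <vector>

using PowerCredits = int;  // one credit: a fixed fraction of the total power budget

// Evenly divide a budget into n shares; the remainder (budget % n)
// goes one credit at a time to the first shares so no credit is lost.
std::vector<PowerCredits> distribute_evenly(PowerCredits budget, int n) {
    std::vector<PowerCredits> shares(n, budget / n);
    for (int i = 0; i < budget % n; ++i) shares[i] += 1;
    return shares;
}

// Cluster manager 130: split the overall budget across nodes.
// Each node manager 150: split its node budget across that node's cores.
std::vector<std::vector<PowerCredits>> initial_distribution(
        PowerCredits cluster_budget, const std::vector<int>& cores_per_node) {
    auto node_budgets = distribute_evenly(
        cluster_budget, static_cast<int>(cores_per_node.size()));
    std::vector<std::vector<PowerCredits>> core_budgets;
    for (std::size_t i = 0; i < cores_per_node.size(); ++i)
        core_budgets.push_back(distribute_evenly(node_budgets[i], cores_per_node[i]));
    return core_budgets;
}
```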
Platform layer 160 includes a set of processor resources for execution of the application programs. In one form, platform layer 160 includes a set of nodes 170 including a representative node 180. The interfaces in application layer 110 and runtime system 120 are designed to operate on a variety of hardware platforms and with a variety of processor resources.
A data processing system using runtime system 120 is able to handle power credit re-allocation automatically in hardware and software, and does not require source code changes for legacy application programs. Moreover, it improves the performance of applications that use barriers for process and/or thread synchronization within a given power budget. In some cases, it provides the opportunity to improve performance and save power at the same time, since processes and threads complete faster and do not require resources such as CPU cores to consume power while idling.
Process manager 210 includes a process wrapper 212 and a link-time API interceptor 214. Process wrapper 212 is a descriptor for each process existing in the system and includes a process identifier labeled “PID”, a map between the PID and its node labeled “PID_NodeID_Map”, a number of threads associated with the process labeled “# of Threads”, and a state descriptor, either Idle or Active, labeled “State”. These elements of process wrapper 212 are duplicated for each process in the system. Link-time API interceptor 214 is a software module that includes elements such as a process creation module, a barrier handler, and the like. The process creation module creates a library that is similar to MPI, Pthreads, etc. and imitates the signature of the original library. This duplicate library in turn links to and calls the APIs of the original library. This capability allows applications running in this environment to avoid source code changes, simplifying the task of programmers. The barrier handler facilitates communication between different processes waiting at a barrier.
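For illustration only, process wrapper 212 could be represented by a structure along these lines; the field names mirror the labels above, but the layout itself is an assumption rather than part of the disclosure:

```cpp
#include <cstdint>

enum class State { Idle, Active };

// Illustrative layout of process wrapper 212; one instance exists
// per process in the system.
struct ProcessWrapper {
    uint32_t pid;          // "PID": process identifier
    uint32_t node_id;      // "PID_NodeID_Map": node this PID is assigned to
    uint32_t num_threads;  // "# of Threads" associated with the process
    State    state;        // "State": Idle or Active
};
```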
Thread manager 220 includes components similar to those of process manager 210, including a thread wrapper 222 and a link-time API interceptor 224, as well as an additional dynamic thread-core affinity remapper 226. Thread wrapper 222 is a descriptor for each thread assigned to a corresponding node and includes a thread identifier labeled “TID”, a map between the TID and the specific core to which the thread is assigned labeled “TID_CoreID_Map”, and a state descriptor, either Idle or Active, labeled “State”. These elements of thread wrapper 222 are duplicated for each thread assigned to the corresponding node. Link-time API interceptor 224 includes elements such as a thread creation module that creates a library that is similar to MPI, Pthreads, etc. and imitates the signature of the original library. This duplicate library in turn links to and calls the APIs of the original library, again allowing applications to run without source code changes. Dynamic thread-core affinity remapper 226 uses processor affinity APIs provided by the operating system libraries to migrate a thread from one core to another. When the number of threads is greater than the number of cores, idle threads can become fragmented across different cores; by defragmenting such idle threads, thread manager 220 is able to better utilize the available cores and thus the available power credits.
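As a concrete (though non-limiting) example of the kind of processor affinity API the remapper could use, the following sketch pins a thread to a chosen core using the Linux pthread affinity extension; other operating systems expose equivalent calls:

```cpp
// Sketch of thread-core migration using the Linux processor-affinity
// API (pthread_setaffinity_np); error handling is minimal for brevity.
#define _GNU_SOURCE 1
#include <pthread.h>
#include <sched.h>

// Pin a thread to a single core, e.g. to defragment idle threads onto
// one core so that active threads keep the remaining cores (and their
// power credits) to themselves. Returns 0 on success.
int migrate_thread_to_core(pthread_t thread, int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set);
}
```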
Runtime system component 300 generally includes a power distributor 310 and a manager 320. Power distributor 310 is responsive to a user-defined or system-configured power budget to perform a distribution process that begins with step 312, which distributes an initial power budget to each node in the cluster (if runtime system component 300 is a cluster manager) or to each core in the node (if runtime system component 300 is a node manager). Subsequently, as the application starts and continues to run on the platform resources, power distributor 310 enters a loop that starts with step 314, in which it monitors budget change events responsive to inputs from manager 320. These events include the termination or idling of a thread or process, and a thread or process reaching a barrier. In response to such a budget change event, power distributor 310 proceeds to step 316, in which it re-distributes power credits. For example, when manager 320 signals that a thread is at a barrier, power distributor 310 claims power credits from the corresponding processor and re-distributes them to one or more active processors. By doing so, an active processor reaches its barrier faster and the barrier resolves sooner, resulting in better performance. After redistributing the power credits, power distributor 310 returns to step 314 and waits for subsequent budget change events.
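Read as code, the loop of steps 312, 314, and 316 might take the following shape; the event structure and the stubbed helpers are hypothetical stand-ins for the interfaces between power distributor 310 and manager 320:

```cpp
#include <queue>

// Hypothetical budget change events signaled by manager 320.
enum class EventKind { ThreadAtBarrier, ThreadTerminated, ThreadIdle };
struct BudgetChangeEvent { EventKind kind; int resource_id; };

std::queue<BudgetChangeEvent> g_events;  // fed by manager 320 (stub)

void distribute_initial_budget() { /* step 312: e.g. the even split sketched earlier */ }

void redistribute_credits(const BudgetChangeEvent& ev) {
    // step 316: reclaim credits from ev.resource_id and grant them to
    // still-active resources (see the redistribution sketch below).
}

void power_distributor_loop() {
    distribute_initial_budget();            // step 312
    while (!g_events.empty()) {             // step 314: monitor budget change events
        BudgetChangeEvent ev = g_events.front();
        g_events.pop();
        redistribute_credits(ev);           // step 316
    }
}
```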
Thus manager 320 identifies the processes/threads waiting at a barrier. These idle resources may be placed in the lowest P-state, a lower C-state, or even power gated. As they become idle, other processes/threads may still be actively executing. Manager 320 reallocates power credits from the resources associated with the idle processes/threads and transfers them to the active processes/threads to allow them to reach the barrier faster. For example, manager 320 can take the aggregate available power credits from idle resources and re-distribute them evenly across the remaining, active resources. As additional threads/processes reach the barrier, manager 320 performs this re-allocation iteratively until all the processes/threads reach the barrier. After that, the power credits are reclaimed and returned to their original owners. Manager 320 boosts the active processes/threads consistent with the power and thermal limits of the resource. In some embodiments, boosted threads can temporarily utilize non-sustainable performance states such as hardware P0 or boosted P0 states, instead of being limited to sustainable performance states such as software P0 states, as long as the total power remains within the overall node power budget.
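A minimal sketch of this even re-distribution policy, assuming integer power credits and hypothetical structure names, follows:

```cpp
#include <vector>

struct Core {
    int  baseline;    // credits originally assigned at step 312
    int  credits;     // current allocation
    bool at_barrier;  // set by manager 320 when this core's thread waits
};

// Pool the credits of cores idling at the barrier and split them evenly
// across the cores still running; the remainder goes to the first
// active cores so no credit is lost.
void redistribute(std::vector<Core>& cores) {
    int pool = 0, active = 0;
    for (auto& c : cores) {
        if (c.at_barrier) { pool += c.credits; c.credits = 0; }
        else ++active;
    }
    if (active == 0) return;  // everyone is at the barrier: nothing to boost
    int share = pool / active, extra = pool % active;
    for (auto& c : cores)
        if (!c.at_barrier) c.credits += share + (extra-- > 0 ? 1 : 0);
}

// When the barrier resolves, every core reverts to its original credits.
void reclaim(std::vector<Core>& cores) {
    for (auto& c : cores) c.credits = c.baseline;
}
```

Because credits are only moved and never created, the node power budget is respected by construction; a real implementation would additionally clamp each boosted core's allocation to that core's own power and thermal limits, as noted above.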
For example, a simple multi-threaded system may assign only one process (thread) to each node (core); in essence, there is a one-to-one mapping. In this case, as processes/threads become idle, their nodes/cores can be put in low power states in order to boost the frequency of the nodes/cores that correspond to active processes/threads.
Power allocation becomes much more complicated if there is a many-to-one mapping from processes/threads to nodes/cores. For example, if two threads are mapped to a core, it is possible that one thread is active while the other is idle at a barrier. In such a case, idle threads can be fragmented across different cores, leading to poor utilization of the power budget. Such a situation can be handled in the following way. First, the runtime system identifies an opportunity to defragment the idle threads. It then groups them such that all idle threads are mapped to a single core, while the active threads are evenly distributed across the remaining cores. In this way the active threads and their cores are able to borrow the maximum number of power credits and boost their performance to reach the barrier faster. Later, during power credit reclamation, the idle threads are remapped to their original cores as they become active. One downside to this approach is the added overhead of migration, such as additional cache misses as the runtime system moves threads to other cores; however, this overhead can be mitigated by deeper cache hierarchies.
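One purely illustrative way to perform the grouping (hypothetical structures; an actual implementation could migrate the threads with the affinity remapper shown earlier) is:

```cpp
#include <vector>

struct Thread {
    int  tid;
    int  core;        // current core assignment (TID_CoreID_Map)
    int  home_core;   // original assignment, restored during reclamation
    bool idle;        // waiting at a barrier
};

// Pack idle threads onto core 0 (an arbitrary choice) and spread active
// threads round-robin across cores 1..num_cores-1, so the active cores
// can absorb the freed power credits. Assumes at least two cores.
void defragment(std::vector<Thread>& threads, int num_cores) {
    int next = 1;
    for (auto& t : threads) {
        t.home_core = t.core;
        if (t.idle) t.core = 0;
        else { t.core = next; next = next % (num_cores - 1) + 1; }
    }
}
```

During reclamation, each thread would simply be migrated back to its recorded home core as it becomes active again.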
Next, at step 430, thread manager 154 detects that a first processor core is at a barrier. For example, assume CPU core 182 encounters a barrier. Thread manager 154 detects this condition and signals node-level power distributor 152, which is monitoring budget change events, that CPU core 182 has encountered a barrier. In response, node-level power distributor 152 re-distributes power credits between CPU core 182 and CPU core 184. It does this by decreasing the performance state of the first processor core in step 440, and increasing the performance state of a second processor core that is not at the barrier, e.g. CPU core 184, in step 450. In one example, both cores initially operate in the P2 state, and node-level power distributor 152 places CPU core 182, which is waiting at the barrier, into the P6 state while placing CPU core 184, which has not yet encountered the barrier, into the P0 state. CPU core 184 is thus able to reach its barrier faster. When CPU core 184 eventually reaches the barrier as well, runtime system 120 synchronizes the cores and resumes operation by again placing both CPU cores in the P2 state.
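Expressed with a hypothetical set_pstate() helper (the disclosure defines no such interface), the sequence in this example is:

```cpp
#include <cstdio>

// Hypothetical platform hook; a real system would program the core's
// clock frequency and power supply voltage for the requested P-state.
// Numerically higher P-states run slower and draw less power.
void set_pstate(int core, int pstate) {
    std::printf("core %d -> P%d\n", core, pstate);
}

void on_core_182_hits_barrier() {
    set_pstate(182, 6);  // step 440: idle core drops from P2 to P6
    set_pstate(184, 0);  // step 450: active core boosts from P2 to P0
}

void on_barrier_resolved() {
    set_pstate(182, 2);  // both cores return to the original
    set_pstate(184, 2);  // P2 operating point
}
```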
In other embodiments, a data processing system could be responsive to the progress of threads toward reaching a barrier. A node manager can monitor the progress of threads toward a common barrier, for example by checking the progress at certain intervals. If one thread is significantly ahead of other threads, the node manager can reallocate the power credits between the threads and the CPU cores running the threads to reduce the variability in completion times.
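For instance, in one possible (and purely hypothetical) policy, the node manager samples per-thread progress counters at fixed intervals and shifts a credit from the thread furthest ahead to the thread furthest behind:

```cpp
#include <algorithm>
#include <vector>

struct Worker {
    long progress;   // e.g. loop iterations completed toward the barrier
    int  credits;    // power credits of the core running this thread
};

// Called at fixed intervals: if the leader is far ahead of the
// straggler, move one credit from the leader's core to the straggler's
// to reduce the variability in completion times.
void rebalance(std::vector<Worker>& w, long threshold) {
    if (w.size() < 2) return;
    auto [lo, hi] = std::minmax_element(
        w.begin(), w.end(),
        [](const Worker& a, const Worker& b) { return a.progress < b.progress; });
    if (hi->progress - lo->progress > threshold && hi->credits > 0) {
        --hi->credits;   // slow the thread that is far ahead
        ++lo->credits;   // speed up the straggler
    }
}
```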
Although in the illustrated embodiment application layer 110 and runtime system 120 are software components and platform layer 160 is a hardware component, these three layers may be implemented with various combinations of hardware and software, such as with embedded microcontrollers. Some of the software components may be stored in a computer readable storage medium for execution by at least one processor.
While particular embodiments have been described, various modifications to these embodiments will be apparent to those skilled in the art. In the illustrated embodiment, each node includes two CPU cores and one GPU core. In other embodiments, each node could include more processor cores, and the composition of the processor cores could vary. For example, instead of two CPU cores and one GPU core, a node could include eight CPU cores. In another example, a node may comprise multiple die stacks of CPU, GPU, and memory. Moreover, other variables besides clock frequency and power supply voltage could define a performance state, such as whether dynamic power gating is enabled.
Accordingly, it is intended by the appended claims to cover all modifications of the disclosed embodiments that fall within the scope of the disclosed embodiments.