1. Technical Field
The present invention relates generally to data processing systems and specifically to multiprocessor data processing systems. Still more particularly, the present invention relates to load balancing among processors of a multiprocessor data processing system.
2. Description of the Related Art
In order to more efficiently complete execution of software code, processors of most conventional data processing systems process code as threads of instructions. With multiprocessor data processing systems (MDPS), threads are utilized to enable definable division of labor amongst various processors when processing code. Multiple threads may be processed by a single processor and each processor may simultaneously process a different thread. Those skilled in the art are familiar with the use of threads and scheduling of threads of instructions for execution on processors.
The processors in MDPS operate in concert with each other to complete the various tasks performed by the data processing system. These tasks are assigned to specific processors or shared among the processors. Because of various factors, it is quite common for the processing loads shared among the processors to be unevenly distributed. In fact, in some instances, one processor in the MDPS may be idle (i.e., not currently processing any threads) while another processor in the MDPS is very busy (i.e., assigned to process several threads).
Current load balancing algorithms in AIX allow an idle (second) processor to “steal” a thread from an adequately busy first processor. When this stealing of a thread is completed, the thread's run queue assignment (i.e., the processor queue to which the thread is assigned for execution) is changed, so that the stolen thread becomes semi-permanently assigned to the stealing processor. The stolen thread will then have a strong tendency to be serviced by this processor in the future. With the conventional algorithm/protocol for stealing threads, the initial dispatch(es) of the thread's instructions on the stealing processor typically encounters extra cache misses, although subsequent re-dispatches eventually become efficient.
Because the thread stealing algorithm causes extra cache misses during the initial dispatch(es), conventional algorithms have introduced a stealing “barrier” that prevents stealing threads from processors that are not overloaded (or not close to being overloaded). This use of a stealing barrier trades off wasted processor cycles against inefficient utilization of processor cycles, which may result from overly aggressive thread stealing, by perhaps leaving an idle processor in an idle state.
The newer POWER™ processor models potentially have an additional penalty when stealing threads. This additional penalty is caused because of the multi-chip-module (MCM)-based architecture utilized in designing the POWER processor models. In POWER processor design, an MCM is a small group of processors (e.g., four processors) that share L3 cache and physical memory. MCMs may be connected to other MCMs in a larger system that provides enhanced processing capabilities.
Because of the shared cache and memory configuration for processors of an MCM, stealing threads within an MCM (i.e., stealing from a first processor of a first MCM by a second processor of the same, local MCM) is more desirable than stealing from a processor in a second, non-local MCM. With the advent of new memory affinity controls for processes in AIX 5.3, for example, an executing process may have its memory pages backed in storage local to the MCM, making it especially desirable to limit stealing to within the MCM.
Further, it is well known that allowing stealing more freely will seriously impact the stolen thread's memory locality and cause noticeable degradation of performance for the stolen thread. The degradation of performance caused by stealing threads (as well as other negative effects of stealing threads) is even more pronounced when the thread is stolen from another MCM. Thus, while restricting cross-MCM thread stealing may result in more wasted cycles on idle processors, allowing cross-MCM thread stealing leads to measurable degradation to the threads involved. This degradation is in part due to long term remote execution and inconsistent performance for that thread. Stealing threads across MCMs is, therefore, particularly undesirable.
Some developers have suggested an approach called “remote execution.” In some instances, an entire process created at a home node (MCM) is off-loaded to a remote node (MCM) for an extended period of time and may eventually be moved back to the home node (MCM). Often, all of the memory objects of the process are later moved to the new node (which then becomes the home node). While the time frame for moving the memory objects may be delayed with this method, the method introduces the same penalties as up-front stealing of threads across MCMs or running threads for extended periods on a remote MCM while the thread's memory objects are at a different home MCM.
Consequently, the present invention recognizes that a new mechanism is desired that will allow idle processor cycles to be used without permanent degradation to the threads assigned to these idle cycles. A new load balancing algorithm for MCM-to-MCM balancing that prevents long term degradation to the threads involved would be a welcomed improvement. These and other benefits are provided by the invention described herein.
A method and system are disclosed that enable efficient load balancing between a first processor with idle processor cycles in a first MCM (multi-chip module) and a second busy processor in a second MCM, without significant degradation to the thread's execution efficiency when allocated to the idle processor cycles. The invention is applicable to a multiprocessor data processing system (MDPS) that includes two or more multi-chip modules (MCMs) and a load balancing algorithm that supports both stealing and borrowing of threads across MCMs.
An idle processor is allowed to “borrow” a thread from a busy processor in another memory domain (i.e., across MCMs). The thread is borrowed for a single dispatch cycle at a time. When the dispatch cycle is completed, the thread is released back to its parent processor. If it is determined that the borrowing processor will become idle after the dispatch cycle, the borrowing processor re-scans the entire MDPS for another thread to borrow.
The next borrowed thread may come from the same lending processor or from another busy processor. Also, the lending processor may loan a different thread to the borrowing processor. Thus, the allocation algorithm does not “assign” a thread to another MCM. Rather the thread is run on the other MCM for a single dispatch cycle at a time, and execution of the thread is immediately returned to the home (lending) processor at the other MCM.
By causing the borrowing processor to release the thread and then rescan the entire MDPS, the algorithm substantially diminishes the likelihood that any single thread will run continuously on a particular borrowing processor. Accordingly, the algorithm also substantially diminishes the likelihood that any performance penalty will accumulate against the borrowed thread caused by loss of memory locality since any new memory objects created by the borrowed thread will be allocated locally with respect to its home MCM.
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.
The invention itself, as well as a preferred mode of use, further objects, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The present invention provides a method and system that enables efficient load balancing between a first processor with idle processor cycles in a first MCM (multi-chip module) and a second busy processor in a second MCM, without significant (long term) degradation to the thread's execution efficiency when allocated to the idle processor cycles. The invention is applicable to a multiprocessor data processing system (MDPS) that includes two or more multi-chip modules (MCMs) and a load balancing algorithm that supports both stealing and borrowing of threads across MCMs.
As utilized herein, the term “idle” refers to a processor that is not presently processing any threads or does not have any threads assigned to its thread queue. “Busy” in contrast refers to a processor with several threads scheduled for execution within the processor's thread queue. This parameter may be defined within the load balancing algorithm as a specific number of threads (e.g., 4 threads) within the processor's thread queue. Alternatively, the busy parameter may be defined based on a calculated average across the MDPS during processing, where a processor that is significantly above the average is labeled as busy, relative to the other processors. The load balancing algorithm maintains (or attempts to maintain) a smoothed average load value, determined by repeatedly sampling the queue lengths of each processor.
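The smoothed load average and the two definitions of "busy" given above can be illustrated with a minimal sketch. The exponential smoothing scheme, the class and function names, and the parameters alpha, busy_threads, and busy_factor are assumptions made for illustration only; the description does not specify how the average is smoothed or what counts as significantly above average.

```python
from dataclasses import dataclass


@dataclass
class LoadTracker:
    """Smoothed load average for one processor (hypothetical helper)."""
    alpha: float = 0.25   # assumed weight given to each new sample
    average: float = 0.0

    def sample(self, queue_length: int) -> float:
        # Exponential smoothing: move the average toward the new sample.
        self.average += self.alpha * (queue_length - self.average)
        return self.average


def classify(queue_length: int, system_average: float,
             busy_threads: int = 4, busy_factor: float = 1.5) -> str:
    """Label a processor "idle", "busy", or "normal" (assumed labels)."""
    if queue_length == 0:
        return "idle"
    # Busy by a fixed thread count, or by being well above the system average.
    if queue_length >= busy_threads or queue_length > busy_factor * system_average:
        return "busy"
    return "normal"
```

Repeatedly calling sample() with the current queue length of a processor yields a value that follows sustained load while damping short spikes, which is one plausible reading of a smoothed average determined by repeated sampling.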
An idle processor is allowed to “borrow” a thread from a busy processor in another memory domain (i.e., across MCMs). The thread is borrowed for a single dispatch cycle at a time. When the dispatch cycle is completed, the thread is released back to its parent processor. If it is determined that the borrowing processor will become idle after the dispatch cycle, the borrowing processor re-scans the entire MDPS for another thread to borrow.
The next borrowed thread may come from the same lending processor or from another busy processor. Also, the lending processor may loan a different thread to the borrowing processor. Thus, the allocation algorithm does not “assign” a thread to another MCM. Rather the thread is run on the other MCM for a single dispatch cycle at a time, and execution of the thread is immediately returned to the home (lending) processor at the other MCM.
By causing the borrowing processor to release the thread and then rescan the entire MDPS, the algorithm substantially diminishes the likelihood that any single thread will run continuously on a particular borrowing processor. Finally, all references made to memory objects by the borrowed thread are resolved with memory local to the lending MCM, not to the MCM actually executing the borrowed thread. The borrowed thread remains optimized for future execution on its “home” MCM. Accordingly, the algorithm also substantially diminishes the likelihood that any performance penalty will accumulate against the borrowed thread caused by loss of memory locality since the process does not require cross-MCM migration of memory objects when it runs on its home MCM.
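The difference between stealing and borrowing described above can be sketched as follows. This is an illustrative model only, not the implementation of the invention: the Thread and Processor types, the home_cpu field, and the choice to return the borrowed thread to the back of the lender's queue are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Thread:
    name: str
    home_cpu: str   # run queue assignment; changed only by a steal


@dataclass
class Processor:
    name: str
    mcm: str
    run_queue: deque = field(default_factory=deque)


def steal(thief: Processor, victim: Processor) -> Thread:
    """Move a thread and reassign its run queue to the stealing processor."""
    thread = victim.run_queue.popleft()
    thread.home_cpu = thief.name
    thief.run_queue.append(thread)
    return thread


def borrow_one_cycle(borrower: Processor, lender: Processor, run) -> Thread:
    """Run one dispatch cycle on the borrower, then release the thread."""
    thread = lender.run_queue.popleft()
    try:
        run(thread, borrower)   # a single dispatch cycle on the borrowing processor
    finally:
        # The run queue assignment never changed, so the thread goes back home.
        lender.run_queue.append(thread)
    return thread
```

In this model a borrowed thread never appears in the borrower's run queue, which mirrors the statement that the allocation algorithm does not assign the thread to another MCM.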
With reference now to the figures and in particular to
MCM1 110 is connected to MCM2 120 via a switch 105. Switch 105 is a collection of connection wires that, in one embodiment, enables each processor of MCM1 110 to directly connect to each processor of MCM2 120. Switch 105 also connects memory 130, 131 to its respective local MCM (as well as to the non-local MCM).
During operation of MDPS 100, each processor (or central processing unit (CPU)) is assigned an execution queue (or thread queue) 140 within which threads (labeled Th1 . . . Thn) are scheduled for execution by the particular processor. At any given time during processing, the number of threads (i.e., load) being handled (sequentially executed) by any one of the processors may be different from the number of threads (load) being handled by another processor. Also, the overall load of one MCM (e.g., MCM1 110) may be very different from that of the other MCM (MCM2 120). An indication of the relative load of each processor is provided in
Thus, as illustrated, processors P1 and P4 of MCM1 110 have long queues with four (or more) threads scheduled, and P1 and P4 are labeled as “busy”. Processors P2 and P3, also of MCM1 110, and processor P5 of MCM2 120 have medium length queues (with two threads scheduled), and P2, P3, and P5 are labeled as “average”. Processors P7 and P8 of MCM2 120 are labeled as “low” since they have short queues with only one thread scheduled each. Finally, processor P6 of MCM2 120 has an empty queue (i.e., no threads scheduled), and P6 is labeled as idle.
The specific thread counts provided herein are for illustration only and not meant to imply any limitations on the invention. Specifically, while an idle processor is described as having no threads assigned thereto, it is understood that the threshold for determining which processor has idle cycles and is a candidate for borrowing (or stealing) threads is set by the load balancing algorithm implemented within the particular MDPS. This threshold may be a processor with two or three or ten threads scheduled, depending to some extent on the depth of the thread queues and operating parameters of the processor(s). However, the illustrative embodiment assumes that a borrowing/stealing processor borrows (or steals) a thread only when the borrowing/stealing processor's “run queue” is empty. The load average is then used at such instants to decide whether to allow the processor to borrow (or steal) a thread from another processor.
Notably, the overall load of (i.e., number of threads executing on) MCM2 120 is significantly lower than that of MCM1 110. This imbalance is utilized to describe the load balancing process of the invention to relieve the load imbalances, specifically to relieve busy processor P1, without causing any significant long-term deterioration in the threads' execution efficiency. The description of the present invention is thus presented to address a load imbalance across MCMs by implementing a borrowing algorithm, where appropriate, based on a load balancing analysis that takes into account the load relief available via a stealing algorithm.
Accordingly, a significant load average difference between two MCMs is used to determine when stealing is allowed. Lacking such a significant imbalance, borrowing will be allowed if the borrowing node has significant idle time (i.e., relatively small load average per processor) and the lending node does not have significant idle time. If a node has significant idle time, stealing of threads is done locally and no cross-MCM borrowing is performed.
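The decision rule of the preceding paragraph can be sketched as a small function. The function name, the returned labels, and the numeric thresholds steal_gap and idle_load are hypothetical; the description states only that a significant load average difference permits stealing and that borrowing depends on which node has significant idle time. The sketch also assumes that the last sentence of the paragraph refers to the lending node.

```python
def choose_action(borrower_node_load: float, lender_node_load: float,
                  steal_gap: float = 1.5, idle_load: float = 0.5) -> str:
    """Pick a cross-MCM action from per-processor load averages of two nodes.

    steal_gap and idle_load are assumed tuning values, not taken from the text.
    """
    # A significant imbalance between the two MCMs permits stealing.
    if lender_node_load - borrower_node_load >= steal_gap:
        return "steal"
    borrower_has_idle_time = borrower_node_load <= idle_load
    lender_has_idle_time = lender_node_load <= idle_load
    # Otherwise borrow only if the borrower has spare cycles and the lender does not.
    if borrower_has_idle_time and not lender_has_idle_time:
        return "borrow"
    # A lender with idle time of its own balances locally; nothing crosses MCMs.
    return "local_only"
```

For example, under these assumed thresholds a borrowing node averaging 0.25 threads per processor next to a lending node averaging 1.0 would borrow rather than steal.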
Features of the invention may generally be described with reference to
In
Then, in the third dispatch cycle 406, P6 again borrows a thread from P1. However, the thread (Th3) borrowed this time is different from the original thread (Th1) borrowed. Again, P6 releases the thread (Th3) back to P1 when the dispatch cycle ends. During the fourth dispatch cycle 408, P6 receives its own thread to execute or receives a thread from the local MCM. P1 continues to execute its four threads, while P6 begins executing threads local to itself or to its MCM.
Returning to decision block 304, when the imbalance is not beyond the threshold required to initiate the stealing process, a next determination is made at block 310 whether the imbalance detected is at the cross-MCM borrowing threshold. When the threshold for borrowing is not surpassed, the load balancing process is ended at block 312. When the threshold is surpassed, however, the cross-MCM borrowing algorithm is activated and MCM-to-MCM borrowing of threads commences at dispatch cycle intervals, as shown at block 314. Unlike with the thread stealing algorithm, the memory locality, etc. of the borrowed thread are maintained at the MCM of the lending processor, as illustrated at block 316.
The process of
Returning to
When there are no busy local processors from which idle processor P6 can steal a thread, a next determination is made at block 206 whether there is a busy processor with available threads within MCM1 110. The algorithm causes the idle processor P6 to continue scanning the MDPS until the idle processor P6 finds a thread to borrow or steal, or until the idle processor P6 is assigned a thread and is no longer idle.
When there is a thread available from a processor of MCM1 110, the idle processor P6 receives the borrowed thread, and P6 executes the borrowed thread at block 212 during the dispatch cycle. In one embodiment, borrowing processor P6 arranges for all future data references of the borrowed thread that allocate memory to do so locally to the lending processor during the dispatch cycle, but does not move/change any of the previous allocations within the remote memory of MCM1 110. Borrowing processor P6 thus treats the borrowed thread as if it were actually being run by the lending processor.
A check is made at block 214, just prior to completion of the dispatch cycle, whether the borrowing processor will become idle again (i.e., have idle processing cycles available for allocation to a thread). If processor P6 will become idle, the borrowing algorithm again conducts a scan of the MDPS for an available thread to borrow or steal. Notably, the idle processor P6 may steal, borrow, or ignore a thread waiting in another processor's run queue, depending on determined load values. However, the present invention addresses only the borrowing of threads.
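The scan performed by an idle processor can be sketched as follows, assuming the order described above: look for a busy processor in the local MCM to steal from, and only then for a busy processor in another MCM to borrow from. The function name, the fixed busy_threshold, and the choice of the longest queue as the source are illustrative assumptions.

```python
def find_work(idle_cpu: str, queue_lengths: dict, mcm_of: dict,
              busy_threshold: int = 4) -> tuple:
    """Return (action, source_cpu) for an idle processor (hypothetical helper)."""
    busy = [cpu for cpu, length in queue_lengths.items()
            if cpu != idle_cpu and length >= busy_threshold]
    # Prefer a busy processor in the same MCM: stealing there keeps memory local.
    local = [cpu for cpu in busy if mcm_of[cpu] == mcm_of[idle_cpu]]
    if local:
        return ("steal", max(local, key=queue_lengths.get))
    # Otherwise borrow from a busy processor in another MCM.
    remote = [cpu for cpu in busy if mcm_of[cpu] != mcm_of[idle_cpu]]
    if remote:
        return ("borrow", max(remote, key=queue_lengths.get))
    return ("idle", None)


# Queue lengths taken from the illustrative load description of P1 through P8.
queues = {"P1": 4, "P2": 2, "P3": 2, "P4": 4, "P5": 2, "P6": 0, "P7": 1, "P8": 1}
mcms = {"P1": "MCM1", "P2": "MCM1", "P3": "MCM1", "P4": "MCM1",
        "P5": "MCM2", "P6": "MCM2", "P7": "MCM2", "P8": "MCM2"}
action, source = find_work("P6", queues, mcms)
```

With the queue lengths of the illustrative embodiment, P6 finds no busy processor in its own MCM and therefore borrows from P1.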
The processor P6 will not become idle following the dispatch cycle if a normal thread is assigned to the processor. The processor-assigned/scheduled thread (i.e., the local thread, which a stolen thread implicitly becomes) stays assigned to be run on the same processor, so that after each of its dispatch cycles, the thread will next be expected to run on its local processor (unless, for example, the processor becomes too busy and is forced to lend the thread to another idle processor), as shown at block 216. When the normal thread is complete, the processor again goes into the idle state, which is determined at block 218. Once processor P6 becomes idle, the borrowing/stealing algorithm is triggered to automatically search for busy processors from which to borrow/steal threads for processor P6.
In one embodiment, encountering a page fault is treated as a terminating condition for the borrowed dispatch cycle if paging input/output (I/O) is required. An assumption is made that the thread is probably going to resume executing on the owning/lending processor after the page fault is resolved. Thus, wherever the thread next runs, the page will be made resident in memory local to the thread's home MCM (unless the thread is stolen by a processor in another MCM).
As described in more detail below, there are two borrowing load average requirements: (1) the borrowing processor and its MCM overall must have “sufficient” anticipated spare time (cycles) to give away, and (2) the lending processor and its MCM must not have “sufficient” anticipated spare time (cycles) to get to the thread soon.
Several additional important details of the implementation include:
Thus, the load average of a processor is determined by sampling the length of the queue of threads awaiting execution on that processor. With borrowing being an available option, the sample becomes: queue_length + threads_sent_to_other_processors - B, where B is 1 only when the processor is running a borrowed thread, and otherwise B is 0.
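The adjusted sample can be written directly as a function. This is a minimal sketch; the parameter names are illustrative, and it assumes that the sampled queue length counts the thread currently dispatched on the processor, which the description does not state.

```python
def load_sample(queue_length: int, threads_lent_out: int,
                running_borrowed: bool) -> int:
    """Load sample for one processor when borrowing is available.

    Assumes queue_length includes the thread currently being run.
    """
    b = 1 if running_borrowed else 0   # B is 1 only while a borrowed thread runs
    # Lent threads still count against the lender; a borrowed one does not
    # count against the borrower.
    return queue_length + threads_lent_out - b
```

Under that assumption, a lending processor with three queued threads and one thread out on loan samples as 4, while an otherwise idle processor running a borrowed thread samples as 0, so borrowing does not distort either processor's load average.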
Benefits of the invention include the implementation of a new load balancing algorithm for MCM-to-MCM balancing that prevents long term degradation to the threads involved. In other words, the cross-MCM borrowing algorithm reduces the penalty for any one thread. All threads are subject to share in temporary re-allocation during the load balancing, and system performance thus remains consistent. Also, in some instances borrowing assists a processor in substantially reducing the processor's backlog.
As a final matter, it is important to note that while an illustrative embodiment of the present invention has been, and will continue to be, described in the context of a fully functional data processing system with installed management software, those skilled in the art will appreciate that the software aspects of an illustrative embodiment of the present invention are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the present invention applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of signal bearing media include recordable type media such as floppy disks, hard disk drives, CD ROMs, and transmission type media such as digital and analogue communication links.
While the invention has been particularly shown and described with reference to an illustrative embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention. For example, while the invention is specifically described with the load balancing algorithm using the thread counts to calculate and maintain the load averages, one implementation may track the relative busyness of the processors (using some mechanism other than the number of threads in the respective queues) and utilize the busy parameters within the load balancing algorithm. Also, while described as an MCM-to-MCM operation, the invention is not limited to such architectures and may be implemented by a mechanism responsible for Non-Uniform Memory Access (NUMA) architectures.