1. Technical Field
The present invention is directed to resource allocations in a computer system. More specifically, the present invention is directed to a system, apparatus and method of reducing adverse performance impact due to migration of processes from one CPU to another.
2. Description of Related Art
At any given processing time, there may be a multiplicity of processes or threads waiting to be executed on a processor or CPU of a computing system. To best utilize the CPUs of the system, an efficient mechanism that properly queues the processes or threads for execution is needed. The mechanism used by most computer systems to accomplish this task is a scheduler.
Note that a process is a program in execution. When a program is executing, it is loosely referred to as a task. In most operating systems, there is a one-to-one relationship between a task and a program. However, some operating systems allow a program to be divided into multiple tasks or threads. Such systems are called multithreaded operating systems. For the purpose of simplicity, threads and processes will henceforth be used interchangeably.
A scheduler is a software program that coordinates the use of a computer system's shared resources (e.g., a CPU). In doing so, the scheduler usually uses an algorithm such as first-in, first-out (FIFO), last-in, first-out (LIFO), round robin, a priority queue, a tree, etc., or a combination thereof. Basically, if a computer system has three CPUs (CPU1, CPU2 and CPU3), each CPU will accordingly have a ready-to-be-processed queue, or run queue. If the algorithm in use to assign processes to the run queues is the round robin algorithm and if the last process created was assigned to the queue associated with CPU2, then the next process created will be assigned to the queue of CPU3. The process created after that will be assigned to the queue associated with CPU1, and so on. Thus, schedulers are designed to give each process a fair share of a computer system's resources.
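By way of illustration only, the following sketch in C shows how such a round robin assignment of newly created processes to per-CPU run queues might look. The structure and function names (cpu_runqueue, enqueue_new_process) are hypothetical and are not drawn from any particular operating system.

    #include <stdio.h>

    #define NUM_CPUS  3
    #define MAX_QUEUE 16

    struct cpu_runqueue {
        int pids[MAX_QUEUE];   /* processes waiting to run on this CPU */
        int count;
    };

    static struct cpu_runqueue runqueues[NUM_CPUS];
    static int next_cpu = 0;   /* CPU that will receive the next new process */

    /* Assign a newly created process to the next CPU in round robin order. */
    static void enqueue_new_process(int pid)
    {
        struct cpu_runqueue *rq = &runqueues[next_cpu];
        rq->pids[rq->count++] = pid;
        printf("process %d assigned to the run queue of CPU%d\n", pid, next_cpu + 1);
        next_cpu = (next_cpu + 1) % NUM_CPUS;   /* advance to the next CPU */
    }

    int main(void)
    {
        for (int pid = 1; pid <= 7; pid++)
            enqueue_new_process(pid);
        return 0;
    }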
In certain instances, however, it may be more efficient to bind a process to a particular CPU. This may be done to optimize cache performance. For example, for cache coherency purposes, data is kept in only one CPU's cache at a time. Consequently, whenever a CPU adds a piece of data to its local cache, any other CPU in the system that has the data in its cache must invalidate the data. This invalidation may adversely impact performance since a CPU has to spend precious cycles invalidating the data in its cache instead of executing processes. But, if the process is bound to one CPU, the data may never have to be invalidated.
In addition, each time a process is moved from one CPU (i.e., a first CPU) to another CPU (i.e., a second CPU), the data that may be needed by the process will not be in the cache of the second CPU. Hence, when the second CPU is processing the process and requests the data from its cache, a cache miss will be generated. A cache miss adversely impacts performance since the CPU has to wait longer for the data. After the data is brought into the cache of the second CPU from the cache of the first CPU, the first CPU will have to invalidate the data in its cache, further reducing performance.
Note that when multiple processes are accessing the same data, it may be more sensible to bind all the processes to the same CPU. Doing so guarantees that the processes will not contend over the data and cause cache misses.
Thus, binding processes to CPUs may at times be quite beneficial.
When a CPU executes a process, the process establishes an affinity to the CPU since the data used by the process, the state of the process, etc. are in the CPU's cache. This is referred to as CPU affinity. There are two types of CPU affinity: soft and hard. In hard CPU affinity, the scheduler will always schedule a particular process to run on a particular CPU. Once scheduled, the process will not be rescheduled to another CPU even if that CPU is busy while other CPUs are idle. By contrast, in soft CPU affinity, the scheduler will first schedule the process to run on a CPU. If, however, that CPU is busy while others are idle, the scheduler may reschedule the process to run on one of the idle CPUs. Thus, soft CPU affinity may sometimes be more efficient than hard CPU affinity.
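Purely as an illustrative sketch, and not as a description of any particular scheduler, the following C fragment contrasts the two behaviors. The helpers (find_idle_cpu, schedule_thread) and the per-thread bookkeeping are assumptions made for the example.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_CPUS 4

    static bool cpu_busy[NUM_CPUS] = { true, true, false, true };
    static int  last_cpu[32];             /* CPU each thread last ran on (its affinity) */

    static int find_idle_cpu(void)
    {
        for (int c = 0; c < NUM_CPUS; c++)
            if (!cpu_busy[c])
                return c;
        return -1;                        /* no idle CPU available */
    }

    /* Pick a CPU for a ready thread under hard or soft affinity. */
    static int schedule_thread(int tid, bool hard_affinity)
    {
        int cpu = last_cpu[tid];          /* CPU the thread has affinity to */
        if (hard_affinity)
            return cpu;                   /* hard affinity: never move the thread */
        if (cpu_busy[cpu]) {
            int idle = find_idle_cpu();   /* soft affinity: move only if an idle CPU exists */
            if (idle >= 0)
                cpu = idle;
        }
        return cpu;
    }

    int main(void)
    {
        last_cpu[5] = 0;                  /* thread 5 last ran on the first CPU, which is busy */
        /* CPUs are printed 1-based to match the description above */
        printf("hard affinity -> CPU%d\n", schedule_thread(5, true)  + 1);
        printf("soft affinity -> CPU%d\n", schedule_thread(5, false) + 1);
        return 0;
    }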
However, since moving a process from one CPU to another may adversely affect performance, a system, apparatus and method are needed to circumvent or reduce any adverse performance impact that may ensue from moving a process from one CPU to another, as is customary in soft CPU affinity.
The present invention provides a system, apparatus and method of reducing adverse performance impact due to migration of processes from one processor to another in a multi-processor system. When a process is executing, the number of cycles it takes to fetch each instruction of the process is recorded. After execution of the process, an average number of cycles per instruction (CPI) is computed and stored in a storage device that is associated with the process. When a run queue of the multi-processor system is empty, a process may be chosen from the run queue that has the most processes awaiting execution and migrated to the empty run queue. The chosen process is the process that has the highest average CPI.
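By way of illustration, a minimal sketch of this selection step is given below, assuming simple fixed-size run queue structures. All structure and function names (run_queue, busiest_queue, highest_cpi_proc) are hypothetical.

    #include <stdio.h>

    #define NUM_CPUS  4
    #define MAX_PROCS 8

    struct proc {
        int    pid;
        double avg_cpi;   /* average cycles per instruction from the last run */
    };

    struct run_queue {
        struct proc procs[MAX_PROCS];
        int count;
    };

    /* Index of the run queue with the most processes awaiting execution. */
    static int busiest_queue(const struct run_queue rq[NUM_CPUS])
    {
        int busiest = 0;
        for (int i = 1; i < NUM_CPUS; i++)
            if (rq[i].count > rq[busiest].count)
                busiest = i;
        return busiest;
    }

    /* Index of the process with the highest average CPI in a run queue. */
    static int highest_cpi_proc(const struct run_queue *rq)
    {
        int best = 0;
        for (int i = 1; i < rq->count; i++)
            if (rq->procs[i].avg_cpi > rq->procs[best].avg_cpi)
                best = i;
        return best;
    }

    int main(void)
    {
        struct run_queue rq[NUM_CPUS] = {
            { .procs = { {1, 1.4}, {5, 3.2}, {9, 2.1} }, .count = 3 },
            { .procs = { {2, 1.8} },                     .count = 1 },
            { .procs = { {3, 2.5}, {7, 1.1} },           .count = 2 },
            { .count = 0 }                               /* this run queue is empty */
        };
        int from = busiest_queue(rq);
        int idx  = highest_cpi_proc(&rq[from]);
        printf("migrate process %d (average CPI %.1f) from CPU%d to the empty run queue\n",
               rq[from].procs[idx].pid, rq[from].procs[idx].avg_cpi, from + 1);
        return 0;
    }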
In one embodiment, the number of cycles it takes to fetch each piece of data is stored in the storage device rather than the average CPI. This number is averaged at the end of the execution of the process, and the average is used to select a process to migrate from the run queue having the highest number of processes awaiting execution to an empty run queue.
In another embodiment, both average CPI and cycles per data are used in determining which process to migrate. Particularly, when processes that are instruction-intensive are being executed, the average CPI is used. If instead data-intensive processes are being executed, the average number of cycles per data is used. In cases where processes that are neither data-intensive nor instruction-intensive are being executed, both the average CPI and the average number of cycles per data are used.
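As an illustrative sketch only, the following C fragment shows one way the two averages might be combined into a single migration score. The workload classification and the simple sum used for the mixed case are assumptions made for the example.

    #include <stdio.h>

    enum workload { INSTRUCTION_INTENSIVE, DATA_INTENSIVE, MIXED };

    struct exec_stats {
        double avg_cpi;              /* average cycles per instruction */
        double avg_cycles_per_data;  /* average cycles per piece of data */
    };

    /* Higher score means the process is a better candidate for migration. */
    static double migration_score(const struct exec_stats *s, enum workload w)
    {
        switch (w) {
        case INSTRUCTION_INTENSIVE: return s->avg_cpi;
        case DATA_INTENSIVE:        return s->avg_cycles_per_data;
        default:                    return s->avg_cpi + s->avg_cycles_per_data;
        }
    }

    int main(void)
    {
        struct exec_stats s = { .avg_cpi = 2.3, .avg_cycles_per_data = 5.7 };
        printf("instruction-intensive score: %.1f\n",
               migration_score(&s, INSTRUCTION_INTENSIVE));
        printf("mixed workload score:        %.1f\n",
               migration_score(&s, MIXED));
        return 0;
    }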
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
a depicts run queues of the multi-processor system with assigned processes.
b depicts the run queues after some processes have been dispatched for execution.
c depicts the run queues after some processes have received their processing quantum and have been reassigned to the respective run queues of the processors that have executed them earlier.
d depicts the run queues after some time has elapsed.
e depicts the run queue of one of the processors empty.
f depicts the run queues after one process has been moved from run queue to another run queue.
Connected to system bus 109 is memory controller/cache 111, which provides an interface to shared local memory 109. I/O bus bridge 110 is connected to system bus 109 and provides an interface to I/O bus 112. Memory controller/cache 111 and I/O bus bridge 110 may be integrated as depicted.
Peripheral component interconnect (PCI) bus bridge 114 connected to I/O bus 112 provides an interface to PCI local bus 116. A number of modems may be connected to PCI local bus 116. Typical PCI bus implementations will support four PCI expansion slots or add-in connectors. Communications links to a network may be provided through modem 118 and network adapter 120 connected to PCI local bus 116 through add-in boards.
Additional PCI bus bridges 122 and 124 provide interfaces for additional PCI local buses 126 and 128, from which additional modems or network adapters may be supported. In this manner, data processing system 100 allows connections to multiple network computers. A memory-mapped graphics adapter 130 and hard disk 132 may also be connected to I/O bus 112 as depicted, either directly or indirectly.
Those of ordinary skill in the art will appreciate that the hardware depicted in
The data processing system depicted in
The operating system generally includes a scheduler, a global run queue, one or more per-processor local run queues, and a kernel-level thread library. In this case, since only the per-processor run queues are needed to explain the invention, only those will be shown.
According to the content of the run queues, the scheduler has already assigned threads Th1, Th5, Th9 and Th13 to CPU1 202. Threads Th2, Th6, Th10 and Th14 have been assigned to CPU2 204, while threads Th3, Th7, Th11 and Th15 have been assigned to CPU3 206 and threads Th4, Th8, Th12 and Th16 have been assigned to CPU4 208.
In order to inhibit one thread from preventing other threads from running on an assigned CPU, the threads have to take turns running on the CPU. Thus, another duty of the scheduler is to assign units of CPU time (e.g., quanta or time slices) to threads. A quantum is typically very short in duration, but threads receive quanta so frequently that the system appears to run smoothly, even when many threads are performing work.
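By way of illustration only, the following sketch shows quantum accounting on a single CPU: each timer tick charges the running thread, and when the thread's quantum expires it is placed back at the tail of the run queue and the next thread is dispatched. The tick granularity and all names are assumptions made for the example.

    #include <stdio.h>

    #define QUANTUM_TICKS 10

    struct thread {
        int tid;
        int ticks_left;
    };

    static struct thread runqueue[4] = { {1, 0}, {5, 0}, {9, 0}, {13, 0} };
    static int rq_count = 4;

    /* Remove the thread at the head of the run queue and give it a full quantum. */
    static struct thread dispatch_next(void)
    {
        struct thread t = runqueue[0];
        for (int i = 1; i < rq_count; i++)
            runqueue[i - 1] = runqueue[i];
        rq_count--;
        t.ticks_left = QUANTUM_TICKS;
        return t;
    }

    /* Put a thread whose quantum expired back at the tail of the run queue. */
    static void requeue(struct thread t)
    {
        runqueue[rq_count++] = t;
    }

    int main(void)
    {
        struct thread current = dispatch_next();
        for (int tick = 0; tick < 35; tick++) {
            if (--current.ticks_left == 0) {      /* quantum expired */
                printf("tick %2d: Th%d quantum expired, requeued\n", tick, current.tid);
                requeue(current);
                current = dispatch_next();        /* give the CPU to the next thread */
            }
        }
        return 0;
    }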
Every time one of the following situations occurs, the scheduler must make a CPU scheduling decision: a thread's quantum on the CPU expires, a thread waits for an event to occur, or a thread becomes ready to execute. In order not to obfuscate the disclosure of the invention, only the case where a thread's quantum on the CPU expires will be explained. However, it should be understood that the invention applies equally to the other two cases.
Suppose Th1, Th2, Th3 and Th4 are dispatched for execution by CPU1, CPU2, CPU3 and CPU4, respectively. Then the run queue of each CPU will be as shown in
Since Th1 ran on CPU1, any data as well as instructions that it may have used while being executed will be in the integrated L1 cache of processor 101 of
Suppose that after some time has elapsed and after some threads have terminated, etc., the run queues of the CPUs are populated as shown in
According to the invention, after a thread has run on a CPU, some statistics about the execution of the thread may be saved in the thread's structure. For example, the number of instructions that were found in the caches (i.e., L1, L2, etc.) as well as in RAM 109 may be entered in the thread's structure. Likewise, the number of pieces of data found in the caches and RAM is also stored in the thread's structure. Further, the number of cache misses that occurred (for both instructions and data) during the execution of the thread may be recorded as well. Using these statistics, a CPI (cycles per instruction) may be computed. The CPI reveals the cache efficiency of the thread when executed on that particular CPU.
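A minimal sketch of such a per-thread statistics record and of the CPI computed from it is shown below. The counter names are hypothetical; a real implementation would obtain the counts from hardware performance counters.

    #include <stdio.h>

    struct thread_stats {
        unsigned long instructions_completed; /* instructions executed during the run */
        unsigned long cycles_used;            /* CPU cycles consumed during the run */
        unsigned long icache_misses;          /* instruction cache misses */
        unsigned long dcache_misses;          /* data cache misses */
    };

    /* Cycles per instruction for the last run of the thread on a given CPU. */
    static double compute_cpi(const struct thread_stats *s)
    {
        if (s->instructions_completed == 0)
            return 0.0;
        return (double)s->cycles_used / (double)s->instructions_completed;
    }

    int main(void)
    {
        struct thread_stats th1 = { .instructions_completed = 1000000,
                                    .cycles_used            = 2400000,
                                    .icache_misses          = 1200,
                                    .dcache_misses          = 5400 };
        printf("Th1 CPI on CPU1: %.2f\n", compute_cpi(&th1));
        return 0;
    }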
The CPI may be used to determine which thread from a group of threads in a local run queue to reassign to another local run queue. Particularly, the thread with the highest CPI may be re-assigned from one CPU to another with the least adverse impact on performance since that thread already had a low cache efficiency. Returning to
f depicts the run queues of the CPUs after thread Th1 is reassigned to CPU4.
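Purely for illustration, the following sketch shows the reassignment itself: the selected thread is removed from its current run queue and appended to the empty one. The queue contents and thread names are illustrative only.

    #include <stdio.h>
    #include <string.h>

    #define MAX_THREADS 8

    struct run_queue {
        const char *cpu;
        const char *threads[MAX_THREADS];
        int count;
    };

    /* Move the thread at position idx in src to the tail of dst. */
    static void migrate(struct run_queue *src, int idx, struct run_queue *dst)
    {
        dst->threads[dst->count++] = src->threads[idx];
        memmove(&src->threads[idx], &src->threads[idx + 1],
                (src->count - idx - 1) * sizeof(src->threads[0]));
        src->count--;
    }

    int main(void)
    {
        struct run_queue cpu1 = { .cpu = "CPU1",
                                  .threads = { "Th5", "Th1", "Th9" },
                                  .count = 3 };
        struct run_queue cpu4 = { .cpu = "CPU4", .count = 0 };

        migrate(&cpu1, 1, &cpu4);     /* Th1 had the highest CPI on CPU1 */

        printf("%s now holds %s; %s has %d threads left\n",
               cpu4.cpu, cpu4.threads[0], cpu1.cpu, cpu1.count);
        return 0;
    }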
In some instances, instead of using the CPI number to determine which thread to migrate from one run queue to another, a different number may be used. For example, in cases where instruction-intensive threads are being executed, the instruction cache efficiency of the thread may be used (i.e., the number of CPU cycles it takes for the CPU to obtain an instruction from storage). Likewise, in cases where data-intensive threads are being executed, the data cache efficiency of the thread may be used (i.e., the number of CPU cycles it takes the CPU to obtain data from storage).
If data is to be fetched instead of instructions, the first piece of data will be fetched (step 312). As the data is being fetched, the number of cycles it actually takes to obtain the data will be counted and recorded. If there is more data to fetch, the next piece of data will be fetched; otherwise, the process will return to step 302. The process ends when execution of the thread has terminated (steps 314, 316 and 318).
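As an illustrative sketch of this accounting loop, assuming a hypothetical cycle counter and data-fetch stub (read_cycle_counter, fetch_next_data), the following C fragment counts the cycles taken by each fetch and averages them at the end.

    #include <stdio.h>

    static unsigned long fake_cycles = 0;

    /* Stand-in for reading a hardware cycle counter (e.g., a time base register). */
    static unsigned long read_cycle_counter(void)
    {
        return fake_cycles;
    }

    /* Stand-in for fetching one piece of data; pretends the access costs some cycles. */
    static void fetch_next_data(int index)
    {
        fake_cycles += (index % 2 == 0) ? 4 : 40;   /* cache hit vs. cache miss */
    }

    int main(void)
    {
        unsigned long total_cycles = 0;
        int pieces = 8;

        for (int i = 0; i < pieces; i++) {
            unsigned long before = read_cycle_counter();
            fetch_next_data(i);                                  /* fetch the data (step 312) */
            total_cycles += read_cycle_counter() - before;       /* count cycles for this fetch */
        }

        /* Average cycles per piece of data, recorded with the thread after it ends. */
        printf("average cycles per data: %.1f\n", (double)total_cycles / pieces);
        return 0;
    }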
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. For example, threads of fixed priorities may be used rather than of variable priorities. Thus, the embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
This application is related to co-pending U.S. patent application Ser. No. ______ (IBM Docket No. AUS920040033), entitled SYSTEM, APPLICATION AND METHOD OF REDUCING CACHE THRASHING IN A MULTI-PROCESSOR WITH A SHARED CACHE ON WHICH A DISRUPTIVE PROCESS IS EXECUTING, filed on even date herewith and assigned to the common assignee of this application, the disclosure of which is herein incorporated by reference.