This invention relates to the field of computer systems, and in particular to a method and program for efficiently accessing shared resources in a multiprocess, or multitask, system, such as a parallel processing system.
Resources within a multitask system are often configured to appear available to multiple tasks simultaneously. A single network interface card on a node of a network, for example, provides a single communication channel to the network, but time-shares this channel among the tasks so that all of the tasks appear to be communicating on the network 'simultaneously'. In like manner, common memory, such as system memory, is time-shared among multiple tasks, and application memory is shared among the multiple parallel tasks of an application that is processed in parallel.
The sharing of memory, and of other system resources, must be properly synchronized to ensure that only one process modifies or accesses the resource at any given time. For example, when a process adds an amount to a system variable, the process will typically read the value of the variable in the system memory, add the amount to it, and store the resultant sum to the system memory. If two processes each want to add an amount to the system variable, care must be taken so that each process retains control of the system memory from the time that the value is read until the time that the sum is stored. In like manner, if two processes write records to a file, care must be taken so that each process completes the writing of its entire record before the other process commences the writing of its record.
“Locks” are commonly provided by computer operating systems to prevent the simultaneous access to a resource by more than one task. When a task “locks” a resource, the operating system prevents other tasks from accessing the resource until the task “unlocks” the resource. The term “mutex” is commonly used to describe a mutual exclusion program object that allows a task to lock an associated resource to prevent other tasks from accessing the resource. Typically, a mutex bit is associated with each lockable resource; if the bit value is zero, the resource is unlocked, otherwise, it is locked. The task that sets the mutex to one is the only task that is permitted to set the mutex to zero.
In a straightforward embodiment, when a task desires access, it continually loops until the mutex bit is zero, then sets the bit to one, performs its intended process with the resource, then sets the bit to zero. Such a lock is termed a “spinlock”, in that it requires requesting tasks to loop, or “spin”, while waiting for the resource to be unlocked. This spinning, however, consumes processing time, as the processor repeatedly reads the bit to determine when the resource is unlocked. If other tasks subsequently attempt to access the resource, they will also place themselves in a spin mode, continually checking the status of the lock bit. In a simple time-slice multitasking system, if N tasks out of M total tasks are waiting for the resource, N/M of the total CPU cycles will be consumed in merely reading the lock bit, as each of the N tasks merely spin during their allocated time-slice.
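The spin-wait just described can be sketched in C; the names spin_acquire and spin_release are illustrative, and C11 atomics supply the atomic test-and-set that a practical spinlock requires, a detail the plain read-then-write description above glosses over:

```c
#include <stdatomic.h>

/* one mutex bit per lockable resource: clear = unlocked, set = locked */
typedef atomic_flag spin_bit_t;

void spin_acquire(spin_bit_t *bit) {
    /* loop ("spin") until the bit is observed clear, then set it; the
       test-and-set is atomic, so only one task can win the bit */
    while (atomic_flag_test_and_set(bit))
        ;   /* each pass consumes processor time merely re-reading the bit */
}

void spin_release(spin_bit_t *bit) {
    /* only the task that set the bit should clear it */
    atomic_flag_clear(bit);
}
```

A single bit of state per resource is all this scheme requires, which is why the spinning itself, rather than the lock's storage, is its dominant cost.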
To avoid the inefficiencies of spinlocks, conventional operating systems provide mechanisms for queuing tasks that are waiting for a locked resource. When a task attempts to access a mutex-locked resource, the operating system detaches the task from execution ("parks" the task), and places the task in a first-in-first-out queue. When the resource is unlocked by the task that initially locked it, the next task in the queue is reactivated ("unparked"), granted access to the resource, and the resource is again locked. In this manner, if N out of M tasks are waiting for the resource, they will be parked, and the CPU cycles will be allocated among the M−N active/unparked tasks.
Generally, the parking of tasks that are awaiting access to a locked resource is performed automatically by a multitask processor, and is transparent to the application-level program. For the purposes of this disclosure, the term “native mutex” is used hereinafter to define a mutex scheme that is provided by an operating system to provide CPU-efficient access control to a resource by automatically parking tasks that are waiting to access a currently-accessed resource.
The automatic parking of native mutex schemes also allows the multitask processor to allocate access to a resource fairly, or to allocate access to the resource based on a priority scheme, and so on. U.S. Pat. No. 6,480,918, “LINGERING LOCKS WITH FAIRNESS CONTROL FOR MULTI-NODE COMPUTER SYSTEMS”, 12 Nov. 2002; U.S. patent application Publication Ser. No. 2003/0041183, “SYNCHRONIZATION OBJECTS FOR MULTI-COMPUTER SYSTEMS”, 27 Feb. 2003; and U.S. patent application Publication Ser. No. 2003/0131168, “ENSURING FAIRNESS IN A MULTIPROCESSOR ENVIRONMENT USING HISTORICAL ABUSE RECOGNITION IN SPINLOCK ACQUISITION”, 10 Jul. 2003, are examples of embodiments of native mutex schemes in conventional operating systems, and are each incorporated by reference herein.
Although native mutex schemes provide for overall CPU efficiency, they do not necessarily result in performance efficiency for a given application program, due to the overhead associated with the parking/unparking process. In some instances, an application program may experience a 10:1 or even 100:1 degradation in speed due to native mutex conflicts. In some applications, such as real-time processing, such degradation may prevent the application from performing its function, and in other applications, such as the simulation of complex systems, such degradation may extend the elapsed time beyond feasible limits. Although a priority-based mutex scheme may alleviate some of this degradation, the improvement in performance provided by a higher priority may not be sufficient to provide adequate performance. Additionally, a priority-based system is generally ineffective if the multiple tasks that are competing for the resource are associated with a single application on a parallel processor system, because the priority is generally allocated per application, not per sub-task within an application.
In many instances, applications that require efficient processing must forego the advantages provided by conventional operating systems, because of the side-effects caused by native functions within the operating system, such as the side-effect of queuing and parking produced by the operating system's implementation of a “fair” resource sharing technique. In such instances, the developer must either find another operating system that does not have the particular side effect that degrades the application program's performance, or must custom design an operating system to avoid such side effects.
An objective of this invention is to provide a means of avoiding the inefficiencies and overhead associated with native mutexes of conventional operating systems. A further objective of this invention is to provide a means of avoiding the inefficiencies associated with native mutexes without requiring major changes to application programming techniques. It is a further objective of this invention to provide a means of automatically improving the performance of existing application programs.
These objectives, and others, are achieved by embedding native mutex locks within an application-controlled lock. Each of these locks is applied to the same resource, in such a manner that, in select applications, and particularly in parallel-processed applications, the adverse effects of the inner native mutex lock are avoided. In a preferred embodiment, each call to a system routine that is known to invoke a native mutex is replaced by a call to a corresponding routine that spinlocks the resource before calling the system routine that invokes the native mutex, then releases the spinlock when the system call is completed. By locking the resource before the native mutex is invoked, the calling task is assured that the resource is available to it when the native mutex is invoked, and therefore the task will not be parked by the native mutex.
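A minimal sketch of such an encapsulating routine, here a hypothetical s_malloc wrapped around the standard malloc, might look as follows in C. The spinlock (C11 atomics; the disclosure does not prescribe a particular spinlock implementation) serializes all encapsulated callers, so that by the time the native mutex inside malloc is reached it finds the allocator uncontended and never parks the calling task:

```c
#include <stdlib.h>
#include <stdatomic.h>

/* application-controlled spinlock guarding the allocator */
static atomic_flag alloc_spin = ATOMIC_FLAG_INIT;

/* replaces each call to malloc() in the application */
void *s_malloc(size_t size) {
    while (atomic_flag_test_and_set(&alloc_spin))
        ;                        /* spin until the resource is ours */
    void *p = malloc(size);      /* the native mutex inside malloc now
                                    finds the allocator available, so the
                                    calling task is not queued or parked */
    atomic_flag_clear(&alloc_spin);
    return p;
}
```

This only holds, of course, if every task in the application reaches the allocator through the wrapper; a task calling malloc directly could still contend on the native mutex.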
The invention is explained in further detail, and by way of example, with reference to the accompanying drawings wherein:
Throughout the drawings, the same reference numerals indicate similar or corresponding features or functions. The drawings are included for illustrative purposes and are not intended to limit the scope of the invention.
By way of background, the conventional malloc function 120 allocates a block of system memory (sysmem) to a process 110 upon request for a desired size of memory. All tasks that require memory allocation from the system memory call this function 120. A pointer (alloc_ptr) is maintained by the system that controls the memory, and points to the next available unallocated memory location. Assuming a sequential allocation of memory, the start of the allocated memory block (memstart) will be the pointer's current value, at 122, and the pointer will be advanced by the size of the allocated block, at 123, in preparation for the next call for memory allocation, by the same task or any other task. Note that if another task were to call for a memory allocation between steps 122 and 123, and access to the allocation pointer (alloc_ptr) were not controlled, this other task would read the same value from alloc_ptr as the first task, and both tasks would use that location as the start of their allocated memory. To prevent the allocation of the same memory to multiple tasks, the allocation pointer (alloc_ptr) is controlled within the malloc function by a native mutex function, at 121 and 124.
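The sequence of steps 121-124 can be sketched as follows; alloc_ptr and memstart follow the names in the text, the name malloc_sketch is illustrative, and a pthread mutex stands in for the operating system's native mutex function:

```c
#include <stddef.h>
#include <pthread.h>

#define SYSMEM_SIZE 4096
static char sysmem[SYSMEM_SIZE];        /* the shared system memory   */
static size_t alloc_ptr = 0;            /* next unallocated location  */
static pthread_mutex_t sysmem_mutex = PTHREAD_MUTEX_INITIALIZER;

void *malloc_sketch(size_t size) {
    pthread_mutex_lock(&sysmem_mutex);   /* 121: mutex_acquire        */
    if (alloc_ptr + size > SYSMEM_SIZE) {
        pthread_mutex_unlock(&sysmem_mutex);
        return NULL;                     /* system memory exhausted   */
    }
    void *memstart = &sysmem[alloc_ptr]; /* 122: read current pointer */
    alloc_ptr += size;                   /* 123: advance by the size  */
    pthread_mutex_unlock(&sysmem_mutex); /* 124: mutex_release        */
    return memstart;
}
```

Because steps 122 and 123 sit between the acquire and the release, no two tasks can observe the same alloc_ptr value, which is precisely the hazard described above.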
At 121, the mutex_acquire function is called, to request a lock on the system memory, the resource to which the allocation pointer (alloc_ptr) is associated. As discussed above, and as detailed below with regard to
At 124, the mutex_release function is called, to release the lock on the system memory. If any tasks remain in the queue, the system memory is assigned to the next task in the queue, or to a task in the queue that is given higher priority than the default first-in first-out queuing scheme.
Note that the example malloc function 120 assures that only one task accesses the system memory at any given time, independent of the particular calling task 110. Other system-provided or library-provided functions employ similar techniques for protecting shared resources from simultaneous use. The calling task 110 has no control over how the exclusive control of the resource is provided by the provided function 120, and thus cannot directly overcome any inefficiencies that the function 120 may introduce to the calling tasks. As noted above, a variety of schemes have been proposed for assuring that multiple tasks are given an equal opportunity to access each resource, or, in the case of priority-based queue processing, that high priority tasks are given appropriately more or quicker access to each resource, but these schemes are also beyond the tasks' direct control, so that if inefficiencies result, the conventionally programmed tasks have no direct means of avoiding such inefficiencies.
The parallel processing of a multitask process often suffers from the inefficiencies of the use of native mutex techniques to control access to a shared resource, due to the overhead associated with parking and unparking tasks that call for access to currently-locked resources. Consider, for example, partitioning a multitask process having M tasks that are distributed among N processors that operate in parallel and share a common resource, such as allocate-able system memory. If K tasks request access to the resource concurrently, K−1 tasks will be put on the queue and deactivated/parked, leaving M−(K−1) active tasks. If the number of remaining active tasks is equal to or greater than N, then the N processors will be productively used. If, on the other hand, the number of active tasks is less than N, a number (N−(M−(K−1))) of processors will be left unused, and the overhead associated with parking and unparking (N−(M−(K−1))) of the tasks will have been needlessly incurred, since those tasks could have remained active on the otherwise-idle processors.
Consider also a single application running on N processors with a sufficient number of tasks M to keep the N processors occupied continuously. Assume that, on average, there are L concurrent requests for a particular single-access asset, that each access incurs T1 time units, and that parking/unparking a task incurs T2 time units. Without a native mutex, each of the L concurrent requests will wait (L−1)*T1 time units before gaining access to the asset. While each of the L tasks is waiting, L other tasks will not be processed by the processors that are being used for these L tasks. With a native mutex, L−1 of these other tasks will be processed while L−1 tasks are parked. The cost of parking/unparking these L−1 tasks is (L−1)*T2 time units, and the gain in processing the other tasks will be (L−1)*(L−1)*T1 time units. Therefore, if (L−1)²*T1 is greater than (L−1)*T2, an overall gain is achieved; otherwise, the parking/unparking overhead exceeds the gain provided by the native mutex. Stated another way, if the average number of concurrent accesses L is greater than one and less than (T2+T1)/T1, the parking/unparking overhead caused by the native mutex will result in an overall inefficiency.
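The break-even condition above can be checked numerically; the helper below (an illustrative name, not part of the disclosure) simply evaluates the two quantities from the text:

```c
/* with L concurrent requests, native-mutex parking costs (L-1)*T2 time
   units and recovers (L-1)*(L-1)*T1 time units of otherwise-idle
   processing; the mutex pays off only when the gain exceeds the cost,
   i.e. when L > (T2 + T1) / T1 */
int native_mutex_pays_off(int L, double T1, double T2) {
    double cost = (L - 1) * T2;
    double gain = (double)(L - 1) * (L - 1) * T1;
    return gain > cost;
}
```

For example, with T1 = 1 and T2 = 10, the threshold (T2+T1)/T1 is 11: any average concurrency between 2 and 11 makes the native mutex a net loss.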
As noted above, an objective of this invention is to avoid the inefficiencies that are introduced by native mutex schemes. As also noted above, however, many system and library functions contain calls to native mutex functions, and these system and library functions, as well as the native mutex functions themselves, are beyond the direct control of an application program developer.
In accordance with a first aspect of this invention, each system or library function that employs a native mutex process that causes, or is expected to cause, inefficiencies due to the parking and unparking of active processes is encapsulated within another function that is specifically designed to prevent the native mutex process from parking the calling process.
The encapsulating function s_malloc 220 performs the same operational function as the replaced function 120 in the application program 210, and thus the operation of the application program 210 in
When a native mutex control technique 300 receives a request to acquire a mutex resource from a particular task, the resource is checked to determine whether it is already locked, at 310. If, at 310, the resource is not locked, the resource is locked for use by the requesting task, at 320, and control returns to the calling routine, at 340 (in the example of
If the resource is locked by another task, an identification of the requesting task is placed in an access queue for this resource, at 350, and the task is deactivated/parked, at 360. Control does not return to the calling routine until this task rises to the top of the queue, the resource becomes unlocked from its prior task and locked to this task, and the task is reactivated/unparked (not illustrated). As noted above, this queuing 350, parking 360, and unparking (not illustrated) process can introduce a significant degradation in the performance of applications that frequently seek access to shared resources, because generally these processes consume orders of magnitude more time than the locking 320 and unlocking (not illustrated) processes that are invoked when the resource is immediately available for locking by the requesting task.
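The acquire/release flow of steps 310-360 can be sketched with POSIX primitives, a condition variable standing in for the operating system's park/unpark machinery. All names here are illustrative, and pthread_cond_signal does not guarantee the first-in-first-out order described above:

```c
#include <pthread.h>

typedef struct {
    int locked;                 /* 0 = free, 1 = held             */
    pthread_t owner;            /* task that currently holds it   */
    pthread_mutex_t guard;      /* protects this structure itself */
    pthread_cond_t queue;       /* parked tasks wait here         */
} native_mutex_t;

void native_mutex_acquire(native_mutex_t *m) {
    pthread_mutex_lock(&m->guard);
    /* 330: already locked by this task? return at once, no parking */
    if (m->locked && pthread_equal(m->owner, pthread_self())) {
        pthread_mutex_unlock(&m->guard);
        return;
    }
    while (m->locked) {                   /* 310: resource locked?    */
        /* 350-360: queue the task and park it until it is unparked */
        pthread_cond_wait(&m->queue, &m->guard);
    }
    m->locked = 1;                        /* 320: lock to this task   */
    m->owner = pthread_self();
    pthread_mutex_unlock(&m->guard);      /* 340: return to caller    */
}

void native_mutex_release(native_mutex_t *m) {
    pthread_mutex_lock(&m->guard);
    m->locked = 0;
    pthread_cond_signal(&m->queue);       /* unpark a queued task     */
    pthread_mutex_unlock(&m->guard);
}
```

The pthread_cond_wait at steps 350-360 is the expensive path: it deschedules the thread entirely, which is why it costs orders of magnitude more than the uncontended path through step 320.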
For the purposes of this disclosure, the term “spinlock” is defined as any locking scheme that facilitates the locking of a resource to a requesting task without the possibility of deactivating or parking the requesting task. Conversely, the term “native mutex”, or “native mutex lock” is defined as any locking scheme that facilitates the locking of a resource to a requesting task when the resource next becomes available, and also facilitates the queuing and deactivation of the task while the resource is unavailable. In accordance with this invention, a spinlock 400 is placed before a call to a system function 450 that includes a call to a native mutex lock 300 that is expected to degrade the performance of the application by parking and unparking tasks within the application.
The spinlock_acquire function 400 initially determines whether the requested resource is locked, at 410, and if it is not currently locked, locks the resource to the requesting task, at 420, and returns control to the calling program (e.g. 220 in
Note that this spinlock process 400 does not place the calling task in a queue, and does not park the task while it waits for the resource to become available. In principle, this spinlock process could lead to program inefficiency, because the requesting task competes with every other process that is attempting to access the resource, and there is no guarantee that the requesting task will ever exit the loop 410-430. However, in certain applications, discussed below, this spinlock process 400 increases program efficiency by preventing the subsequently called native mutex lock 300 from parking the requesting task.
Upon acquiring the spinlock on the resource, the original system function call 450 (malloc, in the example call at 222 in
Because the call to the native_mutex_acquire function 300 occurs after the resource is locked to the requesting task by the spinlock function 400, the “resource locked?” test, at 310, must result in a “yes”, and the “locked by this task”, at 330, must also result in a “yes”, thereby preventing a branch to the queuing and parking steps 350-360 that produce program inefficiencies. Therefore, with reference to
As noted above, the use of the spinlock function 400 can result in program inefficiencies, particularly if the requested resource is continually requested by many other competing processes. However, there are particular situations wherein the use of the spinlock 400 before a call to a native mutex 300 can provide significant performance improvements.
Of particular note, consider an application program that is executed on multiple processors using parallel processing techniques. Often, such parallel processing is performed because the application program requires it to perform its task properly (e.g. real time processing systems), or because the turn-around time of the application program using a single processor would prove impractical (e.g. simulation of large systems). Generally, because of the need for fast processing, these applications are given priority over other processes that are run on the parallel-processing system, and/or are run alone, or almost alone, on the system. In these situations, the application program primarily competes with itself for access to common resources, in that the only tasks, or the large majority of tasks, that are competing for the resource are the sub-tasks of the application program that are each being run as a parallel task.
If a particular resource is "saturated" or "over tasked", i.e. receives more requests per unit time than the system can service, or is near that point, the use of a spinlock 400 as taught in this disclosure will, in general, degrade the performance of the application. If, on the other hand, the particular resource is "moderately tasked" or "lightly tasked", the use of the spinlock 400 to encapsulate system calls as taught in this disclosure can be expected to substantially improve the performance of the application, by avoiding the queuing and parking of sub-tasks when the resource is temporarily unavailable.
This invention can be embodied in an existing application in a relatively straightforward manner. When a particular system routine is identified as being the cause of inefficiencies related to native mutex queuing and parking, the source code of the application program can be searched for each call to the system routine, and replaced by a substitute call to the routine that encapsulates the system routine within a spinlock. The encapsulating routine is created within the application program and/or within a supporting library of subroutines and functions, and the original calls to the system routine are replaced by calls to this encapsulating routine. In the example of
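In C source, for example, the substitution can often be made without editing each call site, by letting the preprocessor rename the routine. This is a hypothetical sketch (the instrumentation counter exists only to make the redirection visible); the spinlock inside the encapsulating routine is omitted here for brevity:

```c
#include <stdlib.h>

static int s_malloc_calls = 0;   /* instrumentation for illustration only */

/* the encapsulating routine (spinlock acquire/release omitted) */
void *s_malloc(size_t size) {
    s_malloc_calls++;
    return malloc(size);         /* still the real allocator here */
}

/* from this point on, every textual call to malloc() in the application
   source compiles into a call to the encapsulating routine instead;
   the library's own malloc is untouched */
#define malloc(size) s_malloc(size)
```

Placing the #define in a project-wide header achieves the "search and replace" described above in a single line, provided the header is included after <stdlib.h> and before any application call to malloc.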
Alternatively, if the source code of the existing application is not available for modification, or not permitted to be modified, the object code of the application can be amended by replacing each branch to the address of the system routine with a branch to the address of the encapsulating routine. In like manner, the symbolic address of the system routine can be mapped to the address of the encapsulating routine in the linker/loader that is used to create the object code from the compiled code. These and other techniques for replacing calls to a given system routine with calls to a routine that encapsulates the system routine within a spinlock will be evident to one of ordinary skill in the art.
The foregoing merely illustrates the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are thus within the spirit and scope of the following claims.
In interpreting these claims, it should be understood that:
This application claims the benefit of U.S. Provisional Application 60/497,714 filed 25 Aug. 2003.