The subject matter described herein relates to techniques for selectively allocating and deallocating memory on a processor-by-processor basis in a multi-threaded multi-processor computing system.
Memory management acts to dynamically allocate portions of memory to processes on request and to free such portions of memory when they are no longer needed. Memory management scalability is particularly critical in systems simultaneously executing numerous threads.
Some memory management systems utilize global locks across numerous processors, which seriously limits scalability and thus throughput. Other memory management systems that associate memory with threads necessarily have a high overhead when many threads are running because each thread has its own private memory pool.
In one aspect, allocators are instantiated for each of a plurality of processors in a multi-threaded multi-processor computing system. These allocators selectively allocate and deallocate, for each processor, memory to threads executing on the processor.
The allocator can be associated with a lock protecting an internal structure of the allocator. If a first allocator is not able to allocate requested memory, the first allocator can unlock its lock and request one or more other allocators to execute the requested memory allocation under their respective lock. The first allocator can repeatedly request the one or more other allocators to execute the requested memory allocation until the request is fulfilled. A second allocator can fulfill the request of the first allocator. The second allocator can be associated with a memory block allocated to fulfill the request. An out-of-memory error can be returned by the first allocator to a corresponding caller if no other allocator can fulfill the request.
At least one allocator can reference a particular memory block. A pointer to the allocator can be stored in a header of a memory block prior to an actual data area of the block that is going to be used by a corresponding caller. Alternatively, the pointer to the associated allocator can be stored external to the memory block. A memory block can be deallocated by its associated allocator under the lock of the allocator if the thread requesting the deallocation runs on the same processor with which this allocator is associated. Deallocated memory blocks can be put on a memory free list upon deallocation, with the free list identifying memory available for allocation.
In an interrelated aspect, an associated allocator is instantiated for each of a plurality of processors in a multi-threaded multi-processor computing system. Thereafter, it is determined, by a first allocator, that it cannot allocate memory. The first allocator then polls a plurality of second allocators to identify one of the second allocators that can allocate memory. The identified second allocator can then allocate memory on behalf of the first allocator.
Articles of manufacture are also described that comprise computer executable instructions permanently stored on computer readable media, which, when executed by a computer, cause the computer to perform the operations described herein. Similarly, computer systems are also described that may include a processor and a memory coupled to the processor. The memory may temporarily or permanently store one or more programs that cause the processor to perform one or more of the operations described herein. In addition, methods can be implemented by one or more data processors either within a single computing system or distributed among two or more computing systems.
The subject matter described herein provides many advantages. For example, the techniques described herein enable more rapid memory management in a multi-CPU environment in which many concurrent threads are running. In addition, the current subject matter is advantageous in that it can be implemented in systems that cannot be re-written to support cooperative multitasking (e.g., systems with legacy code, etc.), yet it scales practically linearly with an increasing number of CPUs while having an overhead comparable to that of cooperative multitasking allocators.
The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims.
Memory allocation. Each allocator 210i . . . n can be associated with a mutual exclusion object (mutex), which in turn can be locked during actual memory allocation. The lock can protect against preemption inside of allocation code (as described below). Inside the locked scope, some cleanup may be done (see memory deallocation below) and then the allocation itself is executed. The allocator 210i . . . n allocating the memory block is associated with the memory block (see deallocation for details), and the memory can be returned to the caller (i.e., a thread). As only one thread uses a particular CPU 220i . . . n at any given time, normally no lock conflict occurs, except when the memory allocation is pre-empted and the CPU 220i . . . n starts processing another thread, which by chance also wants to allocate memory. As memory allocation routines are highly optimized, the locked scope duration is very short and thus the probability of lock collision is negligible.
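The locked allocation path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the described system: the names CpuAllocator, allocate_on and capacity are hypothetical, a block is modeled as a plain dictionary, each allocator manages only a byte budget rather than real memory, and the CPU on which a thread runs is passed in explicitly because the sketch does not query the operating system for it.

```python
import threading


class CpuAllocator:
    """Hypothetical per-CPU allocator that manages a fixed byte budget."""

    def __init__(self, cpu_id, capacity):
        self.cpu_id = cpu_id
        self.free_bytes = capacity
        self.lock = threading.Lock()  # mutex protecting the internal state
        self.next_block_id = 0

    def allocate(self, size):
        # The whole allocation runs inside the locked scope, so a thread that
        # preempts another one on the same CPU never sees half-updated state.
        with self.lock:
            if size > self.free_bytes:
                return None  # this allocator cannot fulfill the request
            self.free_bytes -= size
            self.next_block_id += 1
            # The block records its allocating allocator for later deallocation.
            return {"id": self.next_block_id, "size": size, "owner": self}


# One allocator per CPU (four CPUs assumed for the example).
allocators = [CpuAllocator(cpu_id, capacity=1024) for cpu_id in range(4)]


def allocate_on(cpu_id, size):
    """Allocate from the allocator of the CPU the calling thread runs on."""
    return allocators[cpu_id].allocate(size)


block = allocate_on(2, 100)
too_large = allocate_on(2, 2000)
```

Because the lock is held only for the few statements that update the allocator state, a collision requires a preemption inside that short window, which matches the argument above that collisions are rare.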
If memory gets very low (i.e., the amount of available memory has been reduced based on consumption by the thread(s)), a CPU-specific allocator 210i . . . n may not be able to fulfill an allocation request. In this case, the allocator 210i . . . n can attempt to “borrow memory” from other CPU-specific allocators 210i . . . n. First, the allocator 210i . . . n unlocks its lock (the allocation is not executed by this allocator) and then requests one or more of the other CPU-specific allocators 210i . . . n to execute the allocation (each under its own lock), until the request can be fulfilled. If another allocator 210i . . . n can fulfill the allocation request, then this other allocator 210i . . . n will be associated with the newly-allocated memory block, so that the deallocation knows which allocator 210i . . . n allocated the block. If no other allocator 210i . . . n can fulfill the allocation request, then an out-of-memory error can be returned to the caller.
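The borrowing fallback can be illustrated with the following sketch. It is a simplified model under assumptions: the names CpuAllocator, OutOfMemory, make_allocators and peers are hypothetical, allocators track a byte budget instead of real memory, and peers are simply asked in a fixed order, which the text above does not prescribe.

```python
import threading


class OutOfMemory(Exception):
    """Raised when no allocator can fulfill the request."""


class CpuAllocator:
    def __init__(self, cpu_id, capacity):
        self.cpu_id = cpu_id
        self.free_bytes = capacity
        self.lock = threading.Lock()
        self.peers = []  # the other CPU-specific allocators

    def _try_allocate(self, size):
        # Executes the allocation under this allocator's own lock only.
        with self.lock:
            if size > self.free_bytes:
                return None
            self.free_bytes -= size
            # The fulfilling allocator becomes the owner of the block.
            return {"size": size, "owner": self}

    def allocate(self, size):
        block = self._try_allocate(size)
        if block is not None:
            return block
        # The own lock has been released at this point. Each peer is asked to
        # execute the allocation under its lock, so no thread ever holds two
        # allocator locks at the same time.
        for peer in self.peers:
            block = peer._try_allocate(size)
            if block is not None:
                return block
        raise OutOfMemory(f"cannot allocate {size} bytes on any CPU")


def make_allocators(count, capacity):
    allocators = [CpuAllocator(i, capacity) for i in range(count)]
    for allocator in allocators:
        allocator.peers = [a for a in allocators if a is not allocator]
    return allocators


allocators = make_allocators(3, capacity=100)
first = allocators[0].allocate(80)     # fulfilled by allocator 0 itself
borrowed = allocators[0].allocate(80)  # allocator 0 has 20 bytes left
```

In this sketch the second request is fulfilled by allocator 1, which is recorded as the owner of the block, so a later deallocation is routed back to allocator 1 rather than to the allocator that issued the request.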
Memory deallocation. The memory management subsystem provides an efficient way to determine the allocator 210i . . . n which allocated a given memory block (i.e., the allocating allocator 210i . . . n is associated with the memory block). This can be realized, for example, by storing a pointer to the allocating allocator 210i . . . n inside of the memory block header before the user-specific data (the usual method of storing per-block metadata in memory allocators). In some implementations, out-of-block metadata storage can be used to limit the possibility that internal allocator structures are overwritten/destroyed due to invalid use of memory (e.g., overwriting memory before or past the actual user data block returned by the allocator and the like).
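The two metadata placements can be contrasted with a small simulation. The sketch below rests on assumptions: memory is modeled as a bytearray, offsets stand in for pointers, an allocator is identified by an integer id rather than a real pointer, and the 8-byte header layout, the Arena class and the side_table dictionary are hypothetical choices made only for illustration.

```python
import struct

# Assumed header layout: (allocator id, payload size), two unsigned 32-bit ints.
HEADER = struct.Struct("<II")


class Arena:
    """Hypothetical bump-pointer arena that prefixes every block with a header."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.top = 0

    def allocate(self, allocator_id, size):
        start = self.top
        end = start + HEADER.size + size
        if end > len(self.memory):
            return None
        # In-block metadata: the header sits directly before the user data.
        HEADER.pack_into(self.memory, start, allocator_id, size)
        self.top = end
        return start + HEADER.size  # offset of the user data given to the caller

    def owner_of(self, user_offset):
        # Step back from the user data to the header to find the allocator.
        allocator_id, _size = HEADER.unpack_from(
            self.memory, user_offset - HEADER.size
        )
        return allocator_id


arena = Arena(256)
side_table = {}  # out-of-block metadata: user offset -> allocator id

p = arena.allocate(allocator_id=7, size=16)
side_table[p] = 7
q = arena.allocate(allocator_id=3, size=16)
side_table[q] = 3
owners_before = (arena.owner_of(p), arena.owner_of(q))

# Simulated invalid use: a write that lands in the 8 bytes before q's user data.
arena.memory[q - HEADER.size:q] = b"\xff" * HEADER.size
owner_after_in_block = arena.owner_of(q)  # header is corrupted
owner_after_side_table = side_table[q]    # side table is unaffected
```

The header lookup is a constant-time step back from the user data, which is why it is the usual choice. The side table costs an extra lookup but survives the stray write that destroys the in-block header.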
When a deallocation is requested, there are two possibilities. In a first case, the memory block to deallocate was allocated by the same CPU-specific allocator 210i . . . n and so this memory block can, for example, be directly deallocated under the lock of this CPU-specific allocator 210i . . . n. Alternatively, the deallocated memory can be put on a free list (which is used to collect deallocated memory for delayed deallocation from other CPUs 220i . . . n into this allocator). In a second case, the memory block was allocated by a different CPU-specific allocator 210i . . . n (i.e., on a different CPU 220i . . . n). In this situation, the memory block is not directly deallocated, but it can be put into a lock-free freelist of “delayed deallocate” blocks (attached to the CPU-specific allocator 210i . . . n which allocated this memory block). Thus, no locking takes place during deallocation in this case.
The actual deallocation (“cleaning up”) of blocks of a particular CPU-specific allocator 210i . . . n can be done when the next allocation is requested in this CPU-specific allocator 210i . . . n (see memory allocation above), or when a garbage collection is explicitly requested (which also locks the allocator). As the memory blocks are usually short-lived and are deallocated by the same thread that allocated the block, the probability of deallocating on the same CPU 220i . . . n is very high (threads are only seldom migrated between CPUs 220i . . . n), and L2-cache collisions on the CPU-specific allocator freelist are also negligible.
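The two deallocation cases and the delayed cleanup can be sketched together as follows. The sketch makes several assumptions: the names CpuAllocator, deallocate, collect and delayed are hypothetical, allocators track a byte budget rather than real memory, and collections.deque is used as a stand-in for the lock-free freelist because its appends and pops are thread-safe in Python. A native implementation would more likely use an atomic compare-and-swap list.

```python
import threading
from collections import deque


class CpuAllocator:
    def __init__(self, cpu_id, capacity):
        self.cpu_id = cpu_id
        self.free_bytes = capacity
        self.lock = threading.Lock()
        # Stand-in for the lock-free freelist of "delayed deallocate" blocks.
        self.delayed = deque()

    def _drain_delayed(self):
        # Must be called with self.lock held: performs the actual cleanup.
        while True:
            try:
                block = self.delayed.popleft()
            except IndexError:
                return
            self.free_bytes += block["size"]

    def allocate(self, size):
        with self.lock:
            self._drain_delayed()  # cleanup happens on the next allocation
            if size > self.free_bytes:
                return None
            self.free_bytes -= size
            return {"size": size, "owner": self}

    def collect(self):
        # Explicitly requested cleanup, which also locks the allocator.
        with self.lock:
            self._drain_delayed()


def deallocate(block, current):
    """Free `block` from a thread running on the CPU served by `current`."""
    owner = block["owner"]
    if owner is current:
        # First case: same allocator, deallocate directly under its lock.
        with owner.lock:
            owner.free_bytes += block["size"]
    else:
        # Second case: different allocator, defer without taking any lock.
        owner.delayed.append(block)


cpu0, cpu1 = CpuAllocator(0, 100), CpuAllocator(1, 100)
a = cpu0.allocate(60)
deallocate(a, current=cpu1)          # freed on another CPU, so it is deferred
free_before_cleanup = cpu0.free_bytes
b = cpu0.allocate(50)                # drains the delayed list first
c = cpu0.allocate(30)
deallocate(c, current=cpu0)          # freed on the owning CPU, done directly
```

Without the drain at the start of allocate, the request for 50 bytes would fail in this example, since only 40 bytes are visible until the deferred block is reclaimed.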
The allocators 210i . . . n can optimally support allocation and deallocation of memory blocks of various sizes. Allocator techniques such as Doug Lea allocators, slab allocators, buddy allocators, and the like can be used.
Aspects of the subject matter described herein may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
The subject matter described herein may be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the subject matter described herein), or any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Although a few variations have been described in detail above, other modifications are possible. For example, the logic flow depicted in the accompanying figures and described herein does not require the particular order shown, or sequential order, to achieve desirable results. Other embodiments may be within the scope of the following claims.