The present invention is generally related to synchronizing access to shared data in a multicore processor environment. More particularly, the present invention is directed to the use of single-threaded (i.e., uniprocessor) legacy code in a multicore processor environment.
Software that is designed to run on multicore and manycore processors (also known as Chip Multiprocessors—CMP) must be explicitly structured for correctness in the presence of concurrent execution. Most multicore processors today support coherent shared memory (known as SMP—Symmetric Multi-Processing) which allows multiple threads of execution, running on separate cores (on potentially separate processor packages), to access the same physical memory space. Coherency, meaning that a consistent view of memory is observed by all cores, is achieved typically through hardware coherency protocols at the cache level (e.g., Modified Exclusive Shared Invalid (MESI)).
An important element of correctness within an SMP environment is ensuring that accesses to data are serialized in order to ensure atomicity in data writes. For example, given Thread A (running on core 0) is writing a 64-bit integer (e.g., variable v) in memory (on a 32-bit machine), two memory operations/transactions on the memory controller, are needed to achieve the goal. Without correct synchronization, a Thread B might read the first half of the memory location before Thread A has completed writing the second half results in an inconsistent and incorrect result. To avoid this problem, read and write access to variable ‘v’ should be synchronized through some form of lock or mutual exclusion mechanism (e.g., spinlock, mutex, semaphore) that can be realized on the specific processor.
However, due to the relatively recent advent of multicore processors, legacy library code is usually not “thread safe.” This is because legacy library code was often originally designed to execute only in a uniprocessor environment. There are several possible solutions to the reuse of uniprocessor code in a multicore processor environment. These include: 1) rewriting legacy code, which has the disadvantages of requiring a time consuming process in which the source code may not be available; and 2) placing a lock around every legacy/library call to serialize the execution of every call, even those that cannot cause race conditions.
The present invention is directed to using a page-fault mechanism to safely manage access to shared data across multiple concurrent threads of an SMP environment. An exemplary application is synchronizing access to shared data in a legacy library, where data that is stored and manipulated in a region of memory that has an unknown (opaque) structure although the size of the region is known. The system's memory page-fault handling mechanism is used to transparently synchronize access to shared (heap) data at the page level. In one implementation the system's page-fault handler only allows serialized access to any heap and global data in a legacy library. Threads running on separate cores that attempt to access potentially-shared data, while ownership is already taken by other threads, will wait on a lock within the page-fault handler until the owning thread has given up (yielded) control. Once the lock has been yielded, one of the threads waiting for access to the shared data is then released.
The present invention is generally directed to an apparatus, system, method and computer program product to safely share complex data across multiple concurrent threads in a multicore processing environment without explicit placement and use of conventional locks. An exemplary application of the present invention is synchronizing access to shared data in a legacy library. Referring to
In one embodiment the OS must define a Potentially Shared Data (PSD) region. For global data, a loader will find the memory region of data and bss section when loading the library. For heap data, the original ‘malloc’ calls in the library are redirected to a specialized form of ‘malloc’, which herein we term ‘shmalloc’. Shmalloc can allocate memory from a special shared memory region that is defined by OS. Page faults associated with shared memory are differentiated by the OS page-fault handler by examination of the faulting address.
Referring to
Embodiments of the present invention are directed to eliminating the requirement for conventional data-specific locks, which have problems dealing with PSD. A page-based approach is used in which the memory page-fault handling mechanism (which is a part of existing processors and operating systems) is adapted to transparently synchronize access to shared data at the page level. Accesses to Potentially Shared Data (e.g., PSD) are synchronized by using page-fault mechanisms to serialize access. As a result, one application of the present invention is that the page-based approach may be used to safely share shared memory across multiple processors where either 1) the structure of the shared data is unknown; or 2) the code that accesses the data cannot be directly modified (e.g., reusable library).
In one embodiment of the invention, a system's page-fault handler only allows serialized access to any memory page that makes up the memory store for PSD. Threads running on separate cores that attempt to access PSD, while ownership is already taken by other threads, will wait on a lock within the page-fault handler until the owning thread has given up (yielded) control. Once the control has been yielded, a thread waiting for access to the shared data is then released.
As illustrative examples, the processor H/W Page-Fault (PF) mechanisms may be based on conventional general purpose processors such as those found on the Intel x86®, ARM® and other general purpose processors adapted to schedule serialized access to the complex shared data. While a thread is already writing to a memory page within the memory holding the shared data, all other threads are held on a lock which we term the “PSD lock”. This is a logical lock for all pages that belong to the same shared memory area. One aspect of the present invention is that once a thread has taken ownership of a PSD lock, all other threads can be forced to page-fault on an attempted access and page making up the PSD. As described below in more detail, the scheduling may be implemented in different ways. One implementation replicates the page directory and page tables. A different present bit may be presented to individual threads to force a page-fault, where the present bit is a feature of Intel-based processors. In an alternate implementation, processor hardware is used to support software-managed Translation Lookaside Buffer (TLB caches) and the ability to explicitly load individual TLB entries. In this implementation codified hardware is used to load an inconsistent TLB entry into the owning core (and thus avoiding the owning thread faulting on accesses to the page that it owns).
In the present invention, it is necessary for a thread to explicitly yield access to a PSD region. A preferred embodiment could use the library function call return point to “hook” in an explicit yield operation.
A preferred embodiment is based on duplication of page table entries. Referring to
Referring to
The replication and separation approaches allows threads within the same shared memory process to have different page table entries for the same region of physical memory and thus have different Page Table Entry (PTE) status bits (e.g., reserved, present). The PTE status bits are used to force a thread accessing memory to trigger page-faults PSD and hence trap into the page-fault handler. Depending on the underlying hardware architecture, clearing the P (present) bit or setting an R (reserved) bit will ensure that a page-fault is generated on access to the respective memory. The use of an R-bit is preferred since this is not normally used for other purposes (e.g., the P-bit is used by the OS to manage paging). Herein we refer to PTE bit that is used to set or clear page-fault triggering the PFT status bit.
As previously described, both page directories and page tables are replicated for shared data. Page directories may be replicated within the operating system kernel during thread creation. The replication process copies an existing page directory from some other thread belonging to the process. All page directory entries are copied (i.e. the entry is duplicated using the same page table target) except those that are marked as pointing to a Page Table that contains a PTE (Page Table Entry) pointing to a shared memory page. In an Intel ia32 embodiment, one option is using one of the “ignored bits”, e.g., bit 9, in the PDE (Page Directory Entry) to indicate the presence of complex shared data in the target page table—herein we refer to this as the “PSD Page, PSDP-bit”. The PSDP-bit is set when a PTE is created for shared data. During the replication process, any page table that is marked as shared (via the PSDP-bit in the corresponding PDE) demands that a distinct (whole) copy of the page table is made in memory.
A preferred embodiment of the invention organizes virtual memory to separate complex shared data from non-shared data. That is, a page table will only contain pages that are shared or not shared, but not both. Referring to
Referring to
In this embodiment, the owning thread reads/writes the page using an inconsistent local TLB cache entry 610. The entry is inconsistent in that it does not match the state of the page table entry in main memory 605. Thus, other (non-owning) threads will page-fault when trying to access the same page, and thus can be synchronized in the page-fault handler as previously described in the primary embodiment, using the page table directory 615 and the PTE 620 PFT-bit 622. During ownership of the page, the owning thread can access the page using the TLB entry whilst the main memory page table entry 620 is set as not-present. The side-effect of the inconsistent TLB entry is that the owning threads will not page-fault for each read/write in the batch. The software-managed TLB capability allows the solution to directly upload entries into the TLB cache and thus achieve realize this inconsistency. When ownership is given up (yielded) the system must ensure that the local TLB entry for the respective PSD is also cleared.
While the present invention has been described in conjunction with specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the description, specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to avoid unnecessarily obscuring the invention.
In accordance with the present invention, the components, process steps, and/or data structures may be implemented using various types of operating systems, programming languages, computing platforms, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herein. The present invention may also be tangibly embodied as a set of computer instructions stored on a computer readable medium, such as a memory device. In particular, methods of the present invention may be implemented as computer instructions stored on a computer readable medium. Moreover, as indicated by the previous discussion, certain aspects of the present invention may be implemented using computer program code stored in the main memory of a multicore processor system and executable by the operating system. Other features may be implemented by individual processing threads of individual processor cores.
The various aspects, features, embodiments or implementations of the invention described above can be used alone or in various combinations. The many features and advantages of the present invention are apparent from the written description and, thus, it is intended by the appended claims to cover all such features and advantages of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, the invention should not be limited to the exact construction and operation as illustrated and described. Hence, all suitable modifications and equivalents may be resorted to as falling within the scope of the invention.
The present application claims the benefit of U.S. provisional App. Ser. No. 61/523,231, filed on Aug. 12, 2011, the contents of which are hereby incorporated by reference.
Number | Date | Country | |
---|---|---|---|
61523231 | Aug 2011 | US |