1. Field of the Invention
This disclosure relates to computing systems and more particularly to transactional memory operations in computing systems.
2. Description of the Related Art
Recent developments in computing have exploited parallelism, enabling faster computational processes. For example, a processor may include multiple processing cores that each execute instructions in parallel. However, the cores can at times "compete" for control of resources (e.g., a shared memory). Accordingly, programmers can use synchronization mechanisms to coordinate access to shared resources. However, the synchronization mechanisms often operate by serializing access to resources, which reduces the level of parallelism.
In at least one embodiment, a method includes determining whether to elide a lock operation based on success or failure of one or more previous transactional memory operations associated with one or more respective previous lock elisions. In at least one embodiment of the method, the lock operation is associated with a first access of a shared resource and the one or more previous lock elisions are associated with one or more respective previous accesses of the shared resource.
In at least one embodiment, an apparatus includes a plurality of processing cores. The plurality of processing cores includes at least a first processing core configured to determine whether to elide a lock operation based on success or failure of one or more previous transactional memory operations associated with one or more respective previous lock elisions. In at least one embodiment of the apparatus, the first processing core includes transactional memory logic configured to determine success or failure of transactional memory operations.
In at least one embodiment, a non-transitory computer-readable medium encodes instructions to cause a processor to determine whether to elide a lock operation based on success or failure of one or more previous transactional memory operations associated with one or more respective previous lock elisions. In at least one embodiment of the non-transitory computer-readable medium, the instructions are encoded in a shared library.
The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings.
The use of the same reference symbols in different drawings indicates similar or identical items.
Cores 102 and 104 may execute instructions, such as instructions of threads 116 and 126, respectively. Execution of such instructions may generally occur in parallel between cores 102 and 104. However, if a thread contains one or more code sections (e.g., code section 118 or 128) that access a shared resource (such as shared resource 106), such code sections may be executed in conjunction with a lock call (e.g., lock calls 130 and 134, respectively). An example of such a code section is a "critical section" that modifies a memory location at shared resource 106. Other such critical sections used in connection with lock calls are known in the art.
To further illustrate, if thread 116 is to execute code section 118 that accesses shared resource 106, thread 116 can acquire a lock to shared resource 106 using lock call 130 (temporarily excluding thread 126 from accessing shared resource 106), execute code section 118, and then release the lock once execution of code section 118 has completed (e.g., using a suitable unlock call). Such lock operations generally reduce or eliminate the possibility of inter-thread conflicts associated with concurrent accesses of shared resources at the expense of serializing access to shared resources.
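Purely as an illustration of this pattern, the following C sketch guards a critical section with a POSIX mutex; the mutex and the counter standing in for a location at shared resource 106 are hypothetical names, not items from the figures.

    #include <pthread.h>

    static pthread_mutex_t resource_lock = PTHREAD_MUTEX_INITIALIZER;  /* hypothetical lock */
    static long shared_counter;   /* stands in for a memory location at shared resource 106 */

    void update_shared(void)
    {
        pthread_mutex_lock(&resource_lock);    /* lock call: temporarily excludes other threads */
        shared_counter++;                      /* critical section modifying the shared location */
        pthread_mutex_unlock(&resource_lock);  /* release the lock once the section completes */
    }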
However, not all concurrent accesses to a shared resource necessarily conflict. For example, a conditional store instruction that does not successfully execute may not cause an actual conflict. As another example, accessing different fields of a shared resource may not cause an actual conflict. Accordingly, performing a lock operation for every such access can waste cycles by serializing access to shared resources even when parallel access would not cause an actual conflict.
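To make the second example concrete, a hypothetical C sketch: two threads that only ever touch different fields of a shared structure never actually conflict, yet a single lock guarding the whole structure would serialize their updates.

    struct shared_resource {
        long field_a;    /* only ever written by thread 116 */
        long field_b;    /* only ever written by thread 126 */
    };

    static struct shared_resource res;

    /* Guarding the whole structure with one lock serializes these updates
     * even though they never touch the same field, i.e. never conflict. */
    void update_a(void) { res.field_a++; }
    void update_b(void) { res.field_b++; }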
Accordingly, cores 102 and 104 may execute instructions contained in libraries 112 and 122 (which can be loaded from an external source, such as shared resource 106). Libraries 112 and 122 may each be a shared library that is dynamically linked at runtime. In at least one embodiment, libraries 112 and 122 include software routines 114 and 124 (i.e., processor-usable instructions stored on a tangible computer-readable medium), respectively, which include instructions executable by a processor to intercept lock calls (e.g., lock calls 130 and 134) and to determine whether to elide the lock calls. As used herein, "eliding" a lock call and "elision" of a lock call refer to attempting one or more transactional memory operations (e.g., a load/store operation that succeeds or fails as a single atomic operation) in response to detecting the lock call. In at least one embodiment, a determination whether to elide the lock call is based on success or failure of one or more previous transactional memory operations associated with one or more respective previous lock elisions corresponding to accesses of shared resource 106.
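One common way to arrange such interception on systems with dynamic linking is symbol interposition (e.g., an LD_PRELOAD-style wrapper around pthread_mutex_lock). The following C sketch assumes that mechanism; elide_lock() is a hypothetical helper standing in for the decision logic of software routines 114 and 124, not a call from any real library.

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>

    /* Hypothetical helper: decides whether to elide and, if so, starts a transactional
     * memory operation; returns nonzero when the lock call was elided. */
    int elide_lock(pthread_mutex_t *m);

    static int (*real_mutex_lock)(pthread_mutex_t *);

    int pthread_mutex_lock(pthread_mutex_t *m)
    {
        if (!real_mutex_lock)   /* lazy lookup of the real implementation (kept simple here) */
            real_mutex_lock = (int (*)(pthread_mutex_t *))dlsym(RTLD_NEXT, "pthread_mutex_lock");

        if (elide_lock(m))
            return 0;               /* lock call elided: critical section runs transactionally */

        return real_mutex_lock(m);  /* otherwise perform the actual lock operation */
    }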
As will be appreciated, if such predictions are made too “aggressively,” performance of system 100 can be degraded by having to roll back the transactional memory operations (e.g., where a conflict is not predicted but occurs), wasting cycles due to aborts and retries. If such predictions are made too “conservatively,” then unnecessary lock operations may be performed (e.g., where a conflict is predicted but would not have occurred), bottlenecking resources and reducing system performance. Accordingly, in at least one embodiment and as described further below, software routines 114 and 124 are “adaptive” software routines that can cause elision to occur more often or less often based on success or failure of one or more previous lock elisions, predicted success of a prospective lock elision, or a combination thereof.
Cores 102 and 104 include synchronization logic 108 and 120, respectively, which may include transactional memory hardware. For example, in at least one embodiment, synchronization logic 108 and 120 include a set of hardware primitives that enable atomic operations on memory locations (e.g., a memory location at shared resource 106). One of skill in the art can use such primitives to build higher-level synchronization mechanisms. Some transactional memory apparatuses and methods are described in U.S. Patent Publication No. 2011/0208921, entitled "Inverted Default Semantics for In-Speculative-Region Memory Accesses," naming as inventors Martin T. Pohlack, Michael P. Hohmuth, Stephan Diestelhorst, David S. Christie, and Jaewoong Chung, which is incorporated by reference herein in its entirety.
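As a sketch only, the following C fragment shows how such primitives can be combined into a lock-elision attempt. Intel RTM intrinsics (_xbegin, _xabort, _xtest, _xend) are used purely as a stand-in for the transactional memory hardware described herein, and simple_lock is a hypothetical lock type.

    #include <immintrin.h>      /* RTM intrinsics, used here only as a stand-in */

    typedef struct { volatile int held; } simple_lock;   /* hypothetical lock type */

    /* Returns 1 if a transaction was started and the critical section may run
     * transactionally; returns 0 if the caller should fall back to the lock. */
    static int begin_elided(simple_lock *l)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (!l->held)        /* reading the lock word adds it to the read set,  */
                return 1;        /* so a thread that later takes the lock aborts us */
            _xabort(0xff);       /* lock already held: abort and fall back          */
        }
        return 0;                /* transaction aborted or could not start */
    }

    static void end_elided(void)
    {
        if (_xtest())            /* commit only if still inside a transaction */
            _xend();
    }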
State machines 202 and 206 may indicate respective levels (e.g., levels 230 and 234), which may correspond to respective states of state machines 202 and 206. In addition, and as explained further below, software routine 114 may include instructions for maintaining one or more variables each associated with a state machine. For example, in the embodiment of
Referring to
As shown in
In at least one embodiment, in response to initial failure of the transactional memory operation, one or more retries are performed prior to aborting. The number of retries may correspond to the level of state diagram 300 (e.g., a higher level may correspond to fewer retries, since the likelihood of success may be lower). If after the retries (if any) the transactional memory operation is determined (e.g., by synchronization logic 108 of
Traversal of other levels of state diagram 300 may operate similarly, with successes at level 0 and failures at level n (i.e., “TX success” and “TX failed” paths, respectively) causing no change in level. After execution of the code section, thread 116 can continue operation (e.g., execution of instructions).
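The following C sketch illustrates one possible retry policy of this kind: the retry count shrinks as the level rises, a committed transaction moves the level toward level 0, and exhausting the retries moves it toward level n. MAX_LEVEL, retries_for_level(), and try_transaction() are assumed names, not taken from the figures.

    enum { MAX_LEVEL = 4 };      /* assumed value of "level n" in the state diagram */

    static int retries_for_level(int level)
    {
        return MAX_LEVEL - level;            /* higher level -> fewer retries */
    }

    /* Returns 1 if the transactional memory operation eventually committed,
     * 0 if the caller should fall back to the lock operation. */
    static int attempt_with_retries(int *level, int (*try_transaction)(void))
    {
        int tries = 1 + retries_for_level(*level);   /* initial attempt plus retries */
        while (tries-- > 0) {
            if (try_transaction()) {
                if (*level > 0) (*level)--;  /* "TX success": move toward level 0 */
                return 1;
            }
        }
        if (*level < MAX_LEVEL) (*level)++;  /* "TX failed": move toward level n */
        return 0;
    }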
In at least the embodiment of state diagram 300 depicted in
In at least one embodiment, state diagram 300 illustrates "adaptive" operation of state machine 202 by indicating a percentage of times that lock calls are elided (e.g., a frequency of lock elisions relative to lock operations performed). For example, for each level k, the corresponding percentage of lock elisions may be given by (1/2)^k (i.e., lock elisions made approximately 100%, 50%, and 25% of the time for levels 0, 1, and 2, respectively). Eliding a lock call, even when state diagram 300 indicates that a transactional memory operation is likely to fail, can ensure that state machine 202 "adapts" to current conditions. To illustrate, suppose level 230 of state machine 202 corresponds to level 2 of state diagram 300, which may in turn indicate that a transactional memory operation attempt is likely to fail. The transactional memory operation may still be attempted for a certain percentage of lock calls (e.g., 1 of 4, or 25% of the time) such that if current conditions have changed to make elision more favorable (e.g., thread 126 is idle), then elisions can be performed and level 230 of state machine 202 will "adaptively" change even though level 230 indicates that elision is not currently likely to succeed. Further, by eliding even when level 230 of state machine 202 indicates a low or zero percent chance to elide (or a low or zero percent likelihood of success), level 230 of state machine 202 will not unnecessarily remain "stuck."
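A hypothetical C sketch of this (1/2)^k elision frequency, using a simple per-level call counter (the counter itself is an assumed detail):

    static unsigned calls_since_elision;    /* assumed per-state-machine counter */

    /* At level k, elide roughly one lock call in every 2^k, i.e. with frequency (1/2)^k. */
    static int should_elide(int level)
    {
        unsigned period = 1u << level;      /* 1, 2, 4, ... for levels 0, 1, 2, ... */
        if (++calls_since_elision >= period) {
            calls_since_elision = 0;
            return 1;                       /* attempt elision on this lock call */
        }
        return 0;                           /* perform the lock operation instead */
    }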
In at least one embodiment, a variable is defined (e.g., using setup code 214 of
Because a lower level of state diagram 300 generally corresponds to increased potential for parallel execution of code sections and therefore enhanced performance, approaches for reducing level 230 of state machine 202 are contemplated. According to a first approach, a next lock call is elided after a level reduction. For example, suppose a lock call is encountered while level 230 of state machine 202 corresponds to level 2 and the lock call is elided. If the corresponding transactional memory operation is performed successfully, level 230 of state machine 202 is accordingly changed to correspond to level 1. According to the first approach, a transactional memory operation would then be attempted in response to a subsequent lock call, irrespective of the level 230 of state machine 202. According to a second approach, following a successful elision, a number of lock calls are performed prior to eliding another lock call. The number of lock calls may be level specific (e.g., for higher levels, more lock operations may be performed than for lower levels prior to eliding a lock call). As will be appreciated, the first approach may be able to adapt more quickly to environment changes but might be more likely to “overshoot” (e.g., mis-speculate), while the second approach may be “safer” by allowing level 230 of state machine 202 to “settle” prior to eliding subsequent lock calls. In at least one embodiment, access code 222 of
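The following hypothetical C sketch contrasts the two approaches; the per-state-machine variable and the level-specific count are assumed details.

    static int lock_calls_before_next_elision;   /* assumed per-state-machine variable */

    static void after_successful_elision(int *level, int use_second_approach)
    {
        if (*level > 0)
            (*level)--;                                   /* successful elision lowers the level */

        if (use_second_approach)
            lock_calls_before_next_elision = 2 * *level;  /* assumed level-specific count: let the level "settle" */
        else
            lock_calls_before_next_elision = 0;           /* first approach: elide the very next lock call */
    }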
In at least one embodiment, software routine 114 includes state machines to track successes and failures of elisions for each lock (e.g., each mutex) and for each thread. Further, such data may be stored in a hash table indexed by the address of each mutex (e.g., in a hash table stored in a cache included in core 102). According to a particular illustrative embodiment, setup code 214 of
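As an assumed illustration of such per-mutex, per-thread tracking, the following C sketch keeps a small thread-local hash table indexed by mutex address; the table size, hash function, and the fact that collisions simply share an entry are hypothetical simplifications.

    #include <pthread.h>
    #include <stdint.h>

    #define TABLE_SIZE 256           /* assumed; a power of two keeps indexing cheap */

    struct elision_state {
        int level;                   /* current level of the per-mutex state machine */
        unsigned count;              /* e.g., lock calls or failures since the last change */
    };

    /* One table per thread, indexed by the address of the mutex. Two mutexes
     * that hash to the same slot simply share an entry in this sketch. */
    static _Thread_local struct elision_state table[TABLE_SIZE];

    static struct elision_state *state_for(pthread_mutex_t *m)
    {
        uintptr_t addr = (uintptr_t)m;
        return &table[(addr >> 4) % TABLE_SIZE];   /* drop low (alignment) bits, then index */
    }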
Referring to
As shown in
Referring to
If at 520 a determination is made to elide the lock call, flow diagram 500 includes attempting a transactional memory operation, at 524. If at 528 the transactional memory operation is successful, a level (e.g., level 230) of a state machine (e.g., state machine 202) is decremented, at 532. A count (e.g., variable 240) may be reset, also at 532, as described with reference to
If at 528 the transactional memory operation is not successful, flow diagram 500 may include determining whether to retry the transactional memory operation, at 540. If no retries are to be made, then flow diagram 500 continues by incrementing the level of the state machine and resetting the count, at 548. Flow diagram 500 then includes performing the lock operation (“falling back” to the lock operation), at 552.
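Pulling these pieces together, the following hypothetical C sketch follows the flow just described, with reference numerals 520 through 552 noted in comments; the helper names are assumed, as in the earlier sketches.

    #include <pthread.h>

    enum { MAX_LEVEL = 4 };                          /* assumed, as above */

    struct elision_state { int level; unsigned count; };

    /* Assumed helpers, as sketched earlier: */
    struct elision_state *state_for(pthread_mutex_t *m);  /* per-mutex, per-thread state */
    int should_elide(int level);                           /* (1/2)^level decision        */
    int retries_for_level(int level);                      /* fewer retries at higher levels */
    int try_transaction(pthread_mutex_t *m);               /* one transactional attempt   */

    void elided_lock(pthread_mutex_t *m)
    {
        struct elision_state *s = state_for(m);

        if (should_elide(s->level)) {                /* 520: elide this lock call?  */
            int tries = 1 + retries_for_level(s->level);
            while (tries-- > 0) {
                if (try_transaction(m)) {            /* 524/528: attempt succeeded  */
                    if (s->level > 0) s->level--;    /* 532: decrement the level    */
                    s->count = 0;                    /* 532: reset the count        */
                    return;                          /* critical section runs transactionally */
                }
            }                                        /* 540: retries exhausted      */
            if (s->level < MAX_LEVEL) s->level++;    /* 548: increment the level    */
            s->count = 0;                            /* 548: reset the count        */
        }
        pthread_mutex_lock(m);                       /* 552: fall back to the lock operation */
    }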
Various structures described herein may be implemented using instructions executing on a processor or by a combination of such instructions and hardware. Instructions may be encoded in at least one tangible (i.e., non-transitory) computer-readable medium that can be read by a processor. As referred to herein, a tangible computer-readable medium includes at least a disk, tape, or other magnetic, optical, or electronic storage medium. In addition, the computer-readable media may store data as well as instructions. In at least one embodiment, a non-transitory computer-readable medium encodes instructions to cause (e.g., instructions executable by) a processor to determine whether to elide a lock operation based on success or failure of one or more previous transactional memory operations associated with one or more respective previous lock elisions, and to perform other operations described herein with reference to
Further, various structures described herein may be embodied in computer-readable descriptive form suitable for use in subsequent design, simulation, test or fabrication stages. Various embodiments are contemplated to include circuits, systems of circuits, related methods, and one or more tangible computer-readable media having encodings thereon (e.g., VHSIC Hardware Description Language (VHDL), Verilog, GDSII data, Electronic Design Interchange Format (EDIF), and/or a Gerber file) of such circuits, systems, and methods, all as described herein, and as defined in the appended claims.
The description set forth herein is illustrative, and is not intended to limit the scope set forth in the following claims. For example, structures and functionality presented as discrete components in the exemplary configurations may be implemented as a combined structure or component, and vice versa. As another example, while state machines 202 and 206 of