The use of atomic sections to synchronize shared memory accesses between multiple threads is an alternative to lock-based programming. Atomic sections raise the level of abstraction for a programmer. Using atomic sections, the programmer does not correlate shared data with a protecting lock. Consequently, with locks gone, deadlocks are gone too. A software transactional memory (STM) library implements atomic sections with synchronization between threads in order to maintain correctness of the data, and achieve a high degree of runtime performance. A runtime instance of an atomic section is usually referred to as a transaction. Due to memory conflicts, the transactions may abort, which reduces the efficiency of concurrent systems, and increases computational expenses. In the absence of conflicts, transactions successfully commit by exposing changes to other threads. To reduce aborts, an STM has the flexibility to choose from multiple policies governing conflict detection and resolution. These policies specify when a transaction takes exclusive control of a memory address. An eager policy detects conflicts early, usually by trying to acquire a lock on encountering a store to a shared memory location. While this policy has the advantage of detecting doomed transactions early, it results in holding locks for a longer duration, potentially reducing concurrency. On the other hand, a lazy policy detects conflicts late, usually by trying to acquire locks at commit time. While this policy has the advantage of holding locks for a shorter duration, it may result in wasted work since doomed transactions are detected late.
Certain exemplary embodiments are described in the following detailed description and in reference to the drawings, in which:
The atomic section 102 is a block of code that appears to execute in an indivisible manner. The STM applications 116 are multi-threaded software applications that use shared memory locations 106. Each STM application 116 includes one or more atomic sections 102.
A single transaction 104 is a dynamic execution instance of a compiled atomic section 102. In STM systems, functions and references can be transactional. A function is transactional if it can be transitively called from an atomic section. A reference is transactional if it executes under the control of a transactional memory system.
The SR 118 is a transactional load or store in the intermediate representation (IR) of an application 116. The intermediate representation of an application 116 refers to a translation of a program that is usable by a compiler for code generation. As referred to herein, a read or write memory reference is transactional, unless specified otherwise.
STM systems execute transactions 104, all of which use a shared memory, concurrently. Aborts may arise from memory conflicts between different transactions 104. Memory conflicts occur when two different transactions 104 reference, or appear to reference, the same shared memory location 106, and at least one of the references is a write.
A transaction 104 is implemented using the ORECs 108, policies 110, victims 112, and aborters 114. For every shared memory location 106 that the transaction 104 updates, a corresponding OREC 108 is acquired. In an embodiment, an OREC may just be a lock. The OREC 108 gives the transaction 104 exclusive control of the shared memory location 106. The transaction 104 may acquire the OREC 108 for a shared memory location 106 at a time specified by either an eager or lazy ownership acquire policy 110.
An abort involves two memory references: the victim 112 of the abort and the aborter 114. Typically, a transaction 104 aborts because it cannot complete a load or store memory reference. In such a case, the memory reference in the transaction 104 that cannot complete is the victim 112. Further, the aborter 114 is the memory reference in another, different transaction that prevented the victim 112 from proceeding.
A memory conflict may be detected, depending on the policy 110, either early or late during execution of the transaction 104. If the policy 110 is eager, the memory conflict is detected early. If the policy 110 is lazy, the memory conflict is detected late.
For a given atomic section 102, it is hard to tell in advance which policy performs better. With an eager policy, wasted work is avoided if the transaction 104 will abort, but on the downside, the lock is held longer, which potentially reduces concurrency. On the other hand, using a lazy policy delays lock acquisition, thereby producing a small contention window at commit time alone. However, the lazy policy can result in a lot of wasted work if the transaction 104 aborts.
Many STMs detect both read-write and write-write conflicts using the same policy. Some other STMs use mixed invalidation, whereby write-write conflicts are detected eagerly, and read-write conflicts are detected lazily. In one embodiment, a read transactional reference is handled the same way regardless of the policy. For a write reference, an eager policy indicates that the OREC 108 is acquired when the write is encountered at runtime. Under the eager policy, the conflict, if any, is detected when trying to acquire the OREC. A lazy policy indicates that the OREC 108 is acquired in the commit phase. According to this policy, any conflict is detected during the commit phase, i.e., late.
Embodiments of the present techniques automatically determine ownership acquire policies for selected memory references within an STM application 116. By determining different policies at the memory reference granularity, embodiments provide an automated way to reduce the number of aborts and wasted work. Information related to contention or abort patterns among memory references in prior executions may be used to determine policies 110 for each memory reference in a subsequent execution. Further, modifications to policies 110 may be propagated throughout an application 116 in a way that reduces the number of aborts.
In the second phase, the compiler 204 uses the optimization information 214 to generate an optimized executable 216. The optimized executable 216 may use a modified policy 110 for specific memory references. The profile database 210 may include a runtime abort graph (RAG), a list of readset (Srd) and writeset (Swr) sizes for references correlated with SRs 118, a list of source locations for SRs 118, a list of source locations for atomic sections 202, and a list of application specific information.
The list of source locations for SRs captures the location information for every SR. The list of source locations for atomic sections 202 captures the location information for every atomic section 202. The application specific information may include, for example, speculative readset (Srd) and speculative writeset (Swr) sizes for references correlated with SRs 118, the speculative sizes computed using policies different from those in the RAG. The optimization information 214 may include new optimized policies for specific memory references as computed by the offline analyzer 212.
The profile database 210 may include a runtime abort graph (RAG). A RAG is a directed graph, where each node corresponds to an SR. An edge captures an abort relationship and is directed from the aborter 114 to the victim 112. As stored in the profile database 210, a node may have the following annotations: αo: an id of the dynamically outermost enclosing atomic section 202 containing the SR; SRid: an identifier of the SR; L: source code information of the SR; Srd: average readset size of the outermost enclosing transaction at the time of the abort; Swr: average writeset size of the outermost enclosing transaction at the time of the abort; AN: the total number of aborts suffered by the victim 112.
Every node in the RAG may be keyed with the duple Nk=<αo,SRid> that uniquely identifies an SR 118 in the context of the dynamically outermost enclosing atomic section 102. An edge is annotated with AE, the total number of times the source node (i.e. the aborter) aborts the target node (i.e. the victim). For a given node, AN is computed as the sum of AE over all incoming edges. It is noted that Srd and Swr are not applicable if the node is not a victim 112.
Every atomic section 102 and SR 118 is assigned a program-wide unique identifier. This is achieved by using α to uniquely identify an atomic section 102 globally, β to uniquely identify a transactional function globally, and γ to uniquely identify a memory reference within the lexical scope of a transactional function. The duple SRid=<β, γ> uniquely identifies a transactional reference within an entire application, which may include a number of transactions. The RAG may also include some source information, referred to as L=<λ,ρ,τ>, where λ is the mangled name of the caller function and ρ and τ are the line and column numbers, respectively.
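The identifier scheme above can be sketched as plain data types. The type and field names below (SRid, NodeKey, SourceLoc) are illustrative assumptions mirroring the text's notation, not part of any particular STM implementation:

```cpp
#include <cstdint>
#include <string>

// beta identifies a transactional function globally; gamma identifies a
// memory reference within that function's lexical scope.
struct SRid {
    uint32_t beta;   // transactional function id (globally unique)
    uint32_t gamma;  // memory reference id within the function
    bool operator<(const SRid& o) const {
        return beta != o.beta ? beta < o.beta : gamma < o.gamma;
    }
};

// RAG node key Nk = <alpha_o, SRid>: the SR in the context of its
// dynamically outermost enclosing atomic section.
struct NodeKey {
    uint32_t alpha_o;  // outermost enclosing atomic section id
    SRid sr;
    bool operator<(const NodeKey& o) const {
        if (alpha_o != o.alpha_o) return alpha_o < o.alpha_o;
        return sr < o.sr;
    }
};

// Source information L = <lambda, rho, tau>: mangled caller name,
// line number, and column number.
struct SourceLoc {
    std::string lambda;
    uint32_t rho;
    uint32_t tau;
};
```

The orderings make the keys usable in a sorted map, so one SR reached from two distinct outermost atomic sections yields two distinct RAG nodes.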
In embodiments, only the outermost atomic section 102 is tracked for profiling purposes. If an SR 118 is contained within more than one distinct outermost atomic sections 102 (e.g. calls to a function containing the SR 118 from two distinct atomic sections), a separate node is added to the RAG for each such instance.
Embodiments identify conflict detection policies that improve performance. Pair-wise solutions are found at the reference level. This is accomplished in the context of improving performance across the application.
The performance penalty incurred by a transactional reference is determined by the aborts it suffers and the work that is wasted due to the abort. This penalty (Csr) may be computed by defining the cost of an SR (or the corresponding RAG-node) as Csr=AN×(Srd+Swr), where AN, Srd, and Swr respectively represent the abort count, the readset size, and the writeset size of the RAG-node. The cost of a RAG-edge is computed as Ce=AE×(Srd+Swr), where AE, Srd and Swr respectively represent the abort count of the edge, the readset size, and writeset size of the target RAG-node. The total cost of the RAG, Ctot is the summation of Csr over all nodes.
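These cost definitions translate directly into code. The struct and function names below are assumptions for illustration; the fields follow the text's notation:

```cpp
#include <vector>

// Hypothetical RAG records.
struct RagNode {
    unsigned AN;   // total aborts suffered by the SR (as victim)
    unsigned Srd;  // average readset size at abort time
    unsigned Swr;  // average writeset size at abort time
};

struct RagEdge {
    unsigned AE;            // times the aborter aborted the victim
    const RagNode* victim;  // target node of the directed edge
};

// Csr = AN x (Srd + Swr)
unsigned node_cost(const RagNode& n) { return n.AN * (n.Srd + n.Swr); }

// Ce = AE x (Srd + Swr), using the target (victim) node's set sizes
unsigned edge_cost(const RagEdge& e) {
    return e.AE * (e.victim->Srd + e.victim->Swr);
}

// Ctot: summation of Csr over all nodes
unsigned total_cost(const std::vector<RagNode>& nodes) {
    unsigned c = 0;
    for (const RagNode& n : nodes) c += node_cost(n);
    return c;
}
```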
In embodiments, the SRs 118 for reads (Rd) may have a fixed policy, but the writes could follow either an eager (Wre) or lazy (Wrl) policy. Accordingly, the various abort scenarios may be represented using the following shorthand: Aborter->Victim. For example,
In
In
The scenario shown in
Each reference within a given transaction 104 may use either an eager or lazy policy, a scheme called reference-level hybridization. In such an embodiment, a compiler or programmer interface may be provided for specifying policies at the memory reference level. In one embodiment, for a store to a shared memory reference within a transaction, a call to TxStore( ) signifies transactional handling of that store. Instead of, or in addition to, a store command, e.g., TxStore( ), a new interface may specify the policy for each memory reference, e.g., by introducing TxStore_eager( ) and TxStore_lazy( ). TxStore_eager( ) indicates that the store memory reference should use the eager policy. TxStore_lazy( ) indicates that the store memory reference should use the lazy policy. Using such an interface, regardless of the default policy in use by the atomic section 102, a different policy may be used for a specific transactional memory reference. When the compiler sees the TxStore( ) command, the default policy is used for that specific reference. The compiler also uses the specified policies according to the TxStore_eager( ) or TxStore_lazy( ) commands. Because transactional reads behave the same regardless of policy, a different interface to specify read policies may not be implemented.
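One way such an interface could look is sketched below. The descriptor layout and the TxStoreWithPolicy helper are assumptions, and OREC acquisition for the eager path is elided; only the policy dispatch and buffered logging are shown:

```cpp
#include <vector>

enum class Policy { Default, Eager, Lazy };

struct WriteEntry { long* addr; long value; Policy policy; };

// Minimal transaction descriptor sketch: a default policy plus a
// buffered-update writeset (redo log), per the log policy constraint.
struct TxDesc {
    Policy default_policy = Policy::Lazy;
    std::vector<WriteEntry> writeset;
};

// Buffer the store; an eager store would also acquire the OREC for
// addr at this point (conflict detection elided in this sketch).
void TxStoreWithPolicy(TxDesc* tx, long* addr, long value, Policy p) {
    if (p == Policy::Default) p = tx->default_policy;
    tx->writeset.push_back({addr, value, p});
}

void TxStore(TxDesc* tx, long* addr, long v) {
    TxStoreWithPolicy(tx, addr, v, Policy::Default);  // atomic section default
}
void TxStore_eager(TxDesc* tx, long* addr, long v) {
    TxStoreWithPolicy(tx, addr, v, Policy::Eager);
}
void TxStore_lazy(TxDesc* tx, long* addr, long v) {
    TxStoreWithPolicy(tx, addr, v, Policy::Lazy);
}
```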
The STM follows a log policy constraint. All transactional references, regardless of policy, use buffered updates. Accordingly, both eager and lazy transactional stores use the same kind of logging.
Such an embodiment may include some constraints regarding read transactional references. Since buffered updates are used for both eager and lazy policies, each read reference checks the writeset for the most recent value written by the current transaction 104. Validation may also be performed by a read transactional reference. These changes result in the same read barrier for eager and lazy policies.
The table 400 includes four columns. The ATOMIC SECTION COMMAND 402 specifies each command line of an example atomic section 102. The VALUE OF X 404 specifies the value of a shared memory location 106, x, after the command is executed. The last two columns specify the contents of an example UNDO-LOG 406 and REDO-LOG 408, generated according to different log policies.
As shown, two updates to the shared location x occur in the example atomic section 102. A lazy policy is used for one reference, and an eager policy for the other. The UNDO-LOG 406 is maintained for eager writes and the REDO-LOG 408 is maintained for lazy writes according to their log policies. When the lazy write is executed, the entry for shared location x in the UNDO-LOG 406 is not updated. Instead, the new value is logged into the REDO-LOG 408. After executing the eager write, the OREC 108 for x is acquired and the location is directly modified. The old value of the shared location is stored in the UNDO-LOG 406. At the commit point, the STM implementation is, however, left with a problem because of the presence of both undo- and redo-log entries for shared location x. If the redo-log 408 is applied, the VALUE OF X 404 at the end of the transaction 104 will be 1, which is incorrect. The root cause is that the STM does not know the program order dependencies between entries in the UNDO-LOG 406 and REDO-LOG 408 for a given shared location. For this reason, in the absence of dependencies across entries in different logs 406, 408, the same logging policy has to be employed in a given transaction 104 in order to maintain correctness. Since a lazy transactional write does not acquire ownership of the shared datum until commit time, direct updates for such writes would introduce a data race. Hence, buffered updates are used for both lazy and eager transactional writes using a redo log.
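The problem above can be replayed in a few lines. The concrete values (x starts at 0, a lazy write stores 1, a later eager write stores 2) are assumptions chosen to match the described outcome:

```cpp
#include <map>

// Replays the mixed-log scenario: a lazy write buffers into a redo log
// while a later eager write updates the location in place (logging the
// old value into an undo log). Applying the redo log at commit clobbers
// the eager write, because the STM cannot see the program order between
// entries in the two logs.
long mixed_log_scenario() {
    long x = 0;
    std::map<long*, long> redo_log, undo_log;

    redo_log[&x] = 1;          // lazy write of 1: buffered only
    undo_log[&x] = x;          // eager write of 2: log old value...
    x = 2;                     // ...and update the location in place

    for (auto& entry : redo_log) *entry.first = entry.second;  // commit
    return x;  // 1, although program order requires the final value 2
}
```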
Consider the case when a write follows another write to the same location. When an eager write (Wre) is followed by a lazy write (Wrl), Wre will acquire a lock in a successful transactional write and buffer the new value into the writeset. When Wrl is executed, the corresponding lock is already held by the current transaction 104; the lazy write simply buffers the latest new value into the writeset. In the commit phase, the STM 100 may be faced with a lazy writeset entry whose corresponding lock is already held by the current transaction 104. The STM 100 anticipates such a scenario, so no additional lock acquire is necessary in the commit phase.
When a lazy write (Wrl) is followed by an eager write (Wre) to the same memory location, the former will not acquire any lock, but just buffer the new value. The latter will acquire the lock and buffer the new value. During the commit phase, the implementation may be faced with a lazy writeset entry that has the corresponding lock held by the current transaction 104. This is similar to the previous scenario and is anticipated by the STM. Consequently, no additional lock acquire is necessary in the commit phase.
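The commit-phase rule common to both scenarios can be sketched as follows. The Orec layout and the owner encoding (-1 for unowned) are assumptions:

```cpp
// An OREC sketched as a simple ownership record; owner -1 means unowned.
struct Orec {
    int owner = -1;
};

// Commit-time acquire for a lazy writeset entry: if the current
// transaction already holds the OREC (e.g., because an eager write to
// the same location acquired it earlier), no additional acquire is
// needed; an OREC held by a different transaction signals a conflict.
bool acquire_if_needed(Orec& o, int tx_id) {
    if (o.owner == tx_id) return true;   // already held: no-op
    if (o.owner != -1) return false;     // held by another transaction
    o.owner = tx_id;                     // normal commit-time acquire
    return true;
}
```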
The above two scenarios hold regardless of the presence of false conflicts. For reference-level hybridization, consistency is maintained between multiple references within an atomic section 102. A data race is not created by the implementation of reference-level hybridization. This is because a location is protected by the same OREC 108 that is to be held by the thread trying to modify that location.
In one embodiment, the determination whether to modify the policy 110 may be based on the performance penalty of an abort under the initial policy versus the cost of executing the SR 118 according to the modified policy 110. The performance penalty incurred by an SR 118, alternately a RAG node, is determined by the number of resulting aborts, and the work that is wasted due to an abort. This penalty may be represented as shown in Equation 1:
Csr=AN×(Srd+Swr)  (1)
where AN, Srd, and Swr respectively represent the abort count, the readset size, and the writeset size of the SR 118. The cost of a RAG-edge may be represented as shown in Equation 2:
Ce=AE×(Srd+Swr)  (2)
where AE, Srd and Swr respectively represent the abort count of the edge, the readset size, and writeset size of the target RAG-node. The total cost of the RAG, Ctot is simply the summation of Csr over all nodes.
Given a combination of policies for a victim and an aborter, the compiler may want to select a different combination with the aim of improving runtime performance. Since the policy 110 for any SR 118 may be either eager or lazy, and the SR 118 can be a read or a write, there are up to 2^4=16 potential abort scenarios. However, the aborter 114 cannot be a read, leaving only 2^3=8 potential scenarios. In embodiments, there are no differences modeled between eager and lazy reads. As such, there remain just 6 potential abort scenarios that the system 100 may reduce. In embodiments, the compiler 204 may modify the policy 110 for the victim 112 and the aborter 114 as shown in Table 1.
Each of the potential abort scenarios is listed under the Initial Policies column. In the Modified Policies column, the policies 110 listed for the aborter 114 and the victim 112 are configured to reduce aborts for the victim SR 118. For example, in the second row of Table 1, the aborter 114 is a write performed with a lazy policy. The victim 112 is also a lazy write. As shown, the compiler 204 changes the policy of the victim 112 to an eager write. In this way, the aborts of the victim 112 in such scenarios may be reduced. It is noted that the Modified Policies of Table 1 are locally preferred solutions in the sense that they consider only the cost of the victim 112, but not the aborter 114. Further, the cost is determined in isolation from the rest of the victims 112 in the application 116.
Given a RAG and a table of locally preferred solutions, e.g., Table 1, embodiments select the policy 110 for every atomic section 102 that reduces the total cost of the RAG. In one embodiment, local solutions may be propagated throughout the entire application 116.
The method may begin at block 502. Blocks 502-510 are repeated for each SR 118. At block 504, the analyzer 212 determines whether there is a locally preferred solution for a potential abort. If not, the next SR 118 is considered at block 502. If there is a local solution for an SR 118, at block 506, the change in cost of a transactional execution of this SR and all adjacent SRs is determined using this preferred solution. Given an SR 118, another SR is adjacent if the two have an edge between them in the RAG. At block 508, it is determined whether the change in cost is beneficial. If not, the next SR 118 is considered at block 502. If found beneficial, at block 510, the policy of this SR 118 is changed in the RAG. In one embodiment, a compiler may use the RAG in the second phase to determine which policies to apply for each SR 118.
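Blocks 502-510 can be sketched as a single pass over the RAG nodes. In this sketch the cost fields are assumed to be precomputed aggregates of Csr over each SR and its adjacent SRs, under the current policies and under the locally preferred solution, respectively:

```cpp
#include <vector>

// One RAG node from the analyzer's point of view (names assumed).
struct SrNode {
    bool has_local_solution;   // block 504: Table 1 has an entry?
    long old_cost;             // cost over this SR + adjacent SRs, current policies
    long new_cost;             // same cost under the locally preferred solution
    bool policy_changed = false;
};

// Returns the number of SRs whose policy was changed in the RAG.
int propagate_local_solutions(std::vector<SrNode>& rag) {
    int changes = 0;
    for (SrNode& sr : rag) {                     // block 502: next SR
        if (!sr.has_local_solution) continue;    // block 504
        long delta = sr.new_cost - sr.old_cost;  // block 506
        if (delta < 0) {                         // block 508: beneficial?
            sr.policy_changed = true;            // block 510: update the RAG
            ++changes;
        }
    }
    return changes;
}
```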
The system 600 may include a server 602, in communication with clients 604, over a network 606. The server 602 may include a processor 608, which may be connected through a bus 610 to a display 612, a keyboard 614, an input device 616, and an output device, such as a printer 618. The input devices 616 may include devices such as a mouse or touch screen. The server 602 may also be connected through the bus 610 to a network interface card 620. The network interface card 620 may connect the server 602 to the network 606. The network 606 may be a local area network, a wide area network, such as the Internet, or another network configuration. The network 606 may include routers, switches, modems, or any other kind of interface device used for interconnection. In one example, the network 606 may be the Internet.
The server 602 may have other units operatively coupled to the processor 608 through the bus 610. These units may include non-transitory, computer-readable storage media, such as storage 622. The storage 622 may include media for the long-term storage of operating software and data, such as hard drives. The storage 622 may also include other types of non-transitory, computer-readable media, such as read-only memory and random access memory. The storage 622 may include the machine readable instructions used in examples of the present techniques. In an example, the storage 622 may include a shared memory 624, a STM runtime library 626, and transactions 628. The shared memory 624 may be storage, such as RAM, that is shared among transactions 628 invoking routines from the STM runtime library 626. The transactions 628 are dynamic execution instances of compiled atomic sections 102. The transactions 628 invoke STM accesses 630 from the STM runtime library 626. The STM accesses 630 may be policy-specific or non-specific accesses to the shared memory 624 at the memory reference level. For example, the STM accesses 630 may include the TxStore( ), TxStore_eager( ), and TxStore_lazy( ) commands described with reference to