1. Technical Field
The present disclosure relates to architectural extensions of a microprocessor that support the execution of a software runtime system for thread-level speculation.
2. Discussion of Related Art
On multiprocessor and multi-core computers, the multiple processors or cores may execute multiple threads concurrently, with different threads of a given process running on different processors or cores. For example, in systems using symmetric multiprocessing, typically any processes, including those of the operating system, can run on any available processor, and the threads of a single process can run on different processors at the same time.
In multiprocessor systems, shared memory needs to be kept consistent, wherein the integrity of data stored in local caches of each processor is preserved. This is known as cache coherency. Various models and protocols have been devised for maintaining cache coherence, such as the MESI protocol, MSI protocol and MOESI protocol. Typically, the cache coherence protocols in multiprocessors support a sequential consistency model. For sequential consistency, a globally (i.e., across all memory locations) consistent view of memory access operations is taken. For example, a system provides sequential consistency if every node of the system sees the write operations on the same memory cell in the same order.
Thread-level speculation (TLS) is a technique that allows a sequential computation to be divided into a sequence of “epochs” and enables sequentially consistent, concurrent execution of the epochs even though memory access in different epochs may be data dependent. For thread-level speculation, the epochs are ordered corresponding to their occurrence in the original sequential program. In TLS, epochs are typically executed in different threads.
In general, speculative execution is the execution of code the result of which may not be needed. In the context of thread-level speculation, speculative execution means that side-effects such as memory updates or I/O operations are not made visible to other epochs until the execution of the epoch performing the side-effect is confirmed to be safe, e.g., all epochs preceding it in the linear order have successfully completed. When epochs with data dependences execute concurrently, it may occur that an epoch, e.g., denoted “e1”, reads a value from memory that is (later in the real time) changed by another epoch, e.g., denoted “e2”, where e2 precedes e1 in the linear order. In such event, execution of epoch e1 fails.
The functional aspects of thread-level speculation may include the start of a new epoch, versioning, conflict detection, rollback and ordered commit, etc. Start of a new epoch refers to a mechanism to create a speculative execution context, define the epoch's position in the speculation order, and start execution of the epoch. Versioning refers to a mechanism to confine updates of shared memory within the speculative execution context of an epoch until the epoch becomes non-speculative. Versioning support allows a system to tolerate output (write after write) and anti- (write after read) data dependences among concurrent tasks. Conflict detection refers to a mechanism to detect data dependence violations among tasks. For example, an epoch reads a value from a location that is updated concurrently by another epoch that is a predecessor in the speculation order, which is a violation of flow dependence: read after write. Rollback refers to a mechanism to restart the computation of a failed speculative computation. Ordered commit refers to a mechanism that controls the order in which epochs become non-speculative and makes their side effects accessible.
Hardware support for TLS commonly retains speculative versions of data in cache or in a separate hardware buffer, in either case with limited capacity. The cache coherence protocol may be extended to detect a violation of data flow dependence among epochs executing concurrently on different hardware threads. Hardware-centric implementations of TLS increase design complexity. For example, integral processor components, such as the load-store unit and cache, are affected by TLS extensions. In addition, conflict detection is performed at the granularity of cache lines, which can lead to false conflicts and decrease the parallelization efficiency of the mechanism.
Software support for TLS has been proposed that primarily relies on auxiliary data and control structures that enable data versioning and conflict detection. Auxiliary data and control structures and checks are commonly synthesized by the compiler. A reduced version of all epochs may be executed sequentially for the purpose of conflict detection; depending on whether a conflict is detected, a full version of the epochs may then be executed either in parallel without speculation or serially.
Systems for TLS may provide baseline TLS operations in hardware and may offload selected operations such as buffering of speculative state to software. In these and other contexts, there is a need for reduced hardware implementation complexity in computer systems with support for TLS.
According to an exemplary embodiment of the present invention, a system for thread-level speculation includes a memory system for storing a program code, a plurality of registers corresponding to one or more execution contexts, for storing sets of memory addresses that are accessed speculatively, and a plurality of processors, each providing the one or more execution contexts, in communication with the memory system, wherein a processor of the plurality of processors executes the program code to implement method steps of dividing a program into a plurality of epochs to be executed in parallel by the system, wherein one of the epochs is executed non-speculatively and the other epochs are executed speculatively, determining a current epoch to be executed on an execution context, encoding addresses read during execution of the current epoch, encoding addresses written during execution of predecessor epochs of the current epoch, and encoding addresses written during execution of successor epochs of the current epoch.
According to an exemplary embodiment of the present invention, a system for thread-level speculation includes a memory system for storing a program code, a first register, a second register and a third register, the first, second and third registers for storing sets of memory addresses that are speculatively accessed, and a plurality of processors, each providing the one or more execution contexts, in communication with the memory system, wherein a processor of the plurality of processors executes the program code to implement method steps of dividing a program into a plurality of epochs to be executed in parallel by the system, wherein one of the epochs is executed non-speculatively and the other epochs are executed speculatively, and determining a current epoch to be executed on an execution context, wherein the first register stores addresses read during execution of a current epoch, the second register stores addresses written during the execution of predecessor epochs of the current epoch, and the third register stores addresses written during the execution of successor epochs of the current epoch.
The present invention will become readily apparent to those of ordinary skill in the art when descriptions of exemplary embodiments thereof are read with reference to the accompanying drawings.
Hereinafter, exemplary embodiments of the present invention will be described with reference to the accompanying drawings.
In various exemplary embodiments of the present invention, a processor or multiprocessor system has multiple hardware execution contexts (or “execution contexts”). A computer system with support for thread-level speculation (TLS), according to various exemplary embodiments of the present invention, includes various architectural extensions and resources that support operations of TLS.
It is to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. In one embodiment, the present invention may be implemented in software as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a computer system comprising any suitable architecture.
Referring to
The computer platform 101 also includes an operating system and micro instruction code. The various processes and functions described herein may either be part of the micro instruction code or part of the application program (or a combination thereof) which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures may be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
It will be appreciated that the hardware depicted in
It is to be understood that a program storage device can be any medium that can contain, store, communicate, propagate or transport a program of instructions for use by or in connection with an instruction execution system, apparatus or device. The medium can be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a program storage device include a semiconductor or solid state memory, magnetic tape, removable computer diskettes, RAM (random access memory), ROM (read-only memory), rigid magnetic disks, and optical disks such as a CD-ROM, CD-R/W and DVD.
A data processing system suitable for storing and/or executing a program of instructions may include one or more processors coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories that provide temporary storage of at least some program code to reduce the number of times code must be retrieved from bulk storage during execution.
The execution contexts ec1A-ec4A (202, 204, 206, 208) and ec1B-ec4B (201, 203, 205, 207), in
Referring to
As shown in
Referring to
Referring to
In an exemplary embodiment of the present invention, the three signature registers 305-307 are implemented as (i) a read sig-register, which stores encoded addresses read during execution of the current epoch, (ii) a write-pred sig-register, which stores encoded addresses written during the execution of epochs that precede the current epoch in the sequential execution (predecessor epochs), and (iii) a write-succ sig-register, which stores encoded addresses written during the execution of epochs that succeed the current epoch in the sequential execution (successor epochs).
Signature registers 305-307 may store 1-4K bits, for example. Signature registers 305-307 may contain superset representations of address sets where individual addresses are encoded as bitmask hash values, such as for example, through a Bloom filter.
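By way of illustration only, a signature register of this kind can be modeled in software as a fixed-width bit vector. The following is a minimal sketch, assuming a 1K-bit register and two hash functions derived from SHA-256; the class name Signature, its method names and the choice of hash are hypothetical and are not part of the disclosure.

```python
import hashlib


class Signature:
    """Fixed-width bit vector that over-approximates a set of addresses."""

    def __init__(self, bits=1024, hashes=2):
        self.bits = bits        # register width (assumed 1K bits)
        self.hashes = hashes    # number of hash functions (assumption)
        self.value = 0          # register contents, held as a Python int

    def _mask(self, addr):
        # Map an address to `hashes` bit positions (bitmask hash value).
        mask = 0
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            mask |= 1 << (int.from_bytes(digest[:8], "big") % self.bits)
        return mask

    def add(self, addr):
        # Idempotent: adding the same address twice changes nothing.
        self.value |= self._mask(addr)

    def may_contain(self, addr):
        # True for every added address; may also be True spuriously.
        mask = self._mask(addr)
        return (self.value & mask) == mask

    def may_intersect(self, other):
        # Conservative test: never False when a common address exists.
        return (self.value & other.value) != 0

    def clear(self):
        self.value = 0
```

Because the encoding is a superset representation, may_contain and may_intersect can report false positives but never false negatives.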
The processor 302 may issue instructions to reset (clear) the individual sig-registers 305-307, instructions to swap register contents, instructions to add a datum (e.g., an address of a memory location) to a signature register, and/or instructions to compute set intersection and set membership based on Bloom filter operations. A processor may also provide a mechanism to add a datum (e.g., an address of a memory location) to a signature register of another processor.
In various exemplary embodiments of the present invention, a processor or multiprocessor system has multiple hardware execution contexts. In an implementation of multiple hardware execution contexts, a linear order is maintained among the hardware contexts. For example, the linear order may be specified through a unique identification (“id”) that is associated with each hardware execution context. This id can be held in a dedicated register 304. A hardware execution context includes a dedicated register 303 that specifies if the execution context is speculative or non-speculative. In an exemplary embodiment of the present invention, the information held in register 303 is used to optimize memory read and write operations, wherein speculative execution memory read and write accesses are accompanied by software read and write barriers, and wherein execution of such barriers is not required in a non-speculative execution. If a processor computes in non-speculative mode, it is the “owner of the commit token”. There is exactly one commit token in the system. A processor may issue an instruction that enables an execution context to determine if it is owner of the commit token. An issued instruction may pass on the commit token from one hardware context to its successor in the linear order.
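The ownership rule for the commit token can be illustrated with a small software model. This is a sketch under stated assumptions: the names ExecutionContext, CommitToken, is_owner and pass_on are hypothetical, and the wrap-around from the last context to the first is an assumption made only to keep the example closed.

```python
class ExecutionContext:
    def __init__(self, ctx_id):
        self.ctx_id = ctx_id      # position in the linear order (cf. register 304)
        self.speculative = True   # speculative/non-speculative flag (cf. register 303)


class CommitToken:
    """Models the single commit token shared by all execution contexts."""

    def __init__(self, contexts):
        self.contexts = sorted(contexts, key=lambda c: c.ctx_id)
        self.owner_index = 0
        self.contexts[0].speculative = False  # the owner is non-speculative

    def is_owner(self, ctx):
        return self.contexts[self.owner_index] is ctx

    def pass_on(self):
        # Hand the token to the successor in the linear order.
        self.contexts[self.owner_index].speculative = True
        # Wrap-around is an assumption of this sketch.
        self.owner_index = (self.owner_index + 1) % len(self.contexts)
        self.contexts[self.owner_index].speculative = False
```

At every point in this model exactly one context is non-speculative, which mirrors the statement that there is exactly one commit token in the system.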
A computer system with support for TLS, according to various exemplary embodiments of the present invention, encompasses a number of functional aspects including ones for (1) starting a new epoch, (2) versioning updates of shared memory, (3) conflict detection, (4) rollback of an epoch that failed and (5) ordered commit of epochs.
For purposes of clarity and simplicity, the following disclosure is generally directed to an execution context that executes exactly one epoch from start to end. In such cases, execution contexts are ordered according to the successor/predecessor relationship of the epochs that they execute.
Referring to
When an epoch starts execution, the read and write-succ sig-registers of the execution context are reset, i.e., their values represent the empty set. The write-pred sig-register is left unmodified.
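The register handling at epoch start can be sketched as follows. The class SigRegisters and the function start_epoch are hypothetical names, and plain Python sets stand in for the hardware signature registers.

```python
class SigRegisters:
    """Stand-in for the three signature registers of one execution context."""

    def __init__(self):
        self.read = set()
        self.write_pred = set()
        self.write_succ = set()


def start_epoch(regs):
    # Reset read and write-succ so that they represent the empty set.
    regs.read.clear()
    regs.write_succ.clear()
    # write_pred is deliberately left untouched: writes already performed
    # by predecessor epochs remain relevant to the new epoch.
```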
Table 1 includes an example of pseudo-code for implementing the start of an epoch of
Versioning occurs if a shared location is written by a speculative thread. In such event, a software write barrier is executed.
As illustrated in
Table 2 includes an example of pseudo-code for implementing the software write barrier of
In addition to adding or updating an entry in the write log, a write barrier adds the address of the location that is written to the write-pred signature registers of other processors (701 and 702).
Dual to write operations, read operations need to retrieve values from speculative storage if the read refers to a shared location that may have been written before in the same epoch. This is achieved by a software read barrier that is associated with read operations to shared locations.
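The interplay of the write log and the read barrier can be sketched in software as shown next. This is an illustration only: the class Epoch, its field names, and the use of a dictionary as shared memory are assumptions, and the decision not to record reads that are served from the epoch's own write log is a simplification of this sketch.

```python
class Epoch:
    def __init__(self, memory):
        self.memory = memory   # shared memory, modeled as {address: value}
        self.write_log = {}    # speculative versions written by this epoch
        self.read_log = {}     # address -> value first read from shared memory
        self.read_sig = set()  # stand-in for the read sig-register

    def write_barrier(self, addr, value):
        # Versioning: confine the update to the speculative context.
        self.write_log[addr] = value

    def read_barrier(self, addr):
        # A location written earlier in the same epoch is served from
        # speculative storage (assumption: such reads are not recorded).
        if addr in self.write_log:
            return self.write_log[addr]
        value = self.memory[addr]
        self.read_sig.add(addr)                # record the read address
        self.read_log.setdefault(addr, value)  # keep the first value seen
        return value
```

In this model shared memory is never modified by a speculative write, so a later read of the same location within the epoch must consult the write log first.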
Table 3 includes an example of pseudo-code for implementing the software read barrier of
Conflict detection is done through set intersection of the read sig-register and the write-pred sig-register, in block 901. If the intersection is empty, in block 903, then the current execution context (performing the conflict detection) is free of conflicts with predecessor contexts (case 906). If the intersection is not empty, then a conflict may have occurred (case 907). The uncertainty over whether a conflict has occurred is due to the signatures being an over-approximation of sets of addresses. Hence, a conflict detected on the basis of the signature intersection may be spurious (902). Thus, a more precise validation can be done, for example, in one of the following two ways:
(i) Iterate through the read-log and verify the values previously read are the same as those found in the coherence domain (case 904). This requires that the read log contains addresses and values. It should be ensured that all predecessor epoch(s) have their updates installed in the coherence domain before performing this precise validation. The cost of this validation is O(|read|). If the validation is successful for all entries in the read log, in block 905, then speculative execution succeeds and the execution context can complete the epoch (commit operation) (case 906). Otherwise, the execution context repeats execution of the epoch (roll back) (case 907).
(ii) Alternatively, intersect the precise read set with the precise write-set-pred of each processor core. Here, a precise representation of the read and write sets is retained in all cores. The cost of this validation for P execution contexts is O(P |read| |write|). This option is not shown in
Conflict detection can be accelerated with hardware support. Conflict detection can be triggered explicitly by software as needed at commit time.
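The two-stage check described above, a signature intersection followed by the value-based validation of option (i), can be sketched as follows. The names sig_bit and validate are hypothetical, and the 64-bit single-hash signature is a deliberately weak toy encoding chosen so that a spurious conflict is easy to demonstrate.

```python
def sig_bit(addr, bits=64):
    # Toy signature encoding (assumption): one bit per address, modulo width.
    return 1 << (addr % bits)


def validate(read_sig, write_pred_sig, read_log, memory):
    """Return True if the epoch may commit, False if it must roll back."""
    # Stage 1: an empty signature intersection proves there is no conflict.
    if (read_sig & write_pred_sig) == 0:
        return True
    # Stage 2: the intersection may be spurious, so compare each logged
    # value with the value now visible in shared memory, O(|read|).
    return all(memory[addr] == value for addr, value in read_log.items())
```

In this toy encoding addresses 1 and 65 map to the same bit, so a predecessor write to address 65 produces a non-empty intersection with a read of address 1, and the value comparison then shows that the conflict was spurious.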
Table 4 includes an example of pseudo-code for implementing the conflict detection process of
Conflict detection is done using the three signature registers in each execution context, e.g., hardware execution thread.
Operations of signature registers used during the execution of an epoch, according to an exemplary embodiment of the present invention, are described below: When an epoch performs a read operation on shared memory, the address of the location that is read is added to the read register. Only one read per location needs to be recorded, since the signature register represents a set on which add operations are idempotent.
When an epoch performs a write operation on shared memory, the address of the location is broadcast to other execution contexts (701 and 702), and the address is added to either the write-pred or the write-succ register of each of the other execution contexts. The choice of register depends on the pred/succ relation that the writing context has with the context whose write signature register is updated. An individual execution context is aware of its pred/succ relation relative to other execution contexts based on the total order among the epochs that execute on the individual contexts. Only one write to a specific location needs to be broadcast, since the signature register represents a set on which add operations are idempotent. An operation is idempotent if it can be applied a second time without altering the result obtained by the first application of the operation.
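The routing of a broadcast write address to the write-pred or write-succ register of each receiving context can be sketched as follows. The names Context and broadcast_write are hypothetical, an integer field stands in for the position of the context's epoch in the total order, and Python sets stand in for the signature registers.

```python
class Context:
    def __init__(self, order):
        self.order = order       # position of this context's epoch in the total order
        self.write_pred = set()  # addresses written by predecessor epochs
        self.write_succ = set()  # addresses written by successor epochs


def broadcast_write(writer, contexts, addr):
    for other in contexts:
        if other is writer:
            continue
        if writer.order < other.order:
            # The writer precedes the receiver: a predecessor write.
            other.write_pred.add(addr)
        else:
            # The writer follows the receiver: a successor write.
            other.write_succ.add(addr)
```

Repeating the broadcast for the same address leaves every register unchanged, which reflects the idempotence of the add operation.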
At commit, the write-succ register is copied to the write-pred register (1003) and the write-succ register is cleared (1004). This prepares the execution context to ‘move’ to the head of the speculation chain and to start execution of a new epoch.
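The register update at commit can be sketched as follows. The names Context and commit are hypothetical, and the step that installs the write log into shared memory is an assumption of this sketch; the passage above specifies only the two register operations.

```python
class Context:
    def __init__(self):
        self.write_log = {}      # buffered speculative writes
        self.write_pred = set()  # stand-in for the write-pred sig-register
        self.write_succ = set()  # stand-in for the write-succ sig-register


def commit(ctx, memory):
    # Assumption: buffered writes become visible in shared memory here.
    memory.update(ctx.write_log)
    ctx.write_log.clear()
    # Copy write-succ to write-pred (1003): epochs that were successors
    # become predecessors of the next epoch run on this context.
    ctx.write_pred = set(ctx.write_succ)
    # Clear write-succ (1004).
    ctx.write_succ.clear()
```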
An epoch is rolled back when a validation operation 1002 (conflict detection) fails. Rollback clears the read log, the write log and the read register. The control transfer to the start of the epoch may be done in software, e.g., through the back-edge of a while loop or a lightweight setjmp/longjmp (1009).
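The while-loop variant of the control transfer can be sketched as follows. The function run_epoch and the signatures of the body and validate callbacks are hypothetical; the sketch retries without bound, which is an assumption made for brevity.

```python
def run_epoch(body, validate):
    """Execute `body` speculatively until `validate` accepts the result."""
    read_log, write_log, read_sig = {}, {}, set()
    attempts = 0
    while True:  # the back-edge transfers control to the start of the epoch
        attempts += 1
        body(read_log, write_log, read_sig)
        if validate(read_log):
            return write_log, attempts
        # Rollback: discard all speculative state before re-executing.
        read_log.clear()
        write_log.clear()
        read_sig.clear()
```

The caller receives the write log of the first attempt that validates, together with the number of attempts that were needed.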
It is optional that an epoch signals its successor(s) on the occurrence of a failed validation (1007 and 1008). Such signaling may trigger an additional validation step in the successor epoch(s), which may detect the necessity for rollback early.
Table 5 includes an example of pseudo-code for implementing the ordered commit process of
The pseudo-code of Tables 1-5 is provided by way of example only and embodiments of the present invention are not to be construed as limited thereby.
Although exemplary embodiments of the present invention have been described in detail with reference to the accompanying drawings for the purpose of illustration and description, it is to be understood that the inventive processes and apparatus are not to be construed as limited thereby. It will be apparent to those of ordinary skill in the art that various modifications to the foregoing exemplary embodiments may be made without departing from the scope of the disclosure.