Some embodiments of the present invention may relate generally to software optimization, and/or to optimizing sequential loops for speculative parallel execution during code compilation.
In computers with the ability to perform parallel processing, sequential loops in computer code can often be transformed with the use of parallel threads to allow more parallel execution of the loop. As seen, for example, in
Components/terminology used herein for one or more embodiments of the invention are described below:
In some embodiments, “computer” may refer to any apparatus that is capable of accepting a structured input, processing the structured input according to prescribed rules, and producing results of the processing as output. Examples of a computer may include: a computer; a general purpose computer; a supercomputer; a mainframe; a super mini-computer; a mini-computer; a workstation; a microcomputer; a server; an interactive television; a hybrid combination of a computer and an interactive television; and application-specific hardware to emulate a computer and/or software. A computer may have a single processor or multiple processors, which may operate in parallel and/or not in parallel. A computer may also refer to two or more computers connected together via a network for transmitting or receiving information between the computers. An example of such a computer may include a distributed computer system for processing information via computers linked by a network.
In some embodiments, a “machine-accessible medium” may refer to any storage device used for storing data accessible by a computer. Examples of a machine-accessible medium may include: a magnetic hard disk; a floppy disk; an optical disk, such as a CD-ROM or a DVD; a magnetic tape; a memory chip; and a carrier wave used to carry machine-accessible electronic data, such as those used in transmitting and receiving e-mail or in accessing a network.
In some embodiments, “software” may refer to prescribed rules to operate a computer. Examples of software may include: code segments; instructions; computer programs; and programmed logic.
In some embodiments, a “computer system” may refer to a system having a computer, where the computer may comprise a computer-readable medium embodying software to operate the computer.
The foregoing and other features and advantages of the invention will be apparent from the following, more particular description of embodiments of the invention, as illustrated in the accompanying drawings wherein like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. The leftmost digits in the corresponding reference number indicate the drawing in which an element first appears.
Embodiments of the invention are discussed in detail below. While specific exemplary embodiments are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations can be used without departing from the spirit and scope of the invention.
In an exemplary embodiment, the method of the present invention may be part of a compiler and may optimally transform a sequential computer program loop into a speculative parallel thread (SPT) execution loop during code compilation. The SPT loop may be optimized such that the cost of re-execution (i.e., the misspeculation cost) is minimized subject to the constraint that the pre-fork region partition size does not exceed a pre-specified maximum requirement.
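The optimization goal described above may be written compactly as a constrained minimization. The following is an illustrative restatement only; the symbols \mathcal{L} (the set of legal partitions), S(P) (the pre-fork region size of a partition P), C(P) (its misspeculation cost), and S_{\max} (the pre-specified maximum pre-fork region size) are notational assumptions introduced here for clarity.

```latex
P^{*} \;=\; \operatorname*{arg\,min}_{P \,\in\, \mathcal{L}} \; C(P)
\qquad \text{subject to} \qquad S(P) \;\le\; S_{\max}
```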
The resulting pre- and post-fork regions may be optimal for that loop. Then, if the pre- and post-fork regions meet specified partitioning and SPT loop criteria at block 208, the loop may be transformed into an optimal SPT loop 212 in block 210. If the pre- and post-fork regions do not meet the partitioning criteria, then the sequential loop may not be a candidate for SPT partitioning and the process may continue with block 214, where no SPT is created.
In the dependence graph G that may result from block 204, all intra-iteration edges may be forward edges (i.e., the arrows 304 may all point toward the bottom of the loop in
Once the dependence graph G is built for the sequential loop, the loop may be partitioned in block 206. An optimal partition, if one exists, may be found within the set of legal partitions. In an exemplary embodiment, the method of the present invention may search in the set of legal partitions that include the movement of violation candidates, because only the movement of violation candidates may reduce the misspeculation cost. For all of the possible legal partitions that may include a movement of at least one violation candidate into the pre-fork region, the resulting size of the pre-fork region S and the number of re-executed instructions in the speculatively executed iteration (i.e., the misspeculation cost) C may be considered. If the size S of the pre-fork region exceeds a maximum allowed size, then the partition may not be optimal. The partition with the smallest misspeculation cost C that still meets the pre-fork region size S requirement may be the optimal partition.
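The selection rule described above may be illustrated by the following minimal Python sketch, which assumes that the size S and cost C of each candidate partition have already been computed. The `Partition` record, its field names, and the function `pick_optimal` are hypothetical names introduced for illustration only, and treating a pre-fork region whose size equals the maximum as acceptable is likewise an assumption.

```python
from typing import NamedTuple, Optional, Sequence


class Partition(NamedTuple):
    """Hypothetical record for one legal partition of the loop."""
    name: str           # illustrative label only
    prefork_size: int   # S: size of the pre-fork region
    misspec_cost: int   # C: number of re-executed instructions


def pick_optimal(candidates: Sequence[Partition], s_max: int) -> Optional[Partition]:
    """Return the feasible partition with the smallest cost C, or None.

    A partition is treated as feasible when S <= s_max (an assumption
    about how the size limit is applied).
    """
    feasible = [p for p in candidates if p.prefork_size <= s_max]
    if not feasible:
        return None  # no partition satisfies the size requirement
    return min(feasible, key=lambda p: p.misspec_cost)
```

For example, given hypothetical candidates with (S, C) values of (0, 9), (2, 6), and (6, 1) and a maximum size of 5, the sketch would select the second candidate, because the third exceeds the size limit even though its cost is lower.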
When a violation candidate is not moved into the pre-fork region of the partition, all program code that depends on the violation candidate in the next iteration may be executed incorrectly in the speculative thread, and if so would need to be re-executed by the master thread.
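The propagation of misspeculation described above may be sketched as a reachability computation over the dependence graph. The following Python sketch is illustrative only and rests on several assumptions: the graph is supplied as two hypothetical adjacency mappings (`intra_edges` for intra-iteration dependences and `cross_edges` for cross-iteration dependences from a violation candidate to its dependents in the next iteration), every node counts equally, and the probability that a dependence actually occurs is ignored.

```python
from collections import deque
from typing import Dict, Iterable, Set


def misspeculated_nodes(
    intra_edges: Dict[str, Iterable[str]],
    cross_edges: Dict[str, Iterable[str]],
    prefork: Set[str],
) -> Set[str]:
    """Return next-iteration nodes that may need re-execution.

    Assumption: a node is affected if it depends, directly or through
    intra-iteration edges, on a violation candidate that was left in
    the post-fork region of the previous iteration.
    """
    # Seeds: direct next-iteration dependents of unmoved violation candidates.
    seeds: Set[str] = set()
    for candidate, dependents in cross_edges.items():
        if candidate not in prefork:
            seeds.update(dependents)

    # Propagate along intra-iteration edges (breadth-first).
    affected = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for succ in intra_edges.get(node, ()):
            if succ not in affected:
                affected.add(succ)
                queue.append(succ)
    return affected
```

Under these assumptions, the size of the returned set may serve as a simple estimate of the misspeculation cost C for a given pre-fork region; moving a violation candidate into `prefork` removes its dependents from the seed set.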
The table shown in
If the maximum pre-fork region size is set, for example, at 5, there may be only two possible partitions, as seen in
Next, starting with the root partition, which is the partition having an empty pre-fork region, e.g., partition A in
If the partition P has a pre-fork size smaller than Smax at 408, then the combined misspeculation cost of any nodes in the partition P having a lower topological order number than any of the nodes in the pre-fork region may be estimated in step 410. This cost, C_least, may be the lower bound of the optimal misspeculation cost of all of the child partitions of P, because those nodes (having a lower topological order number than any of the pre-fork nodes) may never be moved into the pre-fork region. If C_least is higher than C_best at 412, the partition P may be rejected, and the search may either end at 430 or may return to the parent partition of P at 428. If C_least is smaller than C_best at 412, then, for each node in the post-fork region of P that has a higher topological order number than any node in the pre-fork region and whose predecessors are all in the pre-fork region, a new child partition P′ may be created by moving one such node from the post-fork region into the pre-fork region in block 416. A child partition is defined as a partition having one more node in the pre-fork region than its parent partition (here, P) has.
Each child partition of P may then be searched recursively in block 418, beginning at block 406. When all of the child partitions of P have been searched, the current misspeculation cost of P may be calculated in block 420. If that current misspeculation cost is larger than C_best at 422, the partition P may be rejected. If the current misspeculation cost is not larger than C_best, the value of C_best may be updated to equal the current misspeculation cost of P, and partition P may be stored as the current best partition. If there are no other partitions to examine, i.e., if P is the root partition, the process may end at 430.
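The recursive search with pruning described above may be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a definitive implementation: nodes are identified by their topological order numbers 0..n-1; the misspeculation cost is modeled as an additive, hypothetical per-candidate `penalty` incurred when a violation candidate stays in the post-fork region; the phrase "lower topological order number than any of the nodes in the pre-fork region" is read as "lower than the highest-numbered pre-fork node"; and a partition whose size equals the maximum is evaluated but not expanded. The function and parameter names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def find_best_partition(
    preds: List[Set[int]],     # preds[v]: intra-iteration predecessors of node v
    sizes: List[int],          # sizes[v]: contribution of node v to pre-fork size
    penalty: Dict[int, int],   # assumed cost if violation candidate v is not moved
    s_max: int,                # maximum allowed pre-fork region size
) -> Tuple[FrozenSet[int], int]:
    n = len(preds)
    best_cost = float("inf")               # C_best
    best_prefork: FrozenSet[int] = frozenset()

    def visit(prefork: FrozenSet[int], last: int, used: int) -> None:
        nonlocal best_cost, best_prefork
        if used > s_max:
            return                          # pre-fork region too large: reject
        if used < s_max:
            # C_least: candidates numbered below the last moved node can never
            # be moved by any descendant, so their penalty is unavoidable.
            c_least = sum(c for v, c in penalty.items()
                          if v < last and v not in prefork)
            if c_least > best_cost:
                return                      # prune P and all of its children
            for v in range(last + 1, n):
                if preds[v] <= prefork:     # all predecessors already moved
                    visit(prefork | {v}, v, used + sizes[v])
        # After the children have been searched, evaluate P itself.
        cost = sum(c for v, c in penalty.items() if v not in prefork)
        if cost <= best_cost:
            best_cost, best_prefork = cost, prefork

    visit(frozenset(), -1, 0)               # root partition: empty pre-fork region
    return best_prefork, best_cost
```

Because each child adds only a node numbered higher than every node already in the pre-fork region, each legal partition is generated at most once, and because children are searched before their parent is evaluated, a tie in cost is resolved in favor of the smaller pre-fork region in this sketch.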
Once the optimal partition is found, if the partition meets an additional set of criteria, the sequential loop may be transformed into an SPT loop. The criteria may include, for example, but are not limited to, a minimum and a maximum loop size, a maximum ratio of pre-fork region size to loop size, and a maximum ratio of misspeculation cost to loop size. As seen, for example, in
Some embodiments of the invention, as discussed above, may be embodied in the form of software instructions on a machine-accessible medium. Such an embodiment is illustrated in
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should instead be defined only in accordance with the following claims and their equivalents.