One or more embodiments of the invention relate generally to the field of multi-thread micro-architectures. More particularly, one or more of the embodiments of the invention relates to a method and apparatus for an automatic thread-partition compiler.
Hardware multi-threading is becoming a practical technique in modern processor design. Several multi-threaded processors have been announced in the industry or are in production in the areas of high-performance computing, multi-media processing and network packet processing. The Internet exchange processor (IXP) series, which belongs to the Intel® Internet Exchange™ Architecture (IXA) Network Processor (NP) family, is one such example of a multi-threaded processor. In general, each IXP includes a highly parallel, multi-threaded architecture in order to meet the high-performance requirements of packet processing.
Generally, NPs are specifically designed to perform packet processing and are conventionally used as a core element of high-speed communication routers. Traditional network applications for performing packet processing are coded using sequential semantics. Typically, such network applications are coded around a unit of packet processing (a packet processing stage (PPS)) that runs forever. Hence, when a new packet arrives, the PPS performs a series of tasks (e.g., receipt of the packet, routing table look-up and enqueuing of the packet). Consequently, a PPS is usually expressed as an infinite loop (or a PPS loop), with each iteration processing a different packet.
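The PPS structure described above can be sketched as follows; the helper names and routing-table layout are hypothetical, and a real PPS runs as an infinite loop rather than over a finite stream:

```python
# Illustrative sketch of a packet processing stage (PPS). The helpers and
# table layout are assumptions; an actual PPS is an infinite loop
# (while True: packet = receive(); ...), with each iteration handling
# one packet: receipt, routing table look-up, and enqueuing.

def route(packet, routing_table):
    """One routing table look-up for a received packet."""
    return routing_table.get(packet["dst"], "default-port")

def pps(packets, routing_table):
    """The PPS loop body, written over a finite stream for illustration."""
    out_queue = []
    for packet in packets:                   # each iteration: a new packet
        port = route(packet, routing_table)  # routing table look-up
        out_queue.append((port, packet))     # enqueue the packet
    return out_queue
```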
Hence, in spite of the highly parallel, multi-threaded architecture provided by modern NPs, failure to exploit such parallelism leaves processor resources largely unused. Undoubtedly, poor performance gain is achieved if a sequential application program runs on top of the advanced multi-threaded architectures provided by NPs. In order to achieve high performance, programmers have tried to fully utilize the multi-threaded architecture provided by NPs by exploiting the thread-level parallelism of sequential applications. Unfortunately, manual thread-partitioning is a challenge for most programmers.
The various embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
A method and apparatus for an automatic thread-partition compiler are described. In one embodiment, the method includes the transformation of a sequential application program into a plurality of application program threads. Once partitioned, the plurality of application program threads are concurrently executed as respective threads of a multi-threaded architecture. Hence, a performance improvement of the parallel multi-threaded architecture is achieved by hiding memory access latency, overlapping memory access with computations or with other memory accesses.
In the following description, certain terminology is used to describe features of the invention. For example, the term “logic” is representative of hardware and/or software configured to perform one or more functions. For instance, examples of “hardware” include, but are not limited or restricted to, an integrated circuit, a finite state machine or even combinatorial logic. The integrated circuit may take the form of a processor such as a microprocessor, an application specific integrated circuit, a digital signal processor, a micro-controller, or the like.
An example of “software” includes executable code in the form of an application, an applet, a routine or even a series of instructions. The software may be stored in any type of computer or machine readable medium such as a programmable electronic circuit, a semiconductor memory device inclusive of volatile memory (e.g., random access memory, etc.) and/or non-volatile memory (e.g., any type of read-only memory “ROM,” flash memory), a floppy diskette, an optical disk (e.g., compact disk or digital video disk “DVD”), a hard drive disk, tape, or the like.
In one embodiment, the present invention may be provided as an article of manufacture which may include a machine or computer-readable medium having stored thereon instructions which may be used to program a computer (or other electronic devices) to perform a process according to one embodiment of the present invention. The computer-readable medium may include, but is not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or the like.
In the embodiment illustrated, ICH 160 is coupled to I/O bus 172 which couples a plurality of I/O devices, such as, for example, peripheral component interconnect (PCI) devices 170, including PCI-Express, PCI-X, third generation I/O (3GIO), or other like interconnect protocols. Collectively, MCH 120 and ICH 160 are referred to as chipset 180. As is described herein, the term “chipset” is used in a manner well known to those skilled in the art to describe, collectively, the various devices coupled to CPU 110 to perform desired system functionality. In one embodiment, main memory 140 is volatile memory including, but not limited to, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate (DDR) SDRAM (DDR SDRAM), Rambus DRAM (RDRAM), direct RDRAM (DRDRAM), or the like.
System
In contrast to conventional computer systems, computer system 100 includes thread partitioning compiler 200 for partitioning a sequential application program into a plurality of application program threads (“thread-partitioning”). Hence, compiler 200 may bridge the gap between the multi-threaded architecture of network processors and the sequential programming model used to code conventional network applications. One way to address this problem is to exploit the thread level parallelism of sequential applications. Unfortunately, manually thread-partitioning a sequential application is a challenge for most programmers. In one embodiment, thread-partition compiler 200 is provided for automatically thread-partitioning a sequential network application, as illustrated in
Referring to
In one embodiment, a sequential PPS infinite loop 282 is transformed into multiple program-threads (300-1, 300-2) with optimized synchronization between the program-threads to achieve improved parallel execution. Unfortunately, sequential PPS loops (e.g., 282) include the sequential execution of dependent operations involving, for example, loop carried variables. As described herein, a loop carried variable embodies a data dependence relation from one iteration of a loop to another. In one embodiment, a loop carried variable has two properties: (i) the loop carried variable is live on the back edge of a PPS loop of the sequential PPS; and (ii) the value of the loop carried variable is changed in the PPS loop body. Representatively, the variable i 286 represents a loop carried variable within a critical section 284 of PPS loop 282, as is described in further detail below.
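A minimal sketch of a loop-carried variable, using a hypothetical packet counter: the counter is live on the back edge and changed in the loop body, so its update forms a critical section once iterations are distributed across threads:

```python
def pps_with_counter(packets):
    """Sketch: `i` is a loop-carried variable -- its value flows from one
    iteration to the next (live on the back edge of the loop) and is
    changed in the loop body. When successive iterations run on different
    threads, the read-modify-write of `i` must execute in strict
    iteration order; that read-modify-write is the critical section."""
    i = 0
    labels = []
    for packet in packets:
        # --- critical section: read and update of loop-carried `i` ---
        i = i + 1
        # -------------------------------------------------------------
        labels.append((i, packet))
    return labels
```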
In one embodiment, critical sections of a sequential PPS are identified by surrounding boundary instructions (302 and 304), as illustrated in
In one embodiment, a level of parallelism is increased by reducing the number of instructions contained within critical sections. Accordingly, by minimizing critical section code, the amount of code requiring execution in strict sequential thread order is reduced. Once loop carried variables and dependent operations are detected, critical sections are demarcated by boundary instructions. Hence, in one embodiment, preparation work for performing thread-partitioning includes identification of all loop carried variables. In one embodiment, the identification of loop carried variables is performed by an input program and therefore additional details regarding detection of loop carried variables are omitted to avoid obscuring the details of the embodiments of the invention.
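The AWAIT/ADVANCE boundary semantics described above can be sketched as follows, assuming a simple token that admits threads to a critical section in strict iteration (round-robin) order; the class and method names are hypothetical:

```python
import threading

class SectionToken:
    """Hedged sketch of AWAIT/ADVANCE boundary semantics for one critical
    section: program-threads enter in strict round-robin (iteration)
    order, so the critical section executes in sequential thread order."""
    def __init__(self, n_threads):
        self.turn = 0
        self.n = n_threads
        self.cv = threading.Condition()

    def await_turn(self, tid):
        # AWAIT: block until it is this thread's turn for the section.
        with self.cv:
            self.cv.wait_for(lambda: self.turn == tid)

    def advance(self, tid):
        # ADVANCE: pass the turn to the next thread in round-robin order.
        with self.cv:
            self.turn = (tid + 1) % self.n
            self.cv.notify_all()

def worker(tid, token, shared, iters):
    for _ in range(iters):
        token.await_turn(tid)
        shared.append(tid)        # critical section: strictly ordered update
        token.advance(tid)
```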
In one embodiment, thread-partitioning compiler 200 (
In the program representations illustrated in
In one embodiment, network processor 400 is, for example, implemented within processor 100 of
Operation
As illustrated in
At process block 510, nodes of the CFG loop are updated to enclose identified critical sections of the sequential application program within pairs of boundary instructions. In one embodiment, a pair of AWAIT and ADVANCE operations is initially inserted at a top 604 and a bottom 606 of a CFG loop 610 of
Accordingly, at process block 518, process blocks 514-516 are repeated for each identified critical section of the sequential application program. For example, as illustrated with reference to
Accordingly, once each pair of AWAIT and ADVANCE operations is inserted into CFG loop 610 for all identified critical sections of the sequential application program, code motion is performed on the CFG loop to reduce the number of operations contained within the critical sections identified by the AWAIT and ADVANCE operations. However, those skilled in the art recognize that code minimization within critical sections may be performed using other data analysis or graph theory techniques, while remaining within the embodiments of the described invention.
As described herein, dataflow analysis is not limited to simply computing definitions and uses of variables (dataflow). Dataflow analysis provides a technique for computing facts about paths through programs or procedures. A prerequisite to the concept of dataflow analysis is the control flow graph (CFG) or simply a flow graph, for example as illustrated with reference to
As described herein, code motion is a technique for inter-block and intra-block instruction reordering (hoisting/sinking). In one embodiment, code motion moves irrelevant code out of identified critical sections in order to minimize the number of instructions/operations contained therein. To perform the inter-block and intra-block instruction reordering, code motion initially identifies motion candidate instructions. In one embodiment, motion candidate instructions are identified using dataflow analysis. Representatively, a series of dataflow problems are solved to carry out both hoisting and sinking of identified motion candidate instructions.
In one embodiment, a three-phase code motion is used to minimize the number of operations within identified critical sections of thread-partitioned loops. Representatively, the first two phases of code motion perform code motion with the AWAIT operations fixed and with the ADVANCE operations fixed, respectively. As a result, each ADVANCE operation and each AWAIT operation is placed into its optimal basic block. In this embodiment, the last phase of code motion performs code motion with both the AWAIT operations and the ADVANCE operations fixed. Representatively, this final phase of code motion moves irrelevant instructions out of critical sections, leaving both the AWAIT and ADVANCE operations at their optimal positions.
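As a sketch of the phase ordering described above, the driver below is a hypothetical outline; `code_motion_pass` stands in for the (unspecified) hoisting/sinking pass and is injected rather than implemented here:

```python
def three_phase_code_motion(cfg_loop, dep_graph, code_motion_pass):
    """Hedged sketch of the three-phase driver: each phase runs the same
    code-motion pass with a different set of boundary operations held
    fixed. `code_motion_pass` is an assumed helper, not a definitive
    implementation."""
    phases = [
        {"AWAIT"},               # phase 1: AWAITs fixed, ADVANCEs placed
        {"ADVANCE"},             # phase 2: ADVANCEs fixed, AWAITs placed
        {"AWAIT", "ADVANCE"},    # phase 3: both fixed; irrelevant code
                                 # moves out of the critical sections
    ]
    for fixed in phases:
        code_motion_pass(cfg_loop, dep_graph, fixed)
    return phases
```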
At process block 550, it is determined whether hoist instructions remain to be detected. Until hoist instructions are no longer detected, motion candidate instructions are hoisted within the basic blocks of the CFG loop at process block 551. At process block 551, instructions in a source basic block of the CFG loop are hoisted according to a dependence graph of the sequential application program. As described herein, a dependence graph is constructed to capture the data dependence of a PPS loop body. Hence, hoisting or sinking a motion candidate instruction must not violate the data dependence of the original program. As described herein, a dependence graph illustrates both data dependence between nodes and control dependence between nodes.
As illustrated with reference to
At process block 542, each computed hoist instruction is hoisted into the corresponding basic block into which it may be placed. At process block 544, it is determined whether additional code motion is detected, such as, for example, a change caused by hoisting of the computed hoist instructions. When a change is detected at process block 544, the current block's predecessors in the CFG loop are enqueued into the hoist queue at process block 546. At process block 548, process blocks 538-546 are repeated until the hoist queue is empty.
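The hoist-queue iteration described above (process blocks 538-548) can be sketched as a standard worklist loop; the `can_hoist` predicate, the block naming, and the data layout are assumptions for illustration:

```python
from collections import deque

def worklist_hoist(cfg_preds, blocks, can_hoist):
    """Hedged sketch of the hoist-queue loop. `cfg_preds` maps a block
    name to its predecessor blocks, `blocks` maps a block name to its
    instruction list, and `can_hoist(instr, b, p)` is an assumed
    predicate deciding whether `instr` may move from block `b` into
    predecessor `p` without violating the dependence graph."""
    queue = deque(blocks)                         # initialize with all blocks
    while queue:                                  # repeat until queue is empty
        b = queue.popleft()
        moved = False
        for instr in list(blocks[b]):
            for p in cfg_preds.get(b, []):
                if can_hoist(instr, b, p):
                    blocks[b].remove(instr)       # hoist out of b ...
                    blocks[p].append(instr)       # ... into predecessor p
                    moved = True
                    break
        if moved:                                 # change detected:
            queue.extend(cfg_preds.get(b, []))    # re-enqueue predecessors
    return blocks
```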
Path predicates are statements about what happens during program execution along a particular control path, quantified over all such paths either universally or existentially. As described herein, a control flow path is a path in the CFG loop. For example, the reaching definitions problem asks, for each control flow node n and each variable definition d, whether d might reach n, where reach means that the definition gives a value to a variable and the variable is not subsequently redefined. For example, a path predicate may be expressed according to the following equation.
REACHDEF(node n, definition d)=there exists a path p from start to n, such that d occurs on p, and no other definition of the same variable occurs after d on p (1)
As described herein, dataflow equations formulate answers to a path predicate as a system of equations describing the solution at each node. That is, for each node in the CFG, we are able to say yes or no regarding whether the definition of interest reaches the node. For example, consider any single three address code statement in the form of:
di: x := y op z (2)
This program statement defines the variable x. Accordingly, if such a statement were contained within the node of a control flow graph, as described herein, the node (N) of the control flow graph containing the program statement is set to generate definition di and kill any other definitions within prior program statements that define x. When analyzed in terms of sets, the following relationships are established:
gen[N]={di}
kill[N]=Dx−{di} (3)
where Dx refers to the set of all definitions of x in the program.
Accordingly, considering a basic block N, figuring out which definitions reach basic block N requires analysis of the predecessors of basic block N. For example, letting the symbol ≺ represent the predecessor relation on two nodes in a CFG, we say that P is a predecessor of B (P≺B) if there is an edge from P to B in the control flow graph. Accordingly, based on the predecessor relation, the following dataflow equations are generated.
in[B]=∪{out[P]|P≺B} (4)
out[B]=gen[B]∪(in[B]−kill[B]) (5)
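Under the gen/kill formulation above, the reaching-definitions system can be solved by straightforward iteration to a fixed point; the block names and definition sets below are illustrative assumptions:

```python
def reaching_definitions(preds, gen, kill):
    """Iterative fixed-point solver for the reaching-definitions
    equations above: in[B] is the union of out[P] over predecessors P,
    and out[B] = gen[B] union (in[B] - kill[B]). `preds` maps a block
    name to its predecessor list; `gen`/`kill` map block names to sets
    of definition names (illustrative encoding)."""
    in_ = {b: set() for b in gen}
    out = {b: set(gen[b]) for b in gen}
    changed = True
    while changed:
        changed = False
        for b in gen:
            new_in = set()
            for p in preds.get(b, []):       # in[B] = union of out[P]
                new_in |= out[p]
            new_out = gen[b] | (new_in - kill[b])
            if new_in != in_[b] or new_out != out[b]:
                in_[b], out[b] = new_in, new_out
                changed = True
    return in_, out
```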
Accordingly, in order to compute motion candidates through dataflow analysis, and in accordance with one embodiment of the invention, the following dataflow equations are formulated for each instruction i:
GEN[i]={N|i is AWAIT(N)} (6)
KILL[i]={N|i is ADVANCE(N)} (7)
IN[i]=∪{OUT[p]|p≺i} (8)
OUT[i]=GEN[i]∪(IN[i]−KILL[i]) (9)
Accordingly, in one embodiment, motion candidates are computed as follows: (1) AWAITs are identified as motion candidates; (2) any other instruction i is a candidate only if IN[i] is not equal to the empty set (∅). In other words, for each instruction i, a bit vector is generated according to the dataflow equations in order to determine whether instruction i is to be identified as a motion candidate. In the embodiment described, ADVANCE operations are not identified as motion candidates.
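A minimal sketch of the motion-candidate rule above, restricted to straight-line code so that each IN set is simply carried from one instruction to the next; the instruction encoding is an assumption:

```python
def motion_candidates(instrs):
    """Sketch of the GEN/KILL/IN/OUT equations over a straight-line
    instruction list. Each instruction is a (opcode, section) pair,
    where opcode is "AWAIT", "ADVANCE", or any other operation (section
    is None for ordinary operations). IN[i] is the set of critical
    sections open just before i; an AWAIT is always a candidate, an
    ADVANCE never is, and any other instruction is a candidate only
    when IN[i] is non-empty (it sits inside a critical section)."""
    candidates = []
    in_set = set()                               # IN[i] for straight-line code
    for op, section in instrs:
        gen = {section} if op == "AWAIT" else set()
        kill = {section} if op == "ADVANCE" else set()
        if op == "AWAIT":
            candidates.append((op, section))     # rule (1): AWAITs
        elif op != "ADVANCE" and in_set:
            candidates.append((op, section))     # rule (2): IN[i] != empty set
        in_set = gen | (in_set - kill)           # OUT[i] feeds the next IN
    return candidates
```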
Referring again to
In other words, the detection and sinking of sink instructions, as well as hoist instructions, should not violate any data dependencies or, for example, control dependencies, indicated by a dependence graph of the sequential application program. Compliance with the dependence graph ensures the correctness of the program-threads generated from the sequential application program. In one embodiment, program-threads maintain the sequential semantics of the original program and enforce dependencies between program-thread iterations corresponding to a PPS loop of the sequential application program.
In one embodiment, a hoist queue is initialized with basic blocks of the CFG loop. In one embodiment, the basic blocks are ordered based on a topological order in the CFG loop. At process block 586, motion candidate instructions are hoisted among the basic blocks until hoist instructions are no longer detected. At process block 588, detected hoist instructions are hoisted within basic blocks that contain AWAIT instructions, based on a dependence graph of the sequential application program, to preserve the original program order.
In one embodiment, process block 588 describes intra-block hoisting. In such an embodiment, motion candidates, excluding both AWAIT operations and ADVANCE operations, are hoisted as high as possible within the basic blocks which contain AWAIT operations, without violating the dependence graph. In one embodiment, an instruction that is hoisted outside of an outermost critical section is no longer regarded as a motion candidate. For example, as illustrated with reference to
Accordingly, in one embodiment, a thread-partition compiler provides automatic multi-thread transformation of a sequential application program using a three-phase code motion to achieve increased parallelism. Within the multi-threaded architecture, only one thread is active at any one time. Hence, the line rate of a network packet processing stage can be greatly improved by hiding memory latency in one thread, overlapping its memory accesses with computations or memory accesses performed by another thread.
Alternate Embodiments
Several aspects of one implementation of the thread-partition compiler for providing multiple program-threads have been described. However, various implementations of the thread-partition compiler provide numerous features including, complementing, supplementing, and/or replacing the features described above. Features can be implemented as part of the compiler or as part of a hardware/software translation process in different embodiments. In addition, the foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the embodiments of the invention. However, it will be apparent to one skilled in the art that the specific details are not required to practice the embodiments of the invention.
In addition, although an embodiment described herein is directed to a thread-partition compiler, it will be appreciated by those skilled in the art that the embodiments of the present invention can be applied to other systems. In fact, data analysis or graph theory techniques for performing code motion within critical sections fall within the embodiments of the present invention, as defined by the appended claims. The embodiments described above were chosen and described to best explain the principles of the embodiments of the invention and its practical applications. These embodiments were chosen to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated.
It is to be understood that even though numerous characteristics and advantages of various embodiments of the present invention have been set forth in the foregoing description, together with details of the structure and function of various embodiments of the invention, this disclosure is illustrative only. In some cases, certain subassemblies are only described in detail with one such embodiment. Nevertheless, it is recognized and intended that such subassemblies may be used in other embodiments of the invention. Changes may be made in detail, especially matters of structure and management of parts within the principles of the embodiments of the present invention to the full extent indicated by the broad general meaning of the terms in which the appended claims are expressed.
Having disclosed exemplary embodiments and the best mode, modifications and variations may be made to the disclosed embodiments while remaining within the scope of the embodiments of the invention as defined by the following claims.