1. Field of the Invention
This invention relates to computing systems, and more particularly, to automatically optimizing parallel execution of multiple work units in a processor by reducing a number of control flow transfer instructions.
2. Description of the Relevant Art
The parallelization of tasks is used to increase the throughput of computer systems. To this end, compilers may extract parallelized tasks from program code to execute in parallel on the system hardware. With a single-core architecture, a single core may include deep pipelines configured to perform multi-threading. To further increase parallel execution on the hardware, a multi-core architecture may include multiple processor cores. This type of architecture may be referred to as a homogeneous multi-core architecture and may provide higher instruction throughput than a single-core architecture. However, particular instructions for a computationally intensive task may consume a disproportionate share of a shared resource, which may in turn delay deallocation of the shared resource. Examples of such specific tasks may include cryptography, video graphics rendering, and garbage collection.
To overcome the performance limitations of conventional general-purpose cores, a computer system may offload specific tasks to special-purpose hardware. This hardware may include a single instruction multiple data (SIMD) parallel architecture, a field-programmable gate array (FPGA), and/or other specialized types of processing cores. When an architecture includes multiple cores of different types it may be referred to as a heterogeneous multi-core architecture. Depending on the scheduling of tasks, this type of architecture may provide higher instruction throughput than a homogeneous multi-core architecture.
The OpenCL® (Open Computing Language) framework supports programming across heterogeneous computing environments and includes a low-level application programming interface (API) for heterogeneous computing. The OpenCL framework (generally referred to herein as “OpenCL”) includes a C-like language. In the OpenCL framework a function call may be referred to as an OpenCL kernel, or simply a “kernel”. A software kernel may be matched with one or more records of data to produce one or more work units of computation. Generally speaking, a SIMD architecture offers good computing performance and cost efficiency when executing such data parallel workloads. However, performance may be greatly reduced if the parallel workload includes irregular, data-dependent branch behavior. A work unit may be data independent from another work unit, but it may have data dependence within itself. A conditional test implemented as a branch instruction may pass for a first work unit, but fail for a second work unit.
During execution of an If-Then-Else construct, each column of a SIMD architecture includes one or more execution units configured to execute the “Then” and the “Else” paths. Current practice includes executing each of the available paths and selectively disabling the execution units corresponding to work units that did not choose the current path. The efficiency of parallel execution may be reduced as the second work unit halts execution and waits in an idle state while the first work unit continues with its ongoing execution during a given pipe stage.
Systems and methods for automatically optimizing parallel execution of multiple work units in a processor by reducing a number of control flow transfer instructions are contemplated.
In one embodiment, a computing system includes a first processor core with a first micro-architecture and a second processor core with a second micro-architecture different from the first micro-architecture. In one embodiment, the first micro-architecture is a general-purpose micro-architecture and the second micro-architecture is a single instruction multiple data (SIMD) micro-architecture. The computing system includes a memory coupled to each of the first and the second processor cores. The memory stores a computer program comprising one or more compute kernels, or function calls. As a compiler traverses the instructions of a given function call, the compiler is configured to identify a control flow transfer instruction, such as a conditional branch. To evaluate the branch, the compiler may utilize one of the first and the second processor cores. One or more records of data may be used to determine one or more outcomes.
Multiple compute sub-kernels may be generated, each comprising code from the function corresponding to a unique outcome of the branch. Multiple work units are produced, each invoked in the compiled computer program by assigning one or more records of data corresponding to a given outcome of the branch to one of the multiple compute sub-kernels associated with the given outcome. The branch may be removed. In one embodiment, the assigning comprises moving said one or more records of data to a same group location in a memory for sequential or stride access. In another embodiment, the assigning comprises remapping access from originally assigned sequential records to said one or more records. A scheduler within an operating system (OS) schedules for execution each of the one or more compute sub-kernels to the first processor core or to the second processor core.
These and other embodiments will be further appreciated upon reference to the following description and drawings.
While the invention is susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, one having ordinary skill in the art should recognize that the invention might be practiced without these specific details. In some instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the present invention.
Referring to
Processing node 110 may also include one or more processing units 170, which may comprise one or more processor cores 172 and data storage buffers 174. Processor core 172 may not be a mirrored silicon image of processor core 112. Processor core 172 may have a micro-architecture different from the micro-architecture used by processor core 112. In one embodiment, the processor core 172 may be a different generation of a same processor family as processor core 112. In another embodiment, the processor core 172 may be a voltage and/or frequency scaled version of processor core 112. In other words, the processor core 172 is not a silicon copy of the processor core 112 with a same functionality and instruction set architecture (ISA), a same clock frequency, same cache sizes, a same memory model, and so forth.
Continuing with the micro-architecture of processor core 172, in yet another embodiment, the processor core 172 may comprise a micro-architecture that provides high instruction throughput for a computationally intensive task. Processor core 172 may have a parallel architecture. For example, the processor core 172 may be a single instruction multiple data (SIMD) core. Examples of SIMD cores include graphics processing units (GPUs), digital signal processing (DSP) cores, or others. In one embodiment, the processing node 110 comprises a single instruction set architecture (ISA). Typically, as is well known in the art, single-ISA multi-core architectures have been shown to provide better power and throughput performance for chip multiprocessors (CMP).
High instruction throughput on processing node 110 may be achieved with measured power consumption within a given power limit when threads of software applications are efficiently scheduled. The threads may be scheduled on one of processor cores 112 and 172 in a manner that each thread has the highest instruction throughput based at least in part on the runtime hardware resources of the processor cores 112 and 172.
Continuing with the components in the processing node 110, the processing node 110 may include memory controller 120, and interface logic 140. In one embodiment, the illustrated functionality of processing node 110 is incorporated upon a single integrated circuit. In one embodiment, processor cores 112 include circuitry for executing instructions according to a predefined general-purpose instruction set. For example, the SPARC® instruction set architecture (ISA) may be selected. Alternatively, the x86, x86-64®, Alpha®, PowerPC®, MIPS®, PA-RISC®, or any other instruction set architecture may be selected. Generally, processor core 112 accesses the cache memory subsystems 114, respectively, for data and instructions. If the requested block is not found in cache memory subsystem 114 or in shared cache memory subsystem 118, then a read request may be generated and transmitted to the memory controller within the node to which the missing block is mapped.
In one embodiment, processing unit 170 is a graphics processing unit (GPU). Modern GPUs are very efficient at manipulating and displaying computer graphics. The highly parallel structure of GPUs makes them more effective than general-purpose central processing units (CPUs), such as processing unit 115, for a range of complex algorithms. Typically, a GPU executes calculations used for graphics and video and a CPU executes calculations for many more system processes than graphics alone. Conventional GPUs utilize very wide single instruction multiple data (SIMD) architectures to achieve high throughput in image-rendering applications. Such applications generally entail executing the same programs, such as vertex shaders or pixel shaders, on large numbers of objects (vertices or pixels). Since each object is processed independently of other objects, but the same sequence of operations is used, a SIMD architecture provides considerable performance enhancement. GPUs have also been considered for non-graphical calculations.
In one embodiment, the GPU 170 may be located on a video card. In another embodiment, the GPU 170 may be integrated on the motherboard. In yet another embodiment, the illustrated functionality of processing node 110 may be incorporated upon a single integrated circuit. In such an embodiment, the CPU 115 and the GPU 170 may be proprietary cores from different design centers. Also, the GPU 170 may now be able to directly access both local memories 114 and 118 and main memory via memory controller 120 from the processing node 110, rather than perform memory accesses off-chip via interface 140. This embodiment may lower latency for memory accesses for the GPU 170, which may translate into higher performance.
Continuing with the components of processing node 110 in
Generally, packet processing logic 116 is configured to respond to control packets received on the links to which processing node 110 is coupled, to generate control packets in response to processor cores 112 and/or cache memory subsystems 114, to generate probe commands and response packets in response to transactions selected by memory controller 120 for service, and to route packets for which node 110 is an intermediate node to other nodes through interface logic 140. Interface logic 140 may include logic to receive packets and synchronize the packets to an internal clock used by packet processing logic 116.
Turning now to
For a data-parallel software application, an N-Dimensional computation domain may define an organization of an “execution domain”. The N-Dimensional computation domain may also be referred to as an N-Dimensional grid or an N-Dimensional Range (“NDRange”). The NDRange may be a one-, two-, or three-dimensional space. This dimensional space may also be referred to as an index space. For example, a software application may perform data processing on a two-dimensional (2D) array of data, such as an image file. The software application may perform an algorithm developed by a software programmer on a pixel-by-pixel basis of a 2D image. A given compute kernel may be invoked over the index space (the NDRange).
Typically after compilation, the arguments and parameters of each compute kernel are set. Additionally, associated memory objects and buffers are created. A given instance of the compute kernel may be executed as its own software thread. However, a compute kernel may include control flow transfer instructions that create forks, whereas a fork in a computer program typically creates a software thread, by common definition. A given instance of the compute kernel at a given point in the index space may be referred to as a work unit or work item. A work unit may operate with the one or more instructions in the compute kernel on a record of data corresponding to a given pixel (a given index) of the 2D image. Typically, work units have an associated unique identifier (ID). In another example, an introductory computer program processing the string “Hello World” may have one work unit for computing each letter in the string.
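The relationship between an index space and its work units can be sketched in a few lines of Python. This is a minimal illustration only, not the framework's API: the helper `invoke_over_ndrange`, the kernel `invert_pixel`, and the 2x2 image values are hypothetical names and data chosen for the example, and the loop runs sequentially where real hardware may run the work units in parallel.

```python
from itertools import product

def invoke_over_ndrange(kernel, nd_range, image):
    """Invoke the kernel once per point of the index space (one work unit each)."""
    for index in product(*(range(extent) for extent in nd_range)):
        kernel(index, image)

def invert_pixel(index, image):
    # Hypothetical kernel body: operate on the record (pixel) at this index.
    row, col = index
    image[row][col] = 255 - image[row][col]

image = [[0, 64], [128, 255]]           # assumed 2x2 image for illustration
invoke_over_ndrange(invert_pixel, (2, 2), image)
print(image)
```

Each tuple produced by `product` plays the role of a unique work unit ID in a two-dimensional NDRange.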
The NDRange may define a total number of work units that execute in parallel if there is sufficient hardware support. For example, the NDRange may define a number of 280 work units, but a GPU may support the simultaneous execution of 64 work units at any given time. The total number of work units may define a global work size. As is well known to those skilled in the art, the work units may be further grouped into work groups. Each work group may have a unique identifier (ID). The work units within a given work group may be able to communicate with each other and synchronize execution and coordinate memory accesses. A number of work units may be clustered into a wave front for simultaneous execution on a GPU in a SIMD manner. Regarding the example above for 280 total work units, a wave front may include 64 work units.
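The arithmetic of the 280 work unit example can be made concrete with a short Python sketch. It assumes, for illustration only, that work units are assigned to wave fronts in consecutive ID order and that the final wave front may be partially filled; the function name `partition_into_wavefronts` is hypothetical.

```python
def partition_into_wavefronts(global_work_size, wavefront_size):
    """Split work unit IDs 0..global_work_size-1 into consecutive wave fronts."""
    return [
        range(start, min(start + wavefront_size, global_work_size))
        for start in range(0, global_work_size, wavefront_size)
    ]

wavefronts = partition_into_wavefronts(280, 64)
sizes = [len(w) for w in wavefronts]
print(len(wavefronts))  # number of wave fronts needed
print(sizes)            # work units per wave front
```

Under these assumptions, 280 work units occupy four full wave fronts of 64 and a fifth wave front holding the remaining 24, which leaves 40 lanes of the last wave front idle.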
The OpenCL framework is an open programming standard for various compute devices, or OpenCL devices. A software programmer may avoid writing vendor-specific code, which may result in improved code portability. Other frameworks are available and may offer more vendor-specific coding for heterogeneous architectures. For example, NVIDIA offers Compute Unified Device Architecture (CUDA®) and AMD offers ATI Stream®. With a CUDA framework, a compute kernel is typically statically compiled when the computer program is compiled. With an OpenCL framework, a compute kernel is typically compiled with a Just-In-Time (JIT) method. The JIT method may generate an appropriate binary code after obtaining the system configuration. With a JIT compilation method, the compilation time is included with the total execution time. Therefore, compiler optimizations may increase the execution time. In addition, at run time the OpenCL compiler generates multiple versions of compute kernels. One version of a compute kernel may be generated for each OpenCL device type, such as a general-purpose CPU, a SIMD GPU, and so forth.
The two frameworks, OpenCL and CUDA, have a difference in terminology between their respective execution models. For example, a work unit, a work group, a wave front and an NDRange in OpenCL have corresponding terms in CUDA such as a thread, a thread block, a warp and a grid. Throughout the rest of the description, the terms corresponding to OpenCL are used. However, the systems and methods described may apply to CUDA, ATI Stream and other frameworks.
As shown in
The OpenCL framework may invoke an instance of a compute kernel multiple times in parallel. Each call to the compute kernel has one associated unique ID (a work unit ID) that may be fetched by calling an internal function named get_global_id(0). Regarding the above example in code 220, the compute kernel “Power2” is invoked once for each data value in the INPUT array. In this case, the compute kernel “Power2” is invoked 10 times. Accordingly, ten unique work unit IDs are fetched. With a JIT compiling method, these instances are invoked at runtime. The OpenCL framework may differentiate between these different instances by utilizing the unique work unit IDs. The data to be operated on (a record) may also be specified, such as a specific data value in the INPUT array. Therefore, at runtime, a work unit may be scheduled by default to a same OpenCL device as the associated compute kernel is scheduled.
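The per-record invocation described above can be emulated in Python as follows. This is a sketch under stated assumptions: the body of “Power2” is not reproduced here, so the example assumes it squares its input; the INPUT values 1 to 10 are assumed; and the explicit `global_id` parameter stands in for the value that get_global_id(0) would return inside an OpenCL kernel.

```python
def power2(global_id, input_values, output_values):
    # Hypothetical kernel body: square the record selected by the work unit ID.
    output_values[global_id] = input_values[global_id] ** 2

INPUT = list(range(1, 11))      # ten records (values assumed for illustration)
OUTPUT = [0] * len(INPUT)

# Sequential stand-in for ten parallel invocations, one per work unit ID.
for work_unit_id in range(len(INPUT)):
    power2(work_unit_id, INPUT, OUTPUT)

print(OUTPUT)
```

Each loop iteration corresponds to one work unit: the same kernel code paired with a different record through its unique ID.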
Turning now to
Turning now to
Each of the compute kernels shown in
In one embodiment, hardware computing system 410 incorporates a general-purpose processor core 112 and a SIMD processor core 172, each configured to process one or more work units. In another embodiment, system 410 includes two other heterogeneous processor cores. In general, for a given application, operating system 420 sets up an address space for the application, loads the application's code into memory, sets up a stack for the program, branches to a given location inside the application, and begins execution of the application. Typically, the portion of the operating system 420 that manages such activities is the operating system (OS) compute kernel 422. The OS compute kernel 422 is referred to as “OS compute kernel” in order not to confuse it with a compute kernel, or a function call. The OS compute kernel 422 may further determine a course of action when insufficient memory is available for the execution of the application. As stated before, an application may be divided into more than one compute kernel and system 410 may be running more than one application. Therefore, there may be several compute kernels running in parallel. The OS compute kernel 422 may decide at any time which of the simultaneously executing compute kernels is allocated to the processor cores 112 and 172. The OS compute kernel 422 may allow a process to run on a core of a processor, which may have one or more cores, for a given amount of time referred to as a time slice. An OS scheduler 424 in the operating system 420 may comprise decision logic for assigning compute kernels to cores.
In one embodiment, only one compute kernel can execute at any time on any one of the hardware computation units 412a-412g and 412h-412r. These hardware computation units comprise hardware that can handle the execution of a given instruction of a given work unit with associated data. This hardware may include an arithmetic logic unit that is configured to perform addition, multiplication, zero detect, a bit-wise shift, division, video graphics and multimedia instructions or other operations known to those skilled in the art of processor design. These hardware computation units may include a hardware thread in a multi-threaded processor, a parallel hardware column in a SIMD micro-architecture, and so forth.
The dashed lines in
Referring to
Each of the cores 510 and 560 includes a cache memory subsystem 530. As shown, the general-purpose core 510 logically has the cache memory subsystem 530 separate from the control logic 520 and the arithmetic logic units (ALUs) 540. The data flow within the core 510 may be pipelined, although storage elements, such as pipeline registers, are not shown in order to simplify the illustration. In a given pipeline stage, an ALU may be unused if instructions in this stage do not utilize a certain type of ALU or if another work unit (or another thread for a general-purpose core) consumes the ALUs during this stage.
As shown, the SIMD core 560 has the cache memory subsystem 530 grouped with control logic 520 for each row of computation units 542. The data flow within the core 560 may be pipelined, although storage elements, such as pipeline registers, are not shown in order to simplify the illustration. In a given pipeline stage, a computation unit may be unused if an associated instruction in this stage is not executed based on a previous failed test, such as a not-taken branch.
Referring now to
In a balanced scheme, each of the instructions 602-608 consumes an equal amount of resources in each stage. However, typically, a general-purpose core does not replicate resources for each instruction due to real-estate cost, power consumption and other design considerations. Therefore, the workload may become unbalanced. For example, the instruction 606 may consume more resources for one or more pipe stages due to its computation-intensive behavior. As shown, the resources 630 consumed by this instruction may become far greater than the resources consumed by other instructions. In fact, the computation-intensive instruction may block the usage of hardware resources by other instructions.
Some computation-intensive tasks may place pressure on shared resources within the general-purpose core 112. Thus, throughput losses occur for both the computation-intensive process and other processes waiting for the shared resources. In addition, some instructions occupy the shared resource and other resources on the die to support the computation being performed on the shared resource. Such a long latency instruction may concurrently block other processes from using several resources during a long latency.
Referring now to
Each of the computation units within a given row may be a same computation unit. Each of these computation units may operate on a same instruction, but different data associated with a different work unit. As shown, some of the work units pass the test provided by the branch instruction 704 and other work units fail the test. The SIMD core 172 may execute each of the available paths and selectively disable the execution units, such as the computation units, corresponding to work units that did not choose the current path. For example, during execution of an If-Then-Else construct statement, within each column of a SIMD architecture are execution units configured to execute the “Then” (Path A) and the “Else” (Path B) paths. The efficiency of parallel execution may be reduced as the first and the second work units halt execution and wait as the third work unit continues with its ongoing execution. Therefore, not all of the computation units are active computation units 710 in a given row after execution of the branch instruction 704. If a large number of computation units are inactive during a given pipe stage, the efficiency and throughput of the SIMD core is reduced.
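The cost of divergence within a row of computation units can be illustrated with a small Python model. This is a simplified sketch, not a description of any particular hardware: it assumes one pass over the lanes per executed path, counts a lane as active only in the pass for its own path, and uses the hypothetical function name `simd_if_then_else` with assumed test and path functions.

```python
def simd_if_then_else(records, test, then_path, else_path):
    """Model masked SIMD execution; return (results, lane utilization)."""
    if not records:
        return [], 1.0
    mask = [test(r) for r in records]       # per-lane branch outcome
    results = [None] * len(records)
    passes = 0
    if any(mask):                           # Path A ("Then"): failing lanes disabled
        passes += 1
        for lane, r in enumerate(records):
            if mask[lane]:
                results[lane] = then_path(r)
    if not all(mask):                       # Path B ("Else"): passing lanes disabled
        passes += 1
        for lane, r in enumerate(records):
            if not mask[lane]:
                results[lane] = else_path(r)
    utilization = len(records) / (passes * len(records))
    return results, utilization

is_even = lambda v: v % 2 == 0
mixed, mixed_util = simd_if_then_else([1, 2, 3, 4], is_even, lambda v: v * 10, lambda v: -v)
uniform, uniform_util = simd_if_then_else([2, 4, 6, 8], is_even, lambda v: v * 10, lambda v: -v)
print(mixed_util, uniform_util)
```

In this model a row whose work units disagree on the branch needs two passes and reaches 50% utilization, while a row whose work units all agree needs only one pass.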
Referring now to
Only one of the two compute kernels in code 232 is executed during runtime for a given group of records. For example, the conditional IF statement may evaluate to “true” if a given record has an even data value. Similarly, the conditional IF statement may evaluate to “false” if a given record has an odd data value. For a given array with data values 1 to 10, the even records {2, 4, 6, 8, 10} may evaluate to true and the function doWorkA is executed. The odd records {1, 3, 5, 7, 9} may evaluate to false and the function doWorkB is executed. After the code transformation from code 230 to code 232, scheduling may include combining the function “KernelFunctionA” with the even records {2, 4, 6, 8, 10} to generate five work units. Similarly, scheduling may include combining the function “KernelFunctionB” with the odd records {1, 3, 5, 7, 9} to generate five additional work units. In one embodiment, the combining of the compute kernel code with a record of data may be performed by shuffling, or rearranging, the records in memory into groups. In another embodiment, this combination may be performed by creating an index array that maps sequential or stride indices to scattered actual locations in memory. Further details of both embodiments are provided later below.
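The even/odd example above can be sketched in Python. The bodies of doWorkA and doWorkB are not given in this description, so the arithmetic below is a placeholder assumption, as is the name `evaluate` for the conditional test; only the partitioning of records between the two sub-kernels follows the text.

```python
def do_work_a(value):
    return value * 2        # placeholder body, assumed for illustration

def do_work_b(value):
    return value + 1        # placeholder body, assumed for illustration

def evaluate(value):
    return value % 2 == 0   # stands in for the conditional IF test

# Branch-free sub-kernels: each contains only one path of the original kernel.
def kernel_function_a(records):
    return [do_work_a(r) for r in records]

def kernel_function_b(records):
    return [do_work_b(r) for r in records]

records = list(range(1, 11))
even_records = [r for r in records if evaluate(r)]       # branch evaluated ahead of execution
odd_records = [r for r in records if not evaluate(r)]

results_a = kernel_function_a(even_records)   # five work units
results_b = kernel_function_b(odd_records)    # five additional work units
print(even_records, odd_records)
```

Because the test is applied before the sub-kernels run, neither sub-kernel contains a branch, and every work unit within a sub-kernel follows the same instruction sequence.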
Referring now to
After transformation, the application code 912 may include the components used in the application code 910, but without the conditional IF statements 960 and 936. In addition, each of the THEN Paths 970 and 938 and the ELSE Paths 980 and 940 may be altered to include a surrounding function call that inputs the proper records. A similar transformation is shown in code 232. Again, the combining of the compute kernel code with a record of data may be performed by shuffling, or rearranging, the records in memory into groups. Alternatively, this combination may be performed by creating an index array that maps sequential or stride indices to scattered actual locations in memory.
Turning now to
Turning now to
In block 1102, a software program or subroutine may be located and analyzed. This software program may be written for compilation and execution on a heterogeneous multi-core architecture. Program code may refer to any portion of a software application, subroutine, dynamic linked library, or otherwise. A pathname may be entered at a command prompt by a user, a pathname may be read from a given directory location, or elsewhere, in order to begin compiling the source code. The program code may be written by a designer in a high-level language such as C, a C-like language such as OpenCL™, and so forth. In one embodiment, the source code is statically compiled. In such an embodiment, during a static front-end compilation, the source code may be translated to an intermediate representation (IR). A back-end compilation step may translate the IR to machine code. The static back-end compilation may perform various transformations and optimizations. In another embodiment, the source code is compiled with a Just-In-Time (JIT) method. The JIT method may generate an appropriate binary code after obtaining the system configuration. With either method, the compiler may identify a compute kernel in the program code.
In block 1104, the compiler may read one or more instructions of the compute kernel and analyze them. A conditional statement may be a control flow transfer instruction, such as a branch. Different types of control flow transfer instructions may include forward/backward branches, direct/indirect branches, jumps, and so forth. It may be possible for a compiler or other tool to statically determine a direction of a branch and/or a target of a branch. However, in one embodiment, some processing typically performed during runtime on associated data may be performed during compilation. For example, a simple test to determine a direction (taken, not-taken) of a branch may be performed. Although compilation may be referred to as “static compilation”, one or more dynamic operations may be performed. This compilation may also be referred to as “pre-runtime compilation”. Another example of a dynamic step performed at this time is identifying a next instruction to execute in each of the THEN, ELSE IF and ELSE blocks of an If-Then-ElseIf-Else construct.
If a conditional statement is not identified (conditional block 1106), then in block 1110, any remaining analysis and instrumentation is completed and work units are scheduled for runtime execution. If a conditional statement is identified (conditional block 1106), and data is available for pre-runtime evaluation (conditional block 1112), then in block 1114, the access of data for runtime execution of compute kernels is altered based on the branch results. For example, the combining of the compute kernel code with a record of data may be performed by shuffling, or rearranging, the records in memory into groups. Alternatively, this combination may be performed by creating an index array that maps sequential or stride indices to scattered actual locations in memory.
Turning now to
The original records order 1240 may not yield optimal parallel execution of multiple work units generated from compute kernels. Therefore, as described in block 1112 of method 1100, the code for compute kernels may be analyzed. Given instructions may be evaluated with associated records of data. Based on the results, the records may be rearranged in memory to provide optimal parallel execution of the generated work units. In one embodiment, within a given group of records, each associated work unit may return a same result for one or more conditional instructions like a branch.
The records 1222 and 1226 may provide the same results for two branches in the example shown. Here, each of records 1222 and 1226 may fail each of the two branches, as do other records in the arrangement 1260. Therefore, each of records 1222 and 1226 may be moved to data group 1250. Similarly, each of records 1224 and 1228 may fail a first branch and pass a second branch, as do other records in the arrangement 1280. Therefore, each of records 1224 and 1228 may be moved to data group 1270.
The work units associated with data group 1250 may be scheduled together for execution. The work units associated with data group 1270 may be scheduled together for execution after the work units associated with data group 1250. In a simple example, originally record 1222 may have been associated with a work unit ID 0, record 1224 may have been associated with a work unit ID 1, record 1226 may have been associated with a work unit ID 2, and so forth. However, after the data shuffling, the record 1222 may still be associated with a work unit ID 0, but record 1226 may be associated with a work unit ID 1. A record (not shown) following record 1226 in data group 1250 may be associated with a work unit ID 2, and so forth. The record 1224 may be associated with a work unit ID following a work unit ID for a last record in data group 1250. Referring again to
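The grouping and renumbering described above can be sketched in Python. This is an illustrative assumption about one way to perform the shuffle: records are keyed by the tuple of their branch outcomes, groups are laid out in first-seen order, and a record's position in the shuffled list serves as its new work unit ID. The function name, the two branch tests, and the record values are hypothetical.

```python
from collections import defaultdict

def shuffle_by_branch_outcomes(records, branch_tests):
    """Group records that produce identical outcomes for every branch test."""
    groups = defaultdict(list)
    for record in records:
        signature = tuple(test(record) for test in branch_tests)
        groups[signature].append(record)     # original order kept within a group
    # Lay the groups out contiguously; the position becomes the new work unit ID.
    shuffled = [record for signature in groups for record in groups[signature]]
    return shuffled, dict(groups)

branch_tests = [lambda r: r > 100, lambda r: r % 2 == 0]   # assumed tests
records = [3, 4, 5, 8, 7]
shuffled, groups = shuffle_by_branch_outcomes(records, branch_tests)
print(shuffled)
```

Here the records that fail both tests form the first group and those that fail the first test but pass the second form the next group, so the record originally at work unit ID 2 moves to work unit ID 1.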
As shuffling of data in memory may include copying large amounts of data, an alternative method includes creating an index array and accessing the data via the index array. Referring now to
In block 1402, an index N may be reset to 0 and the code of a given compute kernel may be analyzed. The index N may maintain a count of branch instructions. In block 1404, a direction of a detected branch instruction may be evaluated. For a given conditional statement in a compute kernel, the branch results 1304 are as shown, wherein a binary 1 indicates “Taken/Pass” and a binary “0” indicates “Not-Taken/Fail”. In other examples, the indications of the binary values may be switched. A conditional statement in the compute kernel code may include an “EvaluateFunction” as shown in code 230 in
In one embodiment, in order to efficiently determine a mapping 1320 between work unit IDs 1310 and a created index array 1312 used for actual parallel execution of the work units, a prefix sum technique may be used for parallel processing. Generally speaking, a prefix sum is an operation on lists in which each element in the result list is obtained from the sum of the elements in an operand list up to its index. For example, an input array with n values, {x_0, ..., x_(n-1)}, may be used to produce an output array {y_0, y_1, y_2, ..., y_(n-1)} where y_0 = x_0, y_1 = y_0 + x_1, and in general y_i = y_(i-1) + x_i. This prefix sum 1306 may be stored in a sum array. The prefix sum technique may utilize one or more instructions already supported by a processor. In block 1408, the prefix sum 1306 may be generated from the branch results 1304.
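As a brief illustration, an inclusive prefix sum over a list of branch results can be computed in Python with `itertools.accumulate`. The branch results below are assumed values chosen for the example, and this sequential form only shows the result; a parallel implementation would compute the same array differently.

```python
from itertools import accumulate

# Assumed branch results: 1 = Taken/Pass, 0 = Not-Taken/Fail.
branch_results = [1, 0, 0, 1, 1, 0, 1, 0]

# Inclusive prefix sum: element i is the sum of branch_results[0..i].
prefix_sum = list(accumulate(branch_results))
print(prefix_sum)
```

Each prefix sum value counts how many branches were taken up to and including that record, which is what makes it usable as a compact position within the taken partition.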
In block 1410, an index array may be determined for each partition found in block 1406. In one embodiment, index arrays may be generated by utilizing the algorithm 1510 as shown in
Referring to
For each taken branch indicated as a binary “1” in the branch results 1304, the “Then” path of the algorithm 1510 may be executed. Here, an index is set to one less than a prefix sum value associated with a given taken branch and a record that caused the taken direction. An index array associated with a given partition determined in step 1406 of method 1400 is updated with an ID of the taken branch. In one embodiment, the ID is the ID of the record that produced the taken direction of the given branch.
For each not-taken branch indicated as a binary “0” in the branch results 1304, the “Else” path of the algorithm 1510 may be executed. Here, an index is set to the record ID value less the prefix sum value associated with a given not-taken branch and the record that caused the not-taken direction. An index array associated with a different partition than the partition described above for the “Then” path, and determined in block 1406 of method 1400, is updated with an ID of the not-taken branch. Once each of the records has been traversed, an associated index array has been constructed for each partition.
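The “Then” and “Else” paths described above may be sketched as follows. This is an illustrative sketch only and is not the algorithm 1510 itself: it assumes zero-based record IDs, an inclusive prefix sum over the branch results, and exactly two partitions. The names `build_index_arrays`, `taken_ids`, and `not_taken_ids` are hypothetical.

```python
from itertools import accumulate


def build_index_arrays(branch_results):
    """Build one index array per partition from 0/1 branch results.

    Assumes zero-based record IDs and an inclusive prefix sum.
    """
    prefix_sum = list(accumulate(branch_results))
    num_taken = prefix_sum[-1] if prefix_sum else 0
    taken_ids = [None] * num_taken
    not_taken_ids = [None] * (len(branch_results) - num_taken)

    # Each iteration touches a distinct slot, so in principle every
    # record could be handled by a separate work unit in parallel.
    for record_id, taken in enumerate(branch_results):
        if taken:
            # "Then" path: one less than the prefix sum value.
            taken_ids[prefix_sum[record_id] - 1] = record_id
        else:
            # "Else" path: record ID less the prefix sum value.
            not_taken_ids[record_id - prefix_sum[record_id]] = record_id
    return taken_ids, not_taken_ids


taken_ids, not_taken_ids = build_index_arrays([1, 0, 1, 1, 0, 0, 1, 0])
print(taken_ids, not_taken_ids)
```

Under these assumptions, each index array lists the record IDs of one partition in their original relative order, so work units assigned to the same partition follow the same branch direction.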
Referring to
Turning now to
Some processors may contain a prefix sum instruction that can be used to accelerate the index array generation process. With index array generation and subsequent remapping of the access of records of data during execution, the computation units within a work group are enabled without reshuffling data in memory. In such an embodiment, the data is not reshuffled back into an original order once the computation is complete. However, in some embodiments the generated index arrays may be used to reshuffle the data in memory, and after execution of the compute kernels and compute sub-kernels, the index arrays may be used to return the data to original locations. The reshuffled data may be more coalesced, or compact, in memory. Coalesced data typically provides better performance on GPUs that may have no, or limited, caching mechanisms. Accordingly, the benefit of increased performance during execution may outweigh the cost of reshuffling records of data in memory. The generated index arrays may also be used to rearrange the record data into a different memory layout, such as changing a row-oriented arrangement into a column-oriented arrangement, or vice-versa.
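The two alternatives described above, remapped access without moving data and reshuffling followed by restoration, may be illustrated with the following sketch. All function names, the sample records, and the sample index arrays are hypothetical; the sketch models memory as a Python list and does not model coalescing or caching behavior.

```python
def run_remapped(records, index_array, kernel):
    """Apply kernel only to records named in index_array; data stays in place."""
    out = list(records)
    for record_id in index_array:
        out[record_id] = kernel(records[record_id])
    return out


def reshuffle(records, index_arrays):
    """Physically group records partition by partition; return data and order."""
    order = [rid for ids in index_arrays for rid in ids]
    return [records[rid] for rid in order], order


def restore(shuffled, order):
    """Return reshuffled records to their original locations."""
    original = [None] * len(shuffled)
    for position, rid in enumerate(order):
        original[rid] = shuffled[position]
    return original


records = [10, 11, 12, 13, 14, 15, 16, 17]
taken_ids, not_taken_ids = [0, 2, 3, 6], [1, 4, 5, 7]

doubled = run_remapped(records, taken_ids, lambda r: r * 2)
shuffled, order = reshuffle(records, [taken_ids, not_taken_ids])
print(doubled, shuffled, restore(shuffled, order))
```

In the remapped variant the index array adds one level of indirection per access, whereas the reshuffled variant pays a one-time copy cost in exchange for contiguous records within each partition.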
In one embodiment, the compiler may analyze the control flow test decisions of a compute kernel and produce compute sub-kernels as shown above in
In another embodiment, an architecture with a low-cost compute kernel dispatch and memory sharing between a CPU and a GPU may have the CPU execute the control flow graph, and have a corresponding compute kernel for each basic block of the control flow graph. The CPU may be in charge of dispatching the proper compute kernels, which do not have control flow, at each decision point in the control flow graph.
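One possible reading of this host-driven arrangement is sketched below. It is an assumption-laden illustration, not the described architecture: the control flow graph is modeled as a dictionary, each basic block is a branch-free function standing in for a compute kernel, and the decision at each block is made by an ordinary host-side function. The names `run_cfg_on_host`, `blocks`, and the block labels are hypothetical.

```python
def run_cfg_on_host(blocks, entry, data):
    """Walk a control flow graph on the host (CPU).

    blocks maps a block name to (kernel, choose_next). kernel is a
    branch-free function standing in for a dispatched compute kernel;
    choose_next returns the next block name, or None at an exit.
    """
    name = entry
    trace = []
    while name is not None:
        kernel, choose_next = blocks[name]
        data = kernel(data)        # branch-free work for this basic block
        trace.append(name)
        name = choose_next(data)   # decision point evaluated on the host
    return data, trace


blocks = {
    "entry": (lambda d: [x + 1 for x in d],
              lambda d: "big" if sum(d) > 10 else "small"),
    "big": (lambda d: [x * 2 for x in d], lambda d: None),
    "small": (lambda d: [x - 1 for x in d], lambda d: None),
}

small_result, small_trace = run_cfg_on_host(blocks, "entry", [1, 2, 3])
big_result, big_trace = run_cfg_on_host(blocks, "entry", [4, 5, 6])
print(small_result, small_trace, big_result, big_trace)
```

Such an arrangement is only attractive when kernel dispatch is inexpensive and memory is shared, since every decision point incurs a round trip to the host.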
It is noted that the above-described embodiments may comprise software. In such an embodiment, the program instructions that implement the methods and/or mechanisms may be conveyed or stored on a computer readable medium. Numerous types of media that are configured to store program instructions are available and include hard disks, floppy disks, CD-ROM, DVD, flash memory, Programmable ROMs (PROM), random access memory (RAM), and various other forms of volatile or non-volatile storage. Generally speaking, a computer accessible storage medium may include any storage media accessible by a computer during use to provide instructions and/or data to the computer. For example, a computer accessible storage medium may include storage media such as magnetic or optical media, e.g., disk (fixed or removable), tape, CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, or Blu-Ray. Storage media may further include volatile or non-volatile memory media such as RAM (e.g., synchronous dynamic RAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM, low-power DDR (LPDDR2, etc.) SDRAM, Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, Flash memory, and non-volatile memory (e.g., Flash memory) accessible via a peripheral interface such as the Universal Serial Bus (USB) interface. Storage media may include microelectromechanical systems (MEMS), as well as storage media accessible via a communication medium such as a network and/or a wireless link.
Additionally, program instructions may comprise behavioral-level descriptions or register-transfer level (RTL) descriptions of the hardware functionality in a high level programming language such as C, a hardware description language (HDL) such as Verilog or VHDL, or a database format such as GDS II stream format (GDSII). In some cases the description may be read by a synthesis tool which may synthesize the description to produce a netlist comprising a list of gates from a synthesis library. The netlist comprises a set of gates which also represent the functionality of the hardware comprising the system. The netlist may then be placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks may then be used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the system. Alternatively, the instructions on the computer accessible storage medium may be the netlist (with or without the synthesis library) or the data set, as desired. Additionally, the instructions may be utilized for purposes of emulation by a hardware based type emulator from such vendors as Cadence®, EVE®, and Mentor Graphics®.
Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.