The present invention relates generally to the field of computer hardware accelerators. More particularly, it concerns coarse grain reconfigurable array accelerator performance and efficiency.
1. Summary of the Prior Art
Accelerators are now widely accepted as an inseparable part of the computing fabric. Special purpose, custom hardware accelerators have been shown to achieve the highest performance with the least power consumption (Chung & Milder, 2010). However, they are not programmable and incur a high design cost. On the other hand, Graphics Processing Units (GPUs), although programmable, are limited to accelerating only parallel loops (Betkaoui, 2010). Field Programmable Gate Arrays (FPGAs) have some of the advantages of hardware accelerators and are also programmable (Che, et al., 2008). However, their fine-grain re-configurability incurs a very high cost in terms of energy efficiency (Theodoridis, et al., 2007).
Coarse Grain Reconfigurable Arrays (CGRAs) are programmable accelerators that promise high performance at low power consumption (A. C., et al., 2007). The ADRES CGRA (F. B., et al., 2008) has been shown to achieve performance and power efficiency of up to 60 GOPS/W at the 90 nm technology node. Some CGRAs are arrays of processing elements (PEs) that are connected to each other through an interconnection network, such as the CGRA 100 shown in
One of the major challenges associated with CGRAs is that of accelerating loops with if-then-else structures. Hamzeh, et al., 2014 show that it is important to accelerate loops with if-then-else constructs because many long-running loops in important applications contain them. Since the result of the conditional is known only at run time, existing solutions handle loops with if-then-else in CGRAs by predication (Mahlke & Lin; Mahlke, 1995; Han, et al., 2013; Chan & Choi, 2008). The partial predication and full predication schemes, for example, execute code in the “if” block and the “else” block of an if-then-else and then select which branch outputs to use later. These techniques execute instructions from both paths of an if-then-else structure and then commit the results of only the instructions from the path taken by the branch at run time. While predication allows for correct execution, it doubles the resource usage and therefore results in inefficient execution. Dual-issue schemes (Han, et al., 2010; Han, et al., 2013; Hamzeh, et al., 2014) try to improve this by fetching the instructions from both paths but executing only the instructions from the correct path. They achieve higher performance and efficiency, but at the cost of increased instruction fetch bandwidth, since they must fetch two instructions per PE every cycle.
2. Background and Related Work
Loop kernels are the most desirable parts of a program to be accelerated on a CGRA (Rau, et al., 1994). Most computational loop kernels contain if-then-else structures (Hamzeh, et al., 2014); hence, accelerating such loops is essential for effective loop acceleration on CGRAs. Consider loop kernels 200 and 200a with if-then-else structures as shown in
In a partial predication scheme (Han, et al., 2013; Mahlke, et al., 1992), the if-path 204a and else-path 208a operations of a conditional branch are executed in parallel on different PE resources. The final result of an output is selected (e.g., by select 212) between the outputs of the two paths based on the outcome of the conditional operation (the predicate value, S). This is illustrated in data flow graph 220a in
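For illustration only, the following Python sketch mimics the behavior of partial predication on a hypothetical scalar kernel; the operation names ct and cf and the predicate S follow the spirit of the discussion above, while the arithmetic itself is made up for the example. Both paths are computed every iteration, and a select commits the result of the taken path.

    def partial_predication_iteration(a, b):
        # Predicate (branch condition) computed on one PE.
        S = a > b
        # Both paths are executed on separate PE resources,
        # regardless of the predicate value.
        ct = a + b      # if-path result (illustrative)
        cf = a - b      # else-path result (illustrative)
        # A select operation commits the result of the taken path.
        c = ct if S else cf
        return c

    # Every iteration pays for both ct and cf, even though only one is used.
    print(partial_predication_iteration(5, 3))   # if-path taken   -> 8
    print(partial_predication_iteration(2, 7))   # else-path taken -> -5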
In a full predication scheme (Han & C., et al., 2013; Han, et al., 2013), the outputs of false-path operations are suppressed based on a predicate bit (0 for false-path operations). Operations that update the same variable have to be mapped to the same PE, albeit at different cycles.
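As a rough behavioral sketch only (not the exact semantics of the cited scheme), full predication can be pictured as executing the operations of both paths on the same PE in different cycles and suppressing the write-back of the false path; the arithmetic below is illustrative.

    def full_predication_iteration(a, b):
        S = a > b                      # predicate bit (illustrative condition)
        # Cycle t: the if-path operation executes; its write-back to c is
        # committed only when the predicate is true.
        tmp_if = a + b
        c = tmp_if if S else None
        # Cycle t+1: the else-path operation executes on the same PE; its
        # write-back is committed only when the predicate is false.
        tmp_else = a - b
        c = tmp_else if not S else c
        return c

    print(full_predication_iteration(5, 3))  # predicate true  -> 8
    print(full_predication_iteration(2, 7))  # predicate false -> -5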
In a dual-issue scheme (Han, et al., 2013), each PE receives two instructions, one from the if-path and the other from the else-path, at each cycle. At run time, the PE executes only one of the instructions based on the predicate bit. Since an operation from the false path is not executed, a select operation is not required. Operations in the different paths producing the same output (e.g., ct and cf) are merged together to execute on the same PE, as illustrated, for example, by PE 254 at iteration 3 in
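A minimal Python sketch of this behavior, with an assumed two-opcode ALU, is given below; it is meant only to show how issuing an (if-path, else-path) instruction pair to one PE removes the need for a separate select operation.

    def alu(op, x, y):
        # Minimal ALU supporting only the opcodes used in this sketch.
        if op == "add":
            return x + y
        if op == "sub":
            return x - y
        return None  # nop

    def dual_issue_pe(instr_pair, predicate, x, y):
        # The PE receives two instructions per cycle, one from the if-path
        # and one from the else-path, and executes only the one selected
        # by the predicate bit; no separate select operation is needed.
        if_instr, else_instr = instr_pair
        return alu(if_instr if predicate else else_instr, x, y)

    # ct (add) and cf (sub) produce the same output, so they are merged
    # onto the same PE.
    print(dual_issue_pe(("add", "sub"), True, 5, 3))   # if-path:   8
    print(dual_issue_pe(("add", "sub"), False, 5, 3))  # else-path: 2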
3. Inefficiencies of Existing Techniques
The fundamental inefficiency of existing solutions in handling loops with control flow is that they do not utilize the knowledge of the branch outcome to reduce the overhead of branch execution—even after the branch outcome is known. For instance, the branch outcome is known at cycle 1 in the partial and full predication schemes (
The other limitation of the existing approaches is that the predicate value must be communicated to the PEs executing the if-path and the else-path operations. This communication is done either by storing the predicate value in the internal register of a PE or through the predicate network via routing. The need for this communication restricts where the conditional operations can be mapped. For instance, in partial predication, the select operation, c, can be mapped only to PEs in which the corresponding predicate value is available, and in the full predication scheme, the operations ct and cf must be mapped onto the same PE (e.g., PE 251 of
Loops that contain if-then-elses may be accelerated by fetching and executing only the instructions from the path taken by a branch at run time. This may be accomplished by determining the outcome of the if-then-elses before instructions are issued to any PEs in a CGRA and using the calculated outcome of the if-then-elses to control whether or not an instruction is issued. This process avoids unnecessarily issuing instructions that will never be executed because they are in the unexecuted branches of the if-then-elses. Compared to partial predication or full predication schemes, this approach avoids both issuing instructions for the unexecuted branch and executing the instructions contained in the branch that is not selected.
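The following Python sketch illustrates this idea at a high level. The per-path instruction streams and mnemonics are hypothetical; the point is that once the branch outcome is resolved, only the taken path's instructions are ever issued.

    # Hypothetical per-path instruction streams for one loop iteration;
    # in an actual CGRA these would be per-PE configuration words held
    # in instruction memory.
    IF_PATH_STREAM   = ["add r1, r2, r3", "mul r4, r1, r5"]
    ELSE_PATH_STREAM = ["sub r1, r2, r3", "shl r4, r1, 1"]

    def issue_iteration(branch_taken):
        # Because the branch outcome is resolved before any conditional-path
        # instruction is issued, the untaken path is never fetched or issued.
        stream = IF_PATH_STREAM if branch_taken else ELSE_PATH_STREAM
        for word in stream:
            print("issue:", word)

    issue_iteration(branch_taken=True)   # issues only the if-path words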
Some embodiments of the present invention contain two parts: (i) executing the branch condition as early as possible, and (ii) once the branch is computed, communicating its results to the Instruction Fetch Unit (IFU) of the CGRA, which then starts to fetch instructions from the correct path. Experimental results on accelerating loop kernels with if-then-else structures from the biobench (Albayraktaroglu, et al., 2005) and SPEC (Henning, et al., 2006) benchmarks using the present invention showed 34.6%, 36%, and 59.4% improvements in performance and 52.1%, 35.5%, and 53.1% lower energy consumption (CGRA power and power spent on instruction fetch operations) as compared to the dual-issue technique (Hamzeh, et al., 2014), the partial predication scheme (Han, et al., 2013), and the full predication scheme (Han and C., et al., 2013), respectively.
Some embodiments of the present computer program product comprise a non-transitory computer readable medium comprising code for performing the steps of: receiving at least one function executed by a computer program; resolving a branching condition to produce an outcome wherein at least one path of the branch is to be executed by a coarse grain reconfigurable array and at least one path of the branch is false; executing the branch condition in at least one processing element of the Coarse Grain Reconfigurable Array and communicating the branch outcome to the Instruction Fetch Unit; and selectively issuing instructions from the instruction memory for the at least one path to be executed. In some embodiments the non-transitory computer readable medium comprises code and hardware components for performing the step of communicating the branching condition outcome and a number of cycles required to execute the at least one path to the instruction fetch unit. In some embodiments the non-transitory computer readable medium comprises code for performing the step of utilizing a delay slot after the step of resolving the branching condition outcome to communicate the branching condition outcome to the instruction fetch unit. In some embodiments the non-transitory computer readable medium comprises code for performing the step of performing operations that are independent of the branching condition outcome in the delay slot. In some embodiments the non-transitory computer readable medium comprises code for performing the step of mapping operations to processing elements in the coarse grain reconfigurable array from both an if-path of the branching condition and an else-path of the branching condition. In some embodiments the non-transitory computer readable medium comprises code for performing the step of pairing the mapped operations from the if-path and the else-path based on a common variable. In some embodiments the non-transitory computer readable medium comprises code for performing the step of pairing a no op instruction with an if-path instruction when the else-path contains fewer operations than the if-path. In some embodiments the non-transitory computer readable medium comprises code for performing the step of pairing a no op instruction with an else-path instruction when the if-path contains fewer operations than the else-path. In some embodiments the non-transitory computer readable medium comprises code for performing the step of eliminating a select instruction or a phi operation based on the pairing by the common variable and based on the step of communicating which of the at least one paths of the branching condition is to be executed to the instruction fetch unit.
Some embodiments of the present apparatuses comprise a memory; and a processor coupled to the memory, wherein the processor is configured to execute the steps of: receiving at least one function (e.g., loop kernel) executed by a computer program; resolving a branching condition to produce an outcome wherein at least one path of the branch is to be executed by a coarse grain reconfigurable array and at least one path of the branch is false; executing the branch condition in at least one processing element of the Coarse Grain Reconfigurable Array and communicating the branch outcome to the Instruction Fetch Unit; and selectively issuing instructions from the instruction memory for the at least one path to be executed. In some embodiments, the processor is further configured to execute the step of communicating the branching condition outcome and a number of cycles required to execute the at least one path to the instruction fetch unit by minimum delay circuit components. In some embodiments, the processor is further configured to execute the step of utilizing a delay slot after the step of resolving the branching condition outcome to communicate the branching condition outcome to the instruction fetch unit. In some embodiments, the processor is further configured to execute the step of performing operations that are independent of the branching condition outcome in the delay slot. In some embodiments, the processor is further configured to execute the step of mapping operations to processing elements in the coarse grain reconfigurable array from both an if-path of the branching condition and an else-path of the branching condition. In some embodiments, the processor is further configured to execute the step of pairing the mapped operations from the if-path and the else-path based on a common variable. In some embodiments, the processor is further configured to execute the step of pairing a no op instruction with an if-path instruction when the else-path contains fewer operations than the if-path. In some embodiments, the processor is further configured to execute the step of pairing a no op instruction with an else-path instruction when the if-path contains fewer operations than the else-path. In some embodiments, the processor is further configured to execute the step of eliminating a select instruction or a phi operation based on the pairing by the common variable and based on the step of communicating which of the at least one paths of the branching condition is to be executed to the instruction fetch unit.
Some embodiments of the present methods comprise: receiving at least one function (e.g., loop kernel) executed by a computer program, wherein the function includes a branching condition; mapping at least two potential paths of the branching condition to at least one processing element in a coarse grain reconfigurable array by a compiler; resolving the branching condition to produce an outcome wherein at least one path of the branch is to be executed by a coarse grain reconfigurable array and at least one path of the branch is false; executing the branch condition in at least one processing element of the Coarse Grain Reconfigurable Array and communicating the branch outcome to the Instruction Fetch Unit; and selectively issuing instructions from the instruction memory for the at least one path to be executed.
Some embodiments of the present apparatuses comprise: a coarse grain reconfigurable array comprising at least two processing elements; an instruction fetch unit; at least one processing element configured to communicate a branch outcome to the instruction fetch unit; the branch outcome comprising at least a path to be taken; the instruction fetch unit further configured to issue instructions for the path taken. In some embodiments, at least one processing element (which also evaluates the branch condition) is configured to communicate a number of cycles required to execute the path to be taken to the instruction fetch unit.
As used herein in the specification, “a” or “an” may mean one or more. As used herein in the claim(s), when used in conjunction with the word “comprising”, the words “a” or “an” may mean one or more than one.
The term “or” in the claims is used to mean “and/or” unless explicitly indicated to refer to alternatives only or unless the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and “and/or.” As used herein, “another” may mean at least a second or more.
Throughout this application, the term “about” is used to indicate that a value includes the inherent variation of error for the device, the method being employed to determine the value, or the variation that exists among the study subjects.
The foregoing has outlined rather broadly certain features and technical advantages of some embodiments of the disclosure in order that the detailed description that follows may be better understood. Additional features and advantages will be described hereinafter that form the subject of the claims. It should be appreciated by those having ordinary skill in the art that the specific embodiments disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same or similar purposes. It should also be realized that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel features that are believed to be characteristic of the disclosure, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the disclosure.
The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.
Considering that only one path is taken at run time for the if-then-else construct, some embodiments of the present methods, systems, and apparatuses communicate the predicate (the result of the branch instruction) to the Instruction Fetch Unit (IFU) of the CGRA to selectively issue instructions only from the path taken by the branch at runtime, described herein as the Path Selection based Branch (PSB) technique.
According to one embodiment of the disclosure, for branch outcome based issuing of instructions, additional hardware support may be used such as that shown in
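The hardware of the referenced figure is not reproduced here; as a software-level illustration only, the Python model below shows one way an IFU could use the predicate and the per-path instruction counts to fetch only the taken path. The class name, fields, and address layout are assumptions made for the example.

    class InstructionFetchUnit:
        """Illustrative software model of an IFU that receives the predicate
        and the number of instruction words in each conditional path, then
        fetches from the corresponding region of instruction memory."""

        def __init__(self, imem, if_base, if_len, else_base, else_len):
            self.imem = imem
            self.if_base, self.if_len = if_base, if_len
            self.else_base, self.else_len = else_base, else_len

        def fetch_taken_path(self, predicate):
            # Select the base address and length of the taken path only.
            base, length = ((self.if_base, self.if_len) if predicate
                            else (self.else_base, self.else_len))
            return [self.imem[base + i] for i in range(length)]

    # Toy instruction memory: two words for the if-path, one for the else-path.
    imem = {0: "if_op0", 1: "if_op1", 2: "else_op0"}
    ifu = InstructionFetchUnit(imem, if_base=0, if_len=2, else_base=2, else_len=1)
    print(ifu.fetch_taken_path(predicate=True))   # ['if_op0', 'if_op1']
    print(ifu.fetch_taken_path(predicate=False))  # ['else_op0']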
Some embodiments may also include a compiler that maps operations from the loop kernel (including if-path, else-path, and select or phi operations) onto the PEs of the time-extended CGRA 524 (similar to the time-extended illustration of the 2×2 CGRA illustrated in
The result is better utilization of PE resources, with more PEs available to map operations from adjacent iterations, which facilitates the use of a modulo scheduling scheme to further improve performance. Compare this to a prior art system in which if-path and else-path operations are mapped onto different PEs, such as that shown in
Hence, by pairing operations from the if-path and from the else-path to form fused nodes, e.g., nodes 510, 514, 518, and 522 in
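A deliberately simple Python sketch of the resource saving follows: each fused node carries one slot per path, the shorter path is padded with nops, and the number of PE slots needed drops from the sum of the two path lengths to their maximum. (The disclosed compiler additionally respects output variables and data dependences when pairing; the positional pairing here is only for illustration.)

    from itertools import zip_longest

    def fuse_positionally(if_ops, else_ops):
        # Pair if-path and else-path operations slot by slot; pad the
        # shorter path with nops so every fused node has one slot per path.
        return list(zip_longest(if_ops, else_ops, fillvalue="nop"))

    if_ops, else_ops = ["i0", "i1", "i2"], ["e0", "e1"]
    fused = fuse_positionally(if_ops, else_ops)
    print(fused)                        # [('i0','e0'), ('i1','e1'), ('i2','nop')]
    print(len(if_ops) + len(else_ops))  # 5 PE slots without fusion
    print(len(fused))                   # 3 PE slots with fusion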
Since pairing of operations from the if-path and the else-path results in improved resource utilization and performance, some embodiments address obtaining a valid pairing of operations. The pairing may ensure the correct functionality of the loop kernel. The problem of optimal pairing may be formulated as finding a transformation T(D) = P from the input Data Flow Graph (DFG) D = (N, E) to an output DFG P = (M, R) with fused nodes, with the objective of minimizing |M| (N and M represent the sets of nodes in D and P, respectively) while retaining the correct functionality. The description below explains one embodiment of compiler techniques or problem formulation that may be used in accordance with the present techniques to optimize PE utilization and performance in a CGRA.
Input:
DFG: D = (N, E) may be a data flow graph that represents the loop kernel to be processed, where the set of vertices N comprises the operations in the loop kernel, and for any two vertices u, v ∈ N, e = (u, v) ∈ E if and only if the operation corresponding to v is data dependent or predicate dependent on the operation u. For a loop with control flow, N = N_if ∪ N_else ∪ N_other, where N_if is the set of nodes representing the operations in the if-path and likewise N_else for the else-path. N_other is the set of nodes representing operations not in the if-path or the else-path and includes select operations.
Output:
DFG: P = (M, R): where M may be the set of nodes in the transformed DFG representing the operations in the loop kernel, with M = M_fused ∪ M_other. The nodes in M_fused represent the fused nodes. In some embodiments, each fused node m ∈ M_fused may be a tuple m = (m_if, m_else), where m_if ∈ N_if ∪ {nop} and m_else ∈ N_else ∪ {nop}. For nodes x, y ∈ M_fused, r = (x, y) ∈ R if and only if there is an edge e_if = (x_if, y_if) ∈ E or an edge e_else = (x_else, y_else) ∈ E. For nodes x_other ∈ M_other, y ∈ M_fused, r = (x_other, y) ∈ R if and only if there is an edge e_if = (x_other, y_if) ∈ E or an edge e_else = (x_other, y_else) ∈ E, where x_other ∈ N_other. For nodes x ∈ M_fused, y_other ∈ M_other, r = (x, y_other) ∈ R if and only if there is an edge e_if = (x_if, y_other) ∈ E or an edge e_else = (x_else, y_other) ∈ E, where y_other ∈ N_other.
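As an illustration of how the edge set R of the output DFG can be derived from the edge set E of the input DFG, the short Python sketch below applies a simplified version of the rules above: it connects two output nodes whenever any of their constituent operations were connected in the input DFG. The node names are illustrative only.

    def build_output_edges(E, fused_of):
        # Map every original operation to its node in the output DFG
        # (its fused node if it was paired, otherwise itself), and keep
        # an edge whenever the two endpoints land on different nodes.
        R = set()
        for (u, v) in E:
            mu = fused_of.get(u, u)
            mv = fused_of.get(v, v)
            if mu != mv:
                R.add((mu, mv))
        return R

    # Toy input DFG: s is a node in N_other (e.g., the branch condition);
    # ct/cf are fused into node m0 and dt into node m1.
    fused_of = {"ct": "m0", "cf": "m0", "dt": "m1"}
    E = {("s", "ct"), ("s", "cf"), ("ct", "dt")}
    print(sorted(build_output_edges(E, fused_of)))
    # -> [('m0', 'm1'), ('s', 'm0')]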
Valid Output:
The output DFG P obtained after transformation is valid if and only if, for any two vertices x, y with x = (x_if, x_else), y = (y_if, y_else) ∈ M_fused and r = (x, y) ∈ R: if there is a path from x_if to y_if, then there is no path (intra-iteration) from y_else to x_else; and if there is a path from x_else to y_else, then there is no path (intra-iteration) from y_if to x_if in the original input DFG. However, recurrence paths satisfying inter-iteration dependencies are valid.
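This condition can be checked directly on the input DFG. The Python sketch below does so for a single edge of the transformed DFG, using plain DFS reachability over intra-iteration edges; the node names mirror the proof-of-correctness discussion below.

    def has_path(adj, src, dst, seen=None):
        # DFS reachability over intra-iteration edges of the input DFG.
        if src == dst:
            return True
        seen = seen or set()
        seen.add(src)
        return any(has_path(adj, n, dst, seen)
                   for n in adj.get(src, ()) if n not in seen)

    def pairing_is_valid(adj, x, y):
        # x and y are fused nodes given as (if_op, else_op) tuples that are
        # connected by an edge in the transformed DFG.  The pairing is
        # invalid if following the edge forward on one path would require
        # following it backward on the other path within the same iteration.
        x_if, x_else = x
        y_if, y_else = y
        if has_path(adj, x_if, y_if) and has_path(adj, y_else, x_else):
            return False
        if has_path(adj, x_else, y_else) and has_path(adj, y_if, x_if):
            return False
        return True

    # Toy intra-iteration dependences: xt -> yt on the if-path and
    # xf -> yf on the else-path.
    adj = {"xt": ["yt"], "xf": ["yf"]}
    print(pairing_is_valid(adj, ("xt", "xf"), ("yt", "yf")))  # True  (consistent order)
    print(pairing_is_valid(adj, ("xt", "yf"), ("yt", "xf")))  # False (orders conflict)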
Optimization:
Some embodiments seek to minimize |M| under the constraints of a valid output. |M_fused| can be minimized by minimizing the number of nops used to form pairs. |M_other| can be minimized by eliminating the eligible select or phi operations that belong to N_other.
1) Select/Phi Operation Elimination:
A select operation is used to select the output of a variable that is updated in both paths. If the if-path operation and the else-path operation updating the same variable are paired to form a fused node, there is no need for a select operation, since at run time only one of the operations is executed; the output of the fused node has the right value after execution. For example, DFG 700a in
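A small Python sketch of this elimination rule follows; the select and operation names are made up for the example, and the rule simply drops any select whose two inputs have been fused into a single node.

    def eliminate_selects(selects, fused_pairs):
        # 'selects' maps each select operation to the (if_op, else_op) pair
        # it chooses between; a select is kept only if that pair was not
        # fused into one node.
        fused = set(fused_pairs)
        return {s: pair for s, pair in selects.items() if pair not in fused}

    selects = {"sel_c": ("ct", "cf"), "sel_d": ("dt", "df")}
    fused_pairs = [("ct", "cf")]          # ct and cf update the same variable c
    print(eliminate_selects(selects, fused_pairs))
    # -> {'sel_d': ('dt', 'df')}  (sel_c is no longer needed)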
In summary, to improve performance and energy efficiency, the present invention utilizes the branch outcome to issue instructions only from the path taken. This eliminates the fetching and execution of unnecessary operations and the need for predicate communication, thereby overcoming the inefficiencies associated with existing techniques.
The process of creating a DFG from the CFG (Control Flow Graph) of a loop is presented in (Johnson & Pingali, 1993). The operations from the if-path and the else-path form the sets of operations N_if and N_else, respectively. The algorithm for forming the DFG with fused nodes is shown in Algorithm 1. According to one embodiment of the disclosure,
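For orientation only, the Python sketch below gives a greedy approximation of fused-node formation; it is not the disclosed Algorithm 1. Operations that write the same variable are paired first, and the remaining operations of either path are fused with nops. The dependence and validity checks defined above are omitted for brevity.

    def form_fused_dfg(N_if, N_else, out_var):
        # out_var maps an operation name to the variable it writes;
        # names are illustrative.
        else_by_var = {}
        for op in N_else:
            else_by_var.setdefault(out_var[op], op)
        fused, used_else = [], set()
        for op in N_if:
            partner = else_by_var.get(out_var[op])
            if partner is not None and partner not in used_else:
                fused.append((op, partner))      # pair by common output variable
                used_else.add(partner)
            else:
                fused.append((op, "nop"))        # no partner: pad with a nop
        fused.extend(("nop", op) for op in N_else if op not in used_else)
        return fused

    out_var = {"ct": "c", "dt": "d", "cf": "c", "ef": "e"}
    print(form_fused_dfg(["ct", "dt"], ["cf", "ef"], out_var))
    # -> [('ct', 'cf'), ('dt', 'nop'), ('nop', 'ef')]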
Proof of Correctness:
For nodes x_t, y_t ∈ N_if and x_f, y_f ∈ N_else, consider the partial orders x_t < y_t and x_f < y_f, meaning that y_t and y_f cannot be scheduled earlier than x_t and x_f, respectively. An example of bad scheduling is shown by the incorrect pairings 632, containing x_t and y_f, and 628, containing y_t and x_f, in
Support for Nested Conditionals:
PSB provides the maximum performance improvement when the number of operations in the conditional path is large. Hence, for nested conditionals, the formation of fused nodes is done for the outermost conditional block. The number of operations in the inner nests is typically small and hence can acceptably be handled by partial predication (Han, et al., 2013), which is preferred over full predication to alleviate the tight restrictions on mapping. The if-path and else-path operations of the fused nodes inherently contain their respective path's inner conditionals and their operations.
4. Experimental Results
A. PSB Achieves a Lower Initiation Interval (II) Compared to Existing Techniques to Accelerate Control Flow.
To evaluate the performance of PSB, a CGRA was modelled as an accelerator in the Gem5 system simulation framework (Binkert, et al., 2011) and integrated with one embodiment of a PSB compiler technique implemented as a separate pass in the LLVM compiler framework (Lattner & Adve, 2004). The DFG obtained after PSB transformation was mapped using the REGIMap mapping algorithm (Hamzeh, et al., 2013), modified to accommodate the delay slot required for correct functioning. Computational loops with control flow were extracted from the SPEC2006 (Henning, 2006) and biobench (Albayraktaroglu, 2005) benchmarks after −O3 optimization in LLVM. The loops were mapped on a 4×4 torus-interconnected CGRA with sufficient instruction and data memory.
The full predication scheme (fill 912) presented in (Han, et al., 2013) has the lowest performance due to the tight restriction on the mapping of operations in the conditional path. Such operations must be mapped only to the PE in which the predicate value is available, which increases the schedule length and ultimately the II. The partial predication scheme (fill 908) performs better because it is devoid of such restrictions; its overhead is the introduction of select operations. Even though the dual-issue scheme (fill 916) (Han, et al., 2010) eliminates the execution of unnecessary operations, it suffers from mapping restrictions due to the overhead of communicating the predicate to all the merged nodes. The performance of one embodiment of the presently disclosed PSB compiler technique depends on the size of the if-then-else. For kernels in which the number of operations in the conditional path is large (51% of the operations in tree, gapaling, and gcc are in the conditional path), there was a very significant improvement in II (up to a 25% reduction in node count and a 45% reduction in edge count on average due to the pairing of operations by PSB), averaging 62% better than the other techniques. For benchmarks with smaller if-then-elses, the tested embodiment achieved a moderate reduction in II (11% in sphinx3, fasta, and calculix). In these cases, the fraction of operations in the conditional path was smaller (35%), which led to smaller reductions in DFG size: 15% in node count and 23% in edge count. Therefore, PSB is particularly well suited for loop kernels with a relatively large number of operations in the conditional path, but it is effective for other loop kernel sizes as well. By executing operations only from the path taken and eliminating the predicate communication overhead, the tested PSB embodiment overcame the inefficiencies associated with existing techniques and achieved performance improvements of 34.6%, 36%, and 59.4% on average compared to the state-of-the-art dual-issue scheme (Hamzeh, et al., 2014), the partial predication scheme (Mahlke, et al., 1995), and the State-based Full Predication (SFP) scheme presented in (Han, et al., 2013), respectively.
B. Area and Frequency of PSB Architecture Embodiments
One modelled embodiment implemented the RTL model of a 4×4 CGRA, including an IFU, with torus interconnection. Since all PEs in this embodiment have symmetrical interconnections, a single designated PE was connected to the IFU in the PSB architecture. Other embodiments may contain more than one IFU or more than one PE connected to one or more IFUs. A mapping generated for a generic 4×4 CGRA template can be panned across the CGRA template so as to allocate the branch operation to the designated PE. This is not a restriction in mapping because of the symmetry of the interconnection. For multiple independent branches, predicates can be communicated to the designated PE through the predicate network and then to the IFU. The RTL model embodiments were synthesized at the 65 nm node using the RTL Compiler tool, functionally verified, and placed and routed using Cadence Encounter. Results are tabulated in Table I. The disclosed PSB architecture does not incur any significant hardware overhead.
C. PSB Embodiments have Lower Energy Consumption
To evaluate energy consumption, the dynamic power for each type of PE operation (ALU, routing, or IDLE) was estimated from (Kim & Lee, 2012) and scaled to fit the synthesized RTL. The power for an instruction fetch operation was modelled with a 2 kb configuration cache at the 65 nm node using the CACTI 5.3 tool (CACTI, 2008). The total energy spent in executing a kernel of each benchmark under each scheme (partial predication, full predication, and dual-issue) was modelled as a function of the energy spent per PE per cycle, which depends upon the type of operation, and of the instruction fetch power.
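A minimal sketch of such an energy model is given below; the per-operation and per-fetch energy values are placeholders (assumptions), not figures from (Kim & Lee, 2012) or CACTI.

    # Placeholder per-operation and per-fetch energies (assumed values in pJ).
    ENERGY_PER_OP = {"ALU": 5.0, "ROUTE": 2.0, "IDLE": 0.5}
    ENERGY_PER_FETCH = 10.0

    def kernel_energy(op_counts, fetch_count):
        # Total energy = sum over PE operations of the per-type energy,
        # plus the energy of all instruction fetch operations.
        pe_energy = sum(ENERGY_PER_OP[t] * n for t, n in op_counts.items())
        return pe_energy + ENERGY_PER_FETCH * fetch_count

    # Example: a kernel using 12 ALU ops, 6 routing ops, 2 idle slots,
    # and 16 instruction fetches.
    print(kernel_energy({"ALU": 12, "ROUTE": 6, "IDLE": 2}, fetch_count=16))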
The following examples are included to demonstrate preferred embodiments of the invention. It should be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function well in the practice of the invention, and thus can be considered to constitute preferred modes for its practice. However, those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit and scope of the invention.
Some embodiments of the present methods comprise: receiving at least one function executed by a computer program; resolving a branching condition to produce an outcome wherein at least one path of the branch is to be executed by a coarse grain reconfigurable array (e.g., CGRA 1204 in
Some embodiments of the present apparatuses comprise a memory; and a processor coupled to the memory, wherein the processor is configured to execute the steps of: receiving at least one function (e.g., loop kernel) executed by a computer program; resolving a branching condition to produce an outcome wherein at least one path of the branch is to be executed by a coarse grain reconfigurable array and at least one path of the branch is false; executing the branch condition in at least one processing element of the Coarse Grain Reconfigurable Array and communicating the branch outcome to the Instruction Fetch Unit; and selectively issuing instructions from the instruction memory for the at least one path to be executed. In some embodiments, the processor is further configured to execute the step of communicating the branching condition outcome and a number of cycles required to execute the at least one path to the instruction fetch unit by minimum delay circuit components. In some embodiments, the processor is further configured to execute the step of utilizing a delay slot after the step of resolving the branching condition outcome to communicate the branching condition outcome to the instruction fetch unit. In some embodiments, the processor is further configured to execute the step of performing operations that are independent of the branching condition outcome in the delay slot. In some embodiments, the processor is further configured to execute the step of mapping operations to processing elements in the coarse grain reconfigurable array from both an if-path of the branching condition and an else-path of the branching condition. In some embodiments, the processor is further configured to execute the step of pairing the mapped operations from the if-path and the else-path based on a common variable. In some embodiments, the processor is further configured to execute the step of pairing a no op instruction with an if-path instruction when the else-path contains fewer operations than the if-path. In some embodiments, the processor is further configured to execute the step of pairing a no op instruction with an else-path instruction when the if-path contains fewer operations than the else-path. In some embodiments, the processor is further configured to execute the step of eliminating a select instruction or a phi operation based on the pairing by the common variable and based on the step of communicating which of the at least one paths of the branching condition is to be executed to the instruction fetch unit.
Some embodiments of the present methods comprise: receiving at least one function (e.g., loop kernel) executed by a computer program, wherein the function includes a branching condition; mapping at least two potential paths of the branching condition to at least one processing element in a coarse grain reconfigurable array by a compiler; resolving the branching condition to produce an outcome wherein at least one path of the branch is to be executed by a coarse grain reconfigurable array and at least one path of the branch is false; executing the branch condition in at least one processing element of the Coarse Grain Reconfigurable Array and communicating the branch outcome to the Instruction Fetch Unit; and selectively issuing instructions from the instruction memory for the at least one path to be executed.
Some embodiments of the present apparatuses comprise: a coarse grain reconfigurable array comprising at least two processing elements; an instruction fetch unit; at least one processing element configured to communicate a branch outcome to the instruction fetch unit; the branch outcome comprising at least a path to be taken; the instruction fetch unit further configured to issue instructions for the path taken. In some embodiments, at least one processing element (which also evaluates the branch condition) is configured to communicate a number of cycles required to execute the path to be taken to the instruction fetch unit.
All of the methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the compositions and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the invention. More specifically, it will be apparent that certain agents which are both chemically and physiologically related may be substituted for the agents described herein while the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the invention as defined by the appended claims.
The following references, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference.
This application claims priority to U.S. Provisional Application No. 62/118,383 filed Feb. 19, 2015, which is specifically incorporated herein by reference without disclaimer.