The present disclosure relates to systems and methods for efficiently accelerating general purpose applications with complicated loops using coarse-grained reconfigurable arrays.
Accelerators are used to accelerate specialized or computationally-intensive sections of applications. A coarse-grained reconfigurable array (CGRA) is one type of accelerator that is programmable, yet power efficient. While CGRAs have conventionally been used for special-purpose applications, their programmability and power efficiency have resulted in a push to use CGRAs for general-purpose applications. However, general-purpose applications often include computationally-intensive loops featuring several levels of nested loops and conditionals. Several compiler techniques have been developed to map loops and conditionals onto a CGRA in an efficient manner. However, these techniques have not been able to efficiently map complex loops and conditionals (e.g., nested loops and loops containing nested conditionals, etc.) onto a CGRA. Accordingly, there is a need for systems and methods for mapping complex loops and conditionals onto a CGRA in an efficient manner.
In one embodiment, a coarse-grained reconfigurable array includes a processing element array, instruction memory circuitry, data memory circuitry, and an instruction fetch unit. The processing element array includes a number of processing elements. The instruction memory circuitry is coupled to the processing element array and configured to store a set of instructions. During each one of a number of processing cycles, the instruction memory circuitry provides instructions from the set of instructions to the processing elements based on instruction fetch signals. The instruction fetch unit is coupled to the processing element array and the instruction memory circuitry and configured to receive a result of a conditional instruction evaluated by one of the processing elements and provide the instruction fetch signals based at least in part on the result of the conditional instruction such that only instructions associated with a correct branch of the conditional instruction are provided to the processing elements. Because the processing element array communicates the result of the conditional instruction to the instruction fetch unit, only the instructions on the correct conditional path of a loop are evaluated, thereby increasing the efficiency of the coarse-grained reconfigurable array.
Those skilled in the art will appreciate the scope of the present disclosure and realize additional aspects thereof after reading the following detailed description of the preferred embodiments in association with the accompanying drawing figures.
The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure, and together with the description serve to explain the principles of the disclosure.
The embodiments set forth below represent the necessary information to enable those skilled in the art to practice the embodiments and illustrate the best mode of practicing the embodiments. Upon reading the following description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure and the accompanying claims.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
It will be understood that when an element such as a layer, region, or substrate is referred to as being “on” or extending “onto” another element, it can be directly on or extend directly onto the other element or intervening elements may also be present. In contrast, when an element is referred to as being “directly on” or extending “directly onto” another element, there are no intervening elements present. Likewise, it will be understood that when an element such as a layer, region, or substrate is referred to as being “over” or extending “over” another element, it can be directly over or extend directly over the other element or intervening elements may also be present. In contrast, when an element is referred to as being “directly over” or extending “directly over” another element, there are no intervening elements present. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected” or “directly coupled” to another element, there are no intervening elements present.
Relative terms such as “below” or “above” or “upper” or “lower” or “horizontal” or “vertical” may be used herein to describe a relationship of one element, layer, or region to another element, layer, or region as illustrated in the Figures. It will be understood that these terms and those discussed above are intended to encompass different orientations of the device in addition to the orientation depicted in the Figures.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including” when used herein specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The instruction memory circuitry 16 stores a set of instructions to be evaluated by the CGRA 10. As discussed below, the set of instructions may include one or more loops. The data memory circuitry 14 stores data that may be operated on by the set of instructions. During each one of a number of processing cycles, instructions from the instruction memory circuitry 16, and, as necessary, data from the data memory circuitry 14, are provided to the processing elements 20. Specifically, in a single processing cycle each one of the processing elements 20 receives an instruction to evaluate from the instruction memory circuitry 16. Based on mapping performed by a compiler, some processing elements 20 receive useful operations to evaluate and other processing elements 20 receive no-operation (no-op) instructions. When data is acted on by the instruction of one of the processing elements 20, the data may be provided from the data memory circuitry 14, from another processing element 20, or from a register file in the processing element 20 (stored during a previous processing cycle). Each processing element 20 evaluates the instruction provided to it and may provide a result of the evaluation to a register file in the processing element 20, to another processing element 20, to the data memory circuitry 14, to the instruction fetch unit 18, or to multiple ones of these destinations. The instruction fetch unit 18 provides instruction fetch signals to the instruction memory circuitry 16. These instruction fetch signals determine the instructions provided to the processing elements 20 during a processing cycle.
In conventional CGRAs, instruction memory circuitry provides instructions to the processing elements in a sequential manner. In other words, instructions from the instruction memory circuitry are provided to the processing elements in each processing cycle (i.e., each CGRA cycle) in exactly the order in which they are stored. Accordingly, the instructions provided to the processing elements during a processing cycle are determined only by the layout of the instructions in memory. Such an approach limits the performance gain of conventional CGRAs over standard processors. Since there is no way to dynamically provide instructions to the processing elements based on the result of instructions evaluated by the processing elements, conventional CGRAs must rely upon full predication or partial predication to evaluate loops with conditionals. Full predication and partial predication require the processing elements to evaluate every instruction in every path of a conditional, and further introduce significant overhead in the form of additional select instructions. This limits the efficiency of conventional CGRAs when evaluating loops with conditionals, and is especially true when the loops and conditionals are complex (e.g., nested).
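For illustration only, the relative cost of the two approaches can be sketched with a toy cycle-count model. The disclosure describes hardware, not software; the function names and the use of Python below are purely illustrative assumptions.

```python
# Toy cycle-count model (illustrative assumption, not the disclosed hardware).
# Full predication issues every instruction on both paths of a conditional,
# plus a select instruction to choose the surviving value. Branch skipping
# issues only the taken path, with branches padded to a common length.

def cycles_full_predication(true_path_len, false_path_len):
    # Both paths are issued every iteration, plus one select instruction.
    return true_path_len + false_path_len + 1

def cycles_with_skipping(true_path_len, false_path_len):
    # Branches are padded to equal length; only the taken branch is fetched.
    return max(true_path_len, false_path_len)

print(cycles_full_predication(3, 2))  # 6 cycles under full predication
print(cycles_with_skipping(3, 2))     # 3 cycles when the untaken path is skipped
```

Even in this simplified model, the gap widens as conditionals nest, since predication multiplies the issued paths while skipping always fetches exactly one.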
In contrast to conventional CGRAs, the instruction fetch unit 18 of the CGRA 10 discussed herein provides instruction fetch signals that allow instructions stored in the instruction memory circuitry 16 to be skipped rather than provided to the processing elements 20 in subsequent processing cycles. More specifically, the instruction fetch unit 18 provides instruction fetch signals to the instruction memory circuitry 16 such that instructions are dynamically provided based on the result of one or more conditional instructions evaluated by the processing elements 20.
Details of the instruction memory circuitry 16 and the instruction fetch unit 18 are shown in
In operation, the instruction memory 22 stores the set of instructions. The instructions are laid out in a specific manner when compiled, and include instruction skip values associated with conditional instructions. Details regarding how instructions are skipped are discussed below. In general, instructions are skipped based on the outcome of conditional statements evaluated by one or more of the processing elements 20 at runtime. The fetch signal generator circuitry 26 generates instruction fetch signals, which are provided to the instruction memory 22 and cause instructions to be loaded from the instruction memory 22 into the instruction buffer 24. When a conditional instruction is loaded into the instruction buffer 24, the instruction (or a reference to the instruction) and an instruction skip value associated with the instruction are provided from the instruction memory 22 to the conditional lookaside buffer 28. Eventually, the conditional instruction is provided to one of the processing elements 20 in the processing element array 12. When the conditional instruction is evaluated by the processing element 20, the result is provided to the fetch signal generator circuitry 26. The fetch signal generator circuitry 26 uses the result of the conditional instruction along with the instruction skip value for the conditional instruction in the conditional lookaside buffer 28 to determine a number of instructions in the instruction memory 22 to skip, thereby allowing only those instructions associated with a single conditional branch to be evaluated. This may be done an arbitrary number of times, allowing these principles to be applied to nested conditionals and nested loops. Instructions associated with conditional branches that are never reached during execution are not evaluated, and no select instructions are needed.
Accordingly, the CGRA 10 is able to efficiently evaluate applications with arbitrarily nested loops and loops with nested conditionals.
As will be discussed below, the set of instructions is laid out in the instruction memory 22 such that the number of processing cycles required to fully evaluate each branch created by a conditional instruction is the same (i.e., the branches are symmetric). Further, an instruction skip value is associated with each conditional instruction. If the result of the conditional instruction is true, the fetch signal generator circuitry 26 provides instruction fetch signals such that the instruction memory 22 loads only the instructions associated with the true branch of the conditional instruction from the set of instructions into the instruction buffer 24 and skips the instructions associated with the false branch of the conditional instruction (does not load them into the instruction buffer 24). Alternatively, if the result of the conditional instruction is false, the fetch signal generator circuitry 26 provides instruction fetch signals such that the instruction memory 22 skips the instructions associated with the true branch of the conditional instruction (does not load them into the instruction buffer 24) and loads the instructions associated with the false branch of the conditional instruction from the set of instructions into the instruction buffer 24.
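A minimal sketch of this fetch behavior follows, under the assumption that the true branch occupies the lines immediately after the conditional and the false branch occupies the lines after that; the function and variable names are illustrative, not part of the disclosure.

```python
def next_fetch_lines(cond_line, skip, result):
    """Illustrative model of the fetch signal generator circuitry 26:
    given the line holding a conditional instruction, its instruction
    skip value (the length, in lines, of each symmetric branch), and
    the evaluated result, return the instruction lines fetched next."""
    true_start = cond_line + 1        # true branch follows the conditional
    false_start = true_start + skip   # false branch follows the true branch
    rejoin = false_start + skip       # both branches rejoin here
    if result:
        # Fetch the true branch, then skip over the false branch.
        return list(range(true_start, false_start)) + [rejoin]
    # Skip over the true branch, then fetch the false branch.
    return list(range(false_start, rejoin)) + [rejoin]

# A conditional in line 2 with a skip value of one: when true, line 3 is
# fetched and line 4 is skipped; when false, line 3 is skipped instead.
print(next_fetch_lines(2, 1, True))   # [3, 5]
print(next_fetch_lines(2, 1, False))  # [4, 5]
```

Because the skipped span is just an offset, the same computation can be applied recursively for nested conditionals, each with its own skip value.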
The operations in the data dependency graph shown in
In the first line, a first processing element PE1 operates on variable d. Specifically, the value of d is fetched from memory. A second processing element PE2 evaluates h, which, as described above, is the value of the conditional x % i == 1. The value of h is provided to the instruction fetch unit 18 as discussed above. A third processing element PE3 remains idle. As discussed herein, remaining idle may be equivalent to performing a no-op instruction. A fourth processing element PE4 operates on variable b, specifically fetching the value of b from memory.
In the second line, the first processing element PE1 performs the operation on d specified by the input code if h is true. Specifically, the first processing element PE1 performs the operation d+=0. The second processing element PE2 evaluates g, which, as discussed above, is the value of the conditional y % i == 1. The value of g is provided to the instruction fetch unit 18 as discussed above. The third processing element PE3 operates on variable c, specifically fetching the value of c from memory. The fourth processing element PE4 operates on variable a, specifically fetching the value of a from memory.
In the third line, the first processing element PE1 performs the operation on a specified by the input code if both h and g are true. Specifically, the first processing element PE1 performs the operation a+=0. The second processing element PE2 remains idle. The third processing element PE3 performs the operation on c specified by the input code if both h and g are true. Specifically, the third processing element performs the operation c+=0. The fourth processing element PE4 performs the operation on b specified by the input code if both h and g are true. Specifically, the fourth processing element PE4 performs the operation b+=0.
In the fourth line, the first processing element PE1 performs the operation on a specified by the input code if h is true and g is false. Specifically, the first processing element PE1 performs the operation a=a+1. The second processing element PE2 remains idle. The third processing element PE3 performs a no-op. The fourth processing element PE4 performs the operation on b specified by the input code if h is true and g is false. Specifically, the fourth processing element PE4 performs the operation b=b+1.
In the fifth line, the first processing element PE1 performs the operation on d specified by the input code if h is false. Specifically, the first processing element PE1 performs the operation d=d+1. The second processing element PE2 performs a no-op. The third processing element PE3 operates on variable c, specifically fetching the value of c from memory. The fourth processing element PE4 operates on variable a, specifically fetching the value of a from memory.
In the sixth line, the first processing element PE1 retains the value of a (does not operate on a). The second processing element PE2 remains idle. The third processing element PE3 retains the value of c. The fourth processing element PE4 retains the value of b.
In the seventh line, the first processing element PE1, the third processing element PE3, and the fourth processing element PE4 perform no-ops, while the second processing element PE2 remains idle.
In the eighth line, the loop begins again. The first processing element PE1 remains idle. The second processing element PE2 performs an operation on i, specifically incrementing the value of i. The third processing element PE3 remains idle. The fourth processing element PE4 operates on variable a, specifically receiving the value of a evaluated in a previous processing cycle by another processing element.
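The eight-line schedule walked through above can be summarized as a table with one row per line (processing cycle) and one column per processing element. The mnemonics below are illustrative shorthand only, not an instruction encoding from the disclosure.

```python
# Rows are lines 1-8; columns are PE1, PE2, PE3, PE4.
# "idle" and "nop" mark cycles with no useful operation for that element.
schedule = [
    # PE1         PE2              PE3        PE4
    ["load d",   "h = x%i == 1",  "idle",    "load b"],    # line 1
    ["d += 0",   "g = y%i == 1",  "load c",  "load a"],    # line 2 (h true)
    ["a += 0",   "idle",          "c += 0",  "b += 0"],    # line 3 (h and g true)
    ["a = a+1",  "idle",          "nop",     "b = b+1"],   # line 4 (h true, g false)
    ["d = d+1",  "nop",           "load c",  "load a"],    # line 5 (h false)
    ["keep a",   "idle",          "keep c",  "keep b"],    # line 6
    ["nop",      "idle",          "nop",     "nop"],       # line 7
    ["idle",     "i = i+1",       "idle",    "recv a"],    # line 8 (loop restart)
]

assert len(schedule) == 8 and all(len(row) == 4 for row in schedule)
```

Reading down a column gives one processing element's work across the loop body; reading across a row gives the instructions issued in one processing cycle.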
The instructions in lines 2-4 and 5-7 represent fused nodes in the data dependency graph shown in
Lines 3 and 4 represent the instructions of the two branches of conditional instruction g. If g is true, line 3 should be provided to the processing elements 20 and evaluated. If g is false, line 4 should be provided to the processing elements 20 and evaluated. The instruction skip value for this conditional is one, since each branch includes one line and thus requires one processing cycle to complete.
As shown, when a particular branch is asymmetrical (e.g., when a variable is acted upon only in one state of a conditional), a no-op is paired with the operation in one branch so that the branches are equal in length. One example of this is shown for the third processing element PE3 in lines 3 and 4. Referring back to
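That padding step can be sketched as follows, under the assumption that a branch is simply a list of instruction lines; the helper name and no-op placeholder are hypothetical.

```python
NOP = "nop"  # placeholder no-op instruction (illustrative)

def pad_symmetric(true_branch, false_branch):
    """Pad the shorter branch with no-op lines so both branches take the
    same number of processing cycles; the common length then serves as a
    single instruction skip value valid for either outcome."""
    skip = max(len(true_branch), len(false_branch))
    pad = lambda branch: branch + [NOP] * (skip - len(branch))
    return pad(true_branch), pad(false_branch), skip

# An operation appearing only in the true branch (as for the third
# processing element PE3 in lines 3 and 4) gets a paired no-op in the
# false branch, keeping the branches equal in length.
print(pad_symmetric(["c += 0"], []))  # (['c += 0'], ['nop'], 1)
```

The cost of the padding is bounded by the longer branch, which the skipping scheme fetches anyway; the shorter branch's no-ops are the price of a single, statically known skip value.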
Referring back to the CGRA 10 discussed above, the instructions illustrated in
Notably, while the instructions shown in
The set of instructions illustrated above in
In the first line, a first processing element PE1 operates on variable a. Specifically, the first processing element PE1 performs the operation a=a+1. A second processing element PE2 operates on variable b. Specifically, the second processing element PE2 performs the operation b=b+1. A third processing element PE3 remains idle. A fourth processing element PE4 operates on variable i. Specifically, the fourth processing element PE4 performs the operation i++.
In the second line, the first processing element PE1 remains idle. The second processing element PE2 operates on variable c. Specifically, the second processing element PE2 performs the operation c=a*b. The third processing element PE3 operates on variable d. Specifically, the third processing element PE3 performs the operation d=b*2. The fourth processing element PE4 evaluates cmp, which is the value of the conditional x>i.
In the third line, the first processing element PE1 operates on variable a. Specifically, the first processing element PE1 performs the operation a=a+1. The second processing element PE2 operates on variable b. Specifically, the second processing element PE2 performs the operation b=b+1. The third processing element PE3 operates on variable e. Specifically, the third processing element PE3 performs the operation on e if cmp is true, which is e=c+1. The fourth processing element PE4 operates on variable i. Specifically, the fourth processing element PE4 performs the operation i++.
In the fourth line, the first processing element PE1 operates on variable a. Specifically, the first processing element PE1 performs the operation a=a+1. The second processing element PE2 operates on variable b. Specifically, the second processing element PE2 performs the operation b=b+1. The third processing element PE3 operates on variable e. Specifically, the third processing element PE3 performs the operation on e if cmp is false, which is e=d+1. The fourth processing element PE4 operates on variable i. Specifically, the fourth processing element PE4 performs the operation i++.
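As with the earlier example, this four-line schedule can be summarized as a table (illustrative mnemonics only); lines 3 and 4 hold the true and false branches of cmp, respectively.

```python
# Rows are lines 1-4; columns are PE1, PE2, PE3, PE4.
schedule = [
    # PE1        PE2         PE3         PE4
    ["a = a+1", "b = b+1",  "idle",     "i++"],          # line 1
    ["idle",    "c = a*b",  "d = b*2",  "cmp = x > i"],  # line 2
    ["a = a+1", "b = b+1",  "e = c+1",  "i++"],          # line 3 (cmp true)
    ["a = a+1", "b = b+1",  "e = d+1",  "i++"],          # line 4 (cmp false)
]

assert len(schedule) == 4 and all(len(row) == 4 for row in schedule)
```

Lines 3 and 4 differ only in the third processing element's instruction, so an instruction skip value of one selects between them based on the result of cmp.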
The instructions shown on lines 3 and 4 represent fused nodes in the data dependency graph shown in
Referring back to the CGRA 10 discussed above, the instructions related to
Those skilled in the art will recognize improvements and modifications to the preferred embodiments of the present disclosure. All such improvements and modifications are considered within the scope of the concepts disclosed herein and the claims that follow.
This invention was made with government support under 1055094, 1525855 and 1723476 awarded by the National Science Foundation. The government has certain rights in the invention.