Technology trends such as growing wire delays, power consumption limits, and diminishing clock rate improvements are presenting historical instruction set architectures such as RISC (reduced instruction set computer), CISC (complex instruction set computer), and VLIW (very long instruction word) with difficult challenges. To sustain continued performance growth, responsibilities among the programmer, compiler, and processor hardware will likely need to be shared in ways that maximize opportunities for discovering and exploiting high instruction-level parallelism.
Efficient encoding of high fanout communication patterns in computer programming is achieved through utilization of producer and move instructions in an instruction set architecture (ISA) that supports direct instruction communication where a producer encodes identities of consumers of results directly within an instruction. The producer instructions may fully encode the targeted consumers with an explicit target distance or utilize compressed target encoding in which a field in the instruction provides a bit vector which specifies the list of consumer instructions. A variety of move instructions target different numbers of consumers and may also utilize full or compressed target encoding. In consumer paths where a producer is unable to target all consumers, a compiler may utilize various combinations of producer and move instructions, using full and/or compressed target encoding to build a fanout tree that efficiently propagates the producer results to all the targeted consumers.
The present high fanout communication encoding can advantageously improve the functionality of computer processors that run direct instruction communication ISAs by reducing the bit-per-instruction overhead when implementing fanouts when compared to the conventional broadcasting alternative. In addition, direct instruction communication can reduce reliance on broadcasting channels which are typically limited. The broadcast ID (identifier) field in the instructions, which enables results to be broadcast over a lightweight network to all instructions listening on that ID, can be repurposed or eliminated in some cases which may further enhance processor efficiency by increasing functionality or reducing instruction length.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure. It will be appreciated that the above-described subject matter may be implemented as a computer-controlled apparatus, a computer process, a computing system, or as an article of manufacture such as one or more computer-readable storage media. These and various other features will be apparent from a reading of the following Detailed Description and a review of the associated drawings.
Like reference numerals indicate like elements in the drawings. Elements are not drawn to scale unless otherwise indicated.
Explicit Data Graph Execution (EDGE) is an instruction set architecture (ISA) that partitions the work between a compiler and processor hardware differently from conventional RISC (Reduced Instruction Set Computer), CISC (Complex Instruction Set Computer), or VLIW (Very Long Instruction Word) ISAs to enable high instruction-level parallelism with high energy efficiency. Instructions inside blocks execute in dataflow order, which removes the need for costly register renaming and provides power-efficient out-of-order execution.
The EDGE ISA defines the structure of, and the restrictions on, these blocks. In addition, instructions within each block employ direct instruction communication rather than communication through registers as in conventional ISAs. While the following discussion is provided in the context of EDGE, the disclosed techniques may also be applicable to other ISAs and microarchitectures that utilize direct instruction communication.
An EDGE compiler encodes instruction dependencies explicitly using target-form encoding, thereby eliminating the need for the processor hardware to discover dependencies dynamically at runtime. Thus, for example, if instruction P produces a value for instruction C, P's instruction bits specify C as a consumer, and the processor hardware routes P's output result directly to C. Computing devices using EDGE may thus be viewed as dataflow machines that enable fine-grained data-driven parallel computation.
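The target-form routing described above can be sketched in a few lines of Python. This is an illustrative model only — the instruction names, fields, and two-instruction example are hypothetical, not the actual EDGE encoding: each producer carries the indices of its consumers, and results are delivered directly to the consumers' operand slots with no architectural registers involved.

```python
# Illustrative sketch of target-form (direct instruction communication)
# encoding; opcodes and fields are hypothetical, not the real EDGE format.
from dataclasses import dataclass, field

@dataclass
class Instr:
    opcode: str                                    # e.g. "movi", "add"
    imm: int = 0                                   # immediate for "movi"
    targets: tuple = ()                            # indices of consumers
    operands: list = field(default_factory=list)   # values delivered by producers

def run_block(block):
    """Execute a block of instructions. Targets point forward within the
    block, so executing in program order respects dataflow order."""
    results = {}
    for i, ins in enumerate(block):
        if ins.opcode == "movi":
            value = ins.imm
        elif ins.opcode == "add":
            value = sum(ins.operands)
        else:
            raise ValueError(f"unknown opcode {ins.opcode}")
        results[i] = value
        # Route the result directly to each encoded consumer -- no registers.
        for t in ins.targets:
            block[t].operands.append(value)
    return results

block = [
    Instr("movi", imm=3, targets=(2,)),   # I0 produces 3 for I2
    Instr("movi", imm=4, targets=(2,)),   # I1 produces 4 for I2
    Instr("add"),                         # I2 consumes both deliveries
]
print(run_block(block)[2])  # -> 7
```

Note that the ADD instruction names no sources at all; its inputs are fully determined by the producers that target it.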
As shown in
Two fields 205 and 220 hold the operation code (opcode), which specifies the operation to execute, along with the number of input operands to receive. The predicate field 210 indicates whether the instruction must wait on a predicate bit and, if so, whether to execute when that bit is true or false. The broadcast ID field 215 enables results to be broadcast over a lightweight network to all instructions listening on that ID. In implementations where the present efficient encoding of high fanout communications is utilized, the broadcast ID field may optionally be repurposed for other uses.
As the instruction 200 supports dataflow encoding for two targets, as shown in
The sequence thus defines a particular consumer path 515 by which results are communicated among instructions in a block. The present encoding techniques for high fanout communications may be used in scenarios in which producers and consumers are in distinct instruction blocks as well. Each MOV2 instruction propagates an input to two outputs. Thus, the ten MOV2 instructions shown in
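The growth of a MOV2 fanout tree can be sketched as follows. This is a hedged, simplified model (the function name and bookkeeping are illustrative): each MOV2 consumes one pending copy of the producer's result and forwards it to two targets, so every MOV2 increases the number of deliverable copies by one.

```python
# Illustrative model of a MOV2 fanout tree: each MOV2 forwards its single
# input to exactly two outputs, so each move adds one deliverable copy.

def mov2_fanout(value, n_consumers):
    """Return the deliveries to consumers and the number of MOV2 nodes used."""
    outputs = [value]   # the producer starts with one pending copy
    moves = 0
    while len(outputs) < n_consumers:
        src = outputs.pop(0)         # one MOV2 consumes a pending copy...
        outputs.extend([src, src])   # ...and forwards it to two targets
        moves += 1
    return outputs[:n_consumers], moves

deliveries, moves = mov2_fanout(42, 8)
print(moves)  # -> 7 (reaching 8 consumers costs 7 MOV2 instructions)
```

Under this model a fanout of N consumers costs N − 1 MOV2 instructions, which motivates the wider move variants discussed next.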
Communication fanouts can be similarly implemented using the MOV3 and MOV4 instructions. For example,
As shown with MOV3, the number of consumers per instruction can increase which can thereby enable a producer to reach more consumers with fewer instructions being needed overall to build the fanout tree when compared to the MOV2 instruction. However, the instruction length also increases to support the additional consumer target encodings. There can thus be tradeoffs among target reach distance, instruction count, and instruction length. Increasing the number of consumers specified by a MOV instruction, at some point, yields an unsupportable instruction length.
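The tradeoff between instruction count and instruction length can be made concrete with a back-of-the-envelope calculation. The instruction lengths below are purely hypothetical placeholders (the source does not specify bit widths); the point is only the shape of the tradeoff: a move with k fully encoded targets adds k − 1 deliveries per instruction, but each added target field lengthens the instruction.

```python
# Hypothetical count-vs-length tradeoff for k-target full-encoding moves.
# A MOVk consumes one pending copy of the result and produces k copies,
# so each MOVk adds k - 1 deliveries toward the fanout.
import math

def moves_needed(n_consumers, k):
    """MOVk instructions needed to grow 1 pending copy into n_consumers."""
    return math.ceil((n_consumers - 1) / (k - 1))

# Instruction lengths below are illustrative assumptions, not real encodings.
for k, length_bits in [(2, 32), (3, 41), (4, 50)]:
    m = moves_needed(12, k)
    print(f"MOV{k}: {m} moves, {m * length_bits} total bits")
```

Larger k reduces the move count faster than it grows the per-instruction length — until, as noted above, the instruction length becomes unsupportable.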
As shown, the 12 consumer instructions 915 in the consumer path 920 are targeted using four MOVHF8 instructions. Accordingly, the compressed encoding methodology using a bit vector may enable more fanout per bit of move instruction but may suffer a limitation of not being able to specify an arbitrary consumer instruction as with the full encoding methodology. The maximum target reach may also be limited as a compressed encoded target field would need a 128-bit field to reach the 128th subsequent instruction in a block. However, the MOVHF8 instruction provides an efficient encoding methodology compared to MOV2, as discussed above, and supports high fanout implementations.
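The compressed bit-vector encoding can be sketched as a simple decode step. The function below is an illustrative interpretation of a MOVHF8-style field (the exact bit ordering in the real encoding is an assumption): bit i of an 8-bit vector selects the (i+1)-th subsequent instruction, which is why reaching the 128th subsequent instruction would require a 128-bit field.

```python
# Illustrative decode of a MOVHF8-style compressed target field: bit i set
# means "deliver the result to the (i+1)-th subsequent instruction". The
# bit ordering here is an assumption for the sketch.

def decode_targets(move_index, bit_vector, width=8):
    """Return absolute indices of the consumers selected by the bit vector."""
    return [move_index + 1 + i for i in range(width) if (bit_vector >> i) & 1]

# A move at index 10 targeting its 1st, 3rd, and 8th subsequent instructions:
print(decode_targets(10, 0b10000101))  # -> [11, 13, 18]
```

Eight targets cost only eight bits here, versus an explicit distance field per target under full encoding — the "more fanout per bit" property noted above — at the price of a bounded reach.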
As shown, two MOVHF24 instructions are utilized to communicate the result 1105 to the consumer instructions 1115 according to the path 1120. The MOVHF24 instruction provides efficient target encoding and may support large fanouts. While its reach is less limited than that of the MOVHF8 instruction, the MOVHF24 instruction is twice as long.
The high fanout producer with compressed target encoding may be used to implement larger fanouts than those supported by the current producer instructions which are limited to two target fields as shown in
To communicate the result to each of the consumer instructions 1315 along the path 1320 that extends beyond the four subsequent consecutive instructions, a fanout tree 1300 may be utilized, as shown. Any combination of the move instruction types MOV2, MOV3, MOV4, MOVHF8, and MOVHF24 may be utilized in the fanout tree.
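One way a compiler might combine the move types is a simple greedy pass, sketched below. This is an assumption for illustration — the source does not prescribe a selection algorithm — and it minimizes only instruction count, ignoring reach limits and per-type instruction lengths: at each step it picks the largest move whose extra deliveries do not overshoot the remaining need.

```python
# Illustrative greedy planner mixing move types to cover a fanout.
# Each move with k targets consumes one pending copy and produces k,
# i.e. it adds k - 1 deliveries. Reach limits are ignored in this sketch.

MOVE_FANOUT = {"MOV2": 2, "MOV3": 3, "MOV4": 4, "MOVHF8": 8, "MOVHF24": 24}

def plan_fanout(n_consumers):
    """Return a list of move mnemonics whose combined fanout covers n_consumers."""
    plan = []
    deliveries = 1   # the producer starts with one pending copy
    while deliveries < n_consumers:
        need = n_consumers - deliveries
        # largest move whose extra deliveries (k - 1) fit within the need
        best = max((m for m, k in MOVE_FANOUT.items() if k - 1 <= need),
                   key=lambda m: MOVE_FANOUT[m])
        plan.append(best)
        deliveries += MOVE_FANOUT[best] - 1
    return plan

print(plan_fanout(12))  # -> ['MOVHF8', 'MOV4', 'MOV2']
```

A production compiler would additionally weigh each move's reach and bit cost, per the tradeoffs discussed above.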
Both the fully encoded and compressed encoded producer instructions, in combination with the move instructions, can reduce overhead for given consumer paths having high fanout compared with conventional broadcast channel utilization to thereby improve processor performance. For example, in most performance benchmarks, the overhead (expressed as added bits per static instruction) is significantly less for direct dataflow communication using the techniques described herein compared to broadcast channels. However, in consumer paths that have a relatively high number of consumers relative to the overall instruction count, utilization of broadcast channels for high fanout communications may provide improved processor performance by reducing move operations. The reduction in move operations can result in fetching and executing fewer instructions which can save energy.
The processor architecture 1720 typically includes multiple processor cores (representatively indicated by reference numeral 1725), in a tiled configuration, that are interconnected by an on-chip network (not shown) and further interoperate with one or more level 2 (L2) caches (representatively indicated by reference numeral 1730). While the number and configuration of cores and caches can vary by implementation, it is noted that the physical cores can be merged together, in a process termed “composing,” during runtime of the program 1715, into one or more larger logical processors that can enable more processing power to be devoted to a program execution. Alternatively, when program execution supports suitable thread-level parallelism, the cores 1725 can be split, in a process called “decomposing,” to work independently and execute instructions from independent threads.
The front-end control unit 1802 may include circuitry configured to control the flow of information through the processor core and circuitry to coordinate activities within it. The front-end control unit 1802 also may include circuitry to implement a finite state machine (FSM) in which states enumerate each of the operating configurations that the processor core may take. Using opcodes (as described below) and/or other inputs (e.g., hardware-level signals), the FSM circuits in the front-end control unit 1802 can determine the next state and control outputs.
Accordingly, the front-end control unit 1802 can fetch instructions from the instruction cache 1804 for processing by the instruction decoder 1808. The front-end control unit 1802 may exchange control information with other portions of the processor core 1725 over control networks or buses. For example, the front-end control unit may exchange control information with a back-end control unit. The front-end and back-end control units may be integrated into a single control unit in some implementations.
The front-end control unit 1802 may also coordinate and manage control of various cores and other parts of the processor architecture 1720 (
The front-end control unit 1802 may further process control information and meta-information regarding blocks of instructions that are executed atomically. For example, the front-end control unit 1802 can process block headers that are associated with blocks of instructions. The block header may include control information and/or meta-information regarding the block of instructions. Accordingly, the front-end control unit 1802 can include combinational logic, state machines, and temporary storage units, such as flip-flops to process the various fields in the block header.
The front-end control unit 1802 may fetch and decode a single instruction or multiple instructions per clock cycle. The decoded instructions may be stored in an instruction window 1810 that is implemented in processor core hardware as a buffer. The instruction window 1810 can support an instruction scheduler 1830, in some implementations, which may track the ready state of each decoded instruction's inputs such as predicates and operands. For example, when all of its inputs (if any) are ready, a given instruction may be woken up by instruction scheduler 1830 and be ready to issue.
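The wakeup bookkeeping just described can be sketched as follows. The class and its structure are illustrative assumptions (the actual scheduler 1830 is hardware, not software): the window tracks how many inputs each decoded instruction still awaits; each delivery decrements the count, and a count of zero wakes the instruction.

```python
# Illustrative sketch of ready-state tracking in an instruction window:
# an instruction wakes up once all of its operand/predicate inputs arrive.

class InstructionWindow:
    def __init__(self):
        self.outstanding = {}   # instr id -> number of inputs not yet received
        self.ready = []         # instr ids woken up and ready to issue

    def insert(self, instr_id, n_inputs):
        """Place a decoded instruction in the window."""
        if n_inputs == 0:
            self.ready.append(instr_id)      # no inputs: ready immediately
        else:
            self.outstanding[instr_id] = n_inputs

    def deliver(self, instr_id):
        """A producer (or move) delivered one operand or predicate bit."""
        self.outstanding[instr_id] -= 1
        if self.outstanding[instr_id] == 0:
            del self.outstanding[instr_id]
            self.ready.append(instr_id)      # all inputs ready: wake up

w = InstructionWindow()
w.insert("I5", n_inputs=2)
w.deliver("I5")
w.deliver("I5")
print(w.ready)  # -> ['I5']
```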
Before an instruction is issued, any operands required by the instruction may be stored in the left operand buffer 1812 and/or the right operand buffer 1814, as needed. Depending on the opcode of the instruction, operations may be performed on the operands using ALU 1816 and/or ALU 1818 or other functional units. The outputs of an ALU may be stored in an operand buffer or stored in one or more registers 1820. Store operations that issue in dataflow order may be queued in load/store queue 1822 until a block of instructions commits. When the block of instructions commits, the load/store queue 1822 may write the committed block's stores to a memory. The branch predictor 1806 may process block header information relating to branch exit types and factor that information in making branch predictions.
As noted above, the processor architecture 1720 (
In this way, source operands are not specified with the instruction, and instead, they are specified by the instructions that target the ADD instruction. The compiler 1705 (
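The contrast between conventional register-form encoding and the target-form encoding of the ADD can be sketched as data literals. Both encodings below are illustrative (field names and the operand-slot convention are assumptions): in register form the consumer names its sources, while in target form the ADD names nothing and its producers name the ADD instead.

```python
# Illustrative contrast of register-form vs. target-form encodings.

# Register form: the ADD names its own sources and destination.
register_form = {"op": "add", "src": ["r1", "r2"], "dst": "r3"}

# Target form: each producer names the ADD (instruction index 2) and the
# operand slot (left/right) its result should fill; the ADD has no
# source operand fields at all.
target_form = [
    {"op": "read", "reg": "r1", "targets": [(2, "left")]},
    {"op": "read", "reg": "r2", "targets": [(2, "right")]},
    {"op": "add"},
]

print("src" in target_form[2])  # -> False
```

Because the hardware never has to match source names against in-flight destinations, the costly register renaming of conventional out-of-order designs is avoided.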
By way of example, and not limitation, computer-readable storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. For example, computer-readable media includes, but is not limited to, RAM, ROM, EPROM (erasable programmable read only memory), EEPROM (electrically erasable programmable read only memory), Flash memory or other solid state memory technology, CD-ROM, DVDs, HD-DVD (High Definition DVD), Blu-ray or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the architecture 1900.
According to various embodiments, the architecture 1900 may operate in a networked environment using logical connections to remote computers through a network. The architecture 1900 may connect to the network through a network interface unit 1916 connected to the bus 1910. It may be appreciated that the network interface unit 1916 also may be utilized to connect to other types of networks and remote computer systems. The architecture 1900 also may include an input/output controller 1918 for receiving and processing input from a number of other devices, including a keyboard, mouse, touchpad, touchscreen, control devices such as buttons and switches, or electronic stylus (not shown in
It may be appreciated that the software components described herein may, when loaded into the processor 1902 and executed, transform the processor 1902 and the overall architecture 1900 from a general-purpose computing system into a special-purpose computing system customized to facilitate the functionality presented herein. The processor 1902 may be constructed from any number of transistors or other discrete circuit elements, which may individually or collectively assume any number of states. More specifically, the processor 1902 may operate as a finite-state machine, in response to executable instructions contained within the software modules disclosed herein. These computer-executable instructions may transform the processor 1902 by specifying how the processor 1902 transitions between states, thereby transforming the transistors or other discrete hardware elements constituting the processor 1902.
Encoding the software modules presented herein also may transform the physical structure of the computer-readable storage media presented herein. The specific transformation of physical structure may depend on various factors, in different implementations of this description. Examples of such factors may include, but are not limited to, the technology used to implement the computer-readable storage media, whether the computer-readable storage media is characterized as primary or secondary storage, and the like. For example, if the computer-readable storage media is implemented as semiconductor-based memory, the software disclosed herein may be encoded on the computer-readable storage media by transforming the physical state of the semiconductor memory. For example, the software may transform the state of transistors, capacitors, or other discrete circuit elements constituting the semiconductor memory. The software also may transform the physical state of such components in order to store data thereupon.
As another example, the computer-readable storage media disclosed herein may be implemented using magnetic or optical technology. In such implementations, the software presented herein may transform the physical state of magnetic or optical media, when the software is encoded therein. These transformations may include altering the magnetic characteristics of particular locations within given magnetic media. These transformations also may include altering the physical features or characteristics of particular locations within given optical media to change the optical characteristics of those locations. Other transformations of physical media are possible without departing from the scope and spirit of the present description, with the foregoing examples provided only to facilitate this discussion.
Various exemplary embodiments of the present efficient encoding of high fanout communications are now presented by way of illustration and not as an exhaustive list of all embodiments. An example includes a method for communicating a result from a producer instruction to a plurality of consumer instructions using a fanout, the method comprising: executing the producer instruction from which a result derives; encoding two or more target instructions which enable the producer instruction to specify the plurality of consumer instructions, in which at least one of the two or more target instructions identifies a move instruction; executing a plurality of move instructions using the encoded two or more target instructions; and communicating the result derived from the producer instruction to each of the consumer instructions identified from the two or more target instructions.
In another example, the method includes at least one move instruction in the plurality identifying two target instructions using full target encoding comprising specification of an explicit binary target distance between the move instruction and the target instruction. In another example, the method includes at least one move instruction identifying three or four target instructions using full target encoding comprising specification of an explicit binary target distance between the move instruction and the target instruction. In another example, the method includes at least one move instruction identifying four or more target instructions using compressed target encoding. In another example, the method includes multiple different instruction lengths being utilized to accommodate differing scenarios to realize a fanout. In another example, the method includes multiple different instruction lengths being utilized to realize a given fanout situation by trading off the number of instructions and the size of instructions necessary to realize the fanout. In another example, the method includes the producer instruction supporting full target encoding or compressed target encoding of two or more target instructions. In another example, the method includes the producer and consumer instructions sharing a common instruction block or being in distinct instruction blocks. In another example, the method includes the target instructions being encoded using a bit vector.
A further example includes an instruction block-based microarchitecture, comprising: a control unit; and an instruction window configured to store decoded instruction blocks associated with a program under control of the control unit, in which the control unit includes operations to: store a result of an executed producer instruction that includes compressed encoded targets, execute at least one move instruction that is identified as a target in the producer instruction, in which the executed at least one move instruction implements a fanout to communicate the result to each of a plurality of consumer instructions, and fetch the result for each of the consumer instructions in the fanout.
In another example, the producer instruction encodes at least two target instructions. In another example, at least one move instruction identifies at least two subsequent target instructions in the fanout. In another example, the at least one move instruction identifies one of two, three, four, eight, or 24 subsequent target instructions in the fanout. In another example, the at least one move instruction uses one of full target encoding or compressed target encoding. In another example, the at least one move instruction uses compressed target encoding using a bit position indicator where each bit in the indicator corresponds to a respective subsequent target instruction.
A further example includes one or more hardware-based non-transitory computer readable memory devices storing computer-executable instructions which, upon execution by a processor in a computing device, cause the computing device to execute a producer instruction that includes a plurality of compressed encoded targets that identify consumer instructions that comprise a fanout; place a result of the executed producer instruction in at least one operand buffer disposed in the processor; and communicate the result from the at least one operand buffer for use by each of the consumer instructions in the fanout.
In another example, the producer instruction includes a target field and the compressed encoded targets are encoded using a bit vector in the target field. In another example, the bit vector encoding specifies multiple consumer instructions based on a bit position. In another example, the bit vector is at least 4 bits in length. In another example, the processor uses an EDGE (Explicit Data Graph Execution) block-based instruction set architecture (ISA).
In light of the above, it may be appreciated that many types of physical transformations take place in the architecture 1900 in order to store and execute the software components presented herein. It also may be appreciated that the architecture 1900 may include other types of computing devices, including wearable devices, handheld computers, embedded computer systems, smartphones, PDAs, and other types of computing devices known to those skilled in the art. It is also contemplated that the architecture 1900 may not include all of the components shown in
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.