I. Field of the Disclosure
The technology of the disclosure relates generally to dataflow execution of loop instructions by out-of-order processors (OOPs).
II. Background
Many modern processors are out-of-order processors (OOPs) that are capable of dataflow execution of program instructions. Using a dataflow execution approach, the execution order of program instructions by an OOP may be determined by the availability of input data for each program instruction (“dataflow order”), rather than the program order of the program instructions. Thus, the OOP may execute a program instruction as soon as all input data for the program instruction has been generated, which may result in performance gains. For example, instead of having to “stall” (i.e., intentionally introduce a processing delay) while input data is retrieved for an older program instruction, the OOP may proceed with executing a more recently fetched instruction that is able to execute immediately. In this manner, processor clock cycles that would otherwise be wasted may be productively utilized by the OOP.
A conventional OOP may employ an instruction window, which designates a set of program instructions that may be executed out of order. When execution of a program instruction within the instruction window is complete, the results of the execution may be “committed,” or made non-speculative, and the program instruction may be retired from the instruction window to make room for a new program instruction for execution. However, in some circumstances, the eviction of program instructions from the instruction window may result in inefficient operation of the OOP. For example, if the program instructions are part of a loop, the same program instructions may be executed repeatedly over multiple loop iterations. Consequently, the program instructions may be fetched, executed, and retired repeatedly from the instruction window as the loop executes.
Performance of an OOP in the circumstances described above may be improved through the use of reservation station segments. A reservation station segment is an OOP microarchitecture feature that may store a program instruction along with related information required for execution, such as operands. The OOP may load each program instruction associated with a loop into a corresponding reservation station segment. Each reservation station segment may be configured to hold a program instruction for a specified number of loop iterations, rather than retiring the program instruction before the loop has completed. When a reservation station segment determines that all input data for its program instruction is available, the reservation station segment provides the program instruction and its input data to a processor for execution. Only after the loop has completed all iterations are the program instructions associated with the loop retired from the corresponding reservation station segments.
One issue that arises with the use of reservation station segments is managing the production of input data for program instructions with respect to consumption of the input data. If a rate at which a producer instruction generates data is greater than a rate at which a consumer instruction can utilize the data as input, the data may be lost. Alternatively, the use of additional storage or buffer mechanisms may be required, which may be expensive in terms of processor cycles and/or power consumption.
Aspects disclosed in the detailed description include providing lower-overhead management of dataflow execution of loop instructions by out-of-order processors (OOPs). Related circuits, methods, and computer-readable media are also disclosed. In this regard, in one aspect, a reservation station circuit for managing dataflow execution of loop instructions in an OOP is provided. The reservation station circuit comprises a plurality of reservation station segments. Each reservation station segment includes a loop instruction register configured to store a loop instruction. Each reservation station segment further includes an instruction execution credit indicator configured to store an instruction execution credit indicative of whether the loop instruction may be provided for dataflow execution. The reservation station circuit further comprises a dataflow monitor comprising a plurality of entries corresponding to the loop instructions of the plurality of reservation station segments. Each entry of the plurality of entries comprises a consumer count indicator indicative of a number of consumer instructions of a corresponding loop instruction, and a reservation station (RS) tag count indicator indicative of a number of executions of the consumer instructions. The dataflow monitor is configured to determine whether all of the consumer instructions of a first loop instruction have executed based on the consumer count indicator and the RS tag count indicator for the first loop instruction. The dataflow monitor is further configured to, responsive to determining that all of the consumer instructions of the first loop instruction have executed, issue an instruction execution credit to a reservation station segment of the first loop instruction. 
By tracking the execution of consumer instructions and issuing an instruction execution credit to a loop instruction when all consumer instructions of the loop instruction have executed, the dataflow monitor may enable management of dataflow execution of loop instructions without incurring additional overhead, such as additional buffer space.
In another aspect, a method for managing dataflow execution of loop instructions in an OOP is provided. The method comprises determining, by a dataflow monitor, whether all consumer instructions of a first loop instruction have executed. This determination is based on a consumer count indicator of the first loop instruction indicative of a number of the consumer instructions of the first loop instruction, and an RS tag count indicator of the first loop instruction indicative of a number of executions of the consumer instructions. The method further comprises, responsive to determining that all of the consumer instructions of the first loop instruction have executed, issuing an instruction execution credit to a reservation station segment corresponding to the first loop instruction.
In another aspect, a non-transitory computer-readable medium is provided, having stored thereon computer-executable instructions. When executed by a processor, the computer-executable instructions cause the processor to determine whether all consumer instructions of a first loop instruction have executed. This determination is based on a consumer count indicator of the first loop instruction indicative of a number of the consumer instructions of the first loop instruction, and an RS tag count indicator of the first loop instruction indicative of a number of executions of the consumer instructions. The computer-executable instructions further cause the processor to issue an instruction execution credit to a reservation station segment corresponding to the first loop instruction, responsive to determining that all of the consumer instructions of the first loop instruction have executed.
With reference now to the drawing figures, several exemplary aspects of the present disclosure are described. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
In this regard,
In some environments, an application program may be conceptualized as a “pipeline” of kernels (i.e., specific areas of functionality), wherein each kernel operates on a stream of data tokens passing through the pipeline. The OOP 100 of
The OOP 100 is organized into one or more reservation station blocks (also referred to herein as “RSBs”), each of which may correspond to a general type of program instruction. For example, a stream RSB 104 may handle instructions for receiving data streams via a channel unit 106, as indicated by arrow 108. A compute RSB 110 may handle instructions that access one or more functional units 112 (e.g., an arithmetic logic unit (ALU) and/or a floating point unit) for carrying out computational operations, as indicated by arrow 114. Results produced by instructions in the compute RSB 110 may be consumed as input by other instructions in the compute RSB 110. A load RSB 116 handles instructions for loading data from and outputting data to a data store, such as a memory 118, as indicated by arrows 120 and 122. It is to be understood that the OOP 100 may be organized into more than one of each of the stream RSB 104, the compute RSB 110, and/or the load RSB 116. The stream RSB 104, the compute RSB 110, and the load RSB 116 include one or more reservation station segments (also referred to herein as “RSSs”) 124(0-X), 126(0-Y), and 128(0-Z), respectively. Each of the reservation station segments 124(0-X), 126(0-Y), and 128(0-Z) stores a single instruction, along with associated data required for dataflow execution of the resident instruction.
In typical operation, an input communications bus 130 communicates instructions for the kernel to be executed by the OOP 100 to an instruction unit 132 of the OOP 100, as indicated by arrow 134. The instruction unit 132 then loads the instructions into the one or more reservation station segments 124(0-X) of the stream RSB 104 (as indicated by arrow 136), the one or more reservation station segments 126(0-Y) of the compute RSB 110 (as indicated by arrow 138), and/or the one or more reservation station segments 128(0-Z) of the load RSB 116 (as indicated by arrow 140), based on the instruction type. A dataflow monitor 142 may also receive initialization data, such as a number of loop iterations to execute, as indicated by arrow 143.
The OOP 100 may then execute the resident instructions of the reservation station segments 124(0-X), 126(0-Y), and/or 128(0-Z) in any appropriate order. As a non-limiting example, the OOP 100 may execute the resident instructions of the reservation station segments 124(0-X), 126(0-Y), and/or 128(0-Z) in a dataflow execution order. The result (if any) produced by execution of each resident instruction and an identifier for the resident instruction are broadcast by the reservation station segments 124(0-X), 126(0-Y), and/or 128(0-Z), as indicated by arrows 144, 146, and 148, respectively. The reservation station segments 124(0-X), 126(0-Y), and/or 128(0-Z) then receive the broadcast data as input streams (as indicated by arrows 150, 152, and 154, respectively). The reservation station segments 124(0-X), 126(0-Y), and/or 128(0-Z) may monitor the respective input streams indicated by arrows 150, 152, and 154 to identify results from previously executed instructions that are required as input operands (not shown). Once detected, the input operands may be stored, and after all required operands are received, the resident instruction associated with the reservation station segment 124(0-X), 126(0-Y), and/or 128(0-Z) may be provided for dataflow execution. Loop instructions for a loop may thus be iteratively executed in a dataflow manner until the dataflow monitor 142 detects that all iterations of the loop have completed. Data may be streamed out of the OOP 100 to an output communications bus 156, as indicated by arrow 158.
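The broadcast-and-capture behavior described above can be sketched in software as follows. This is only an illustrative model, not the disclosed circuit: the `Segment` class, its `snoop` and `ready` methods, the `broadcast` helper, and the tag strings are all hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical software model of one reservation station segment."""
    tag: str                 # identifier broadcast along with each result
    sources: tuple           # tags of the producer segments, one per operand
    operands: dict = field(default_factory=dict)

    def snoop(self, src_tag, value):
        """Capture a broadcast result if it comes from one of our producers."""
        if src_tag in self.sources:
            self.operands[src_tag] = value

    def ready(self):
        """True once every required operand has been captured."""
        return all(src in self.operands for src in self.sources)


def broadcast(segments, src_tag, value):
    """Deliver one (identifier, result) pair to every segment's input stream."""
    for seg in segments:
        seg.snoop(src_tag, value)


# Assumed example: "mul" consumes the results tagged "a" and "b".
mul = Segment(tag="mul", sources=("a", "b"))
add = Segment(tag="add", sources=("mul", "c"))
segments = [mul, add]

broadcast(segments, "a", 3)
ready_after_one = mul.ready()   # only one of two operands captured so far
broadcast(segments, "b", 4)
ready_after_two = mul.ready()   # both operands captured
```

In this model a segment becomes eligible only after all of its operands have been observed on the broadcast stream, mirroring the condition under which a resident instruction may be provided for dataflow execution.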
One issue that may arise with the OOP 100 of
In this regard, the reservation station circuit 102 of
Each of the reservation station segments 124(0-X), 126(0-Y), and 128(0-Z) is associated with an instruction execution credit indicator, discussed in greater detail below with respect to
The dataflow monitor 142 is configured to issue an additional instruction execution credit 162 to each of the reservation station segments 124(0-X), 126(0-Y), and 128(0-Z) when all consumer instructions for the associated resident loop instruction have executed. To determine when the additional instruction execution credit 162 may be distributed to the reservation station segments 124(0-X), 126(0-Y), and 128(0-Z), the dataflow monitor 142 maintains entries (not shown) corresponding to each loop instruction associated with the reservation station segments 124(0-X), 126(0-Y), and 128(0-Z). Each entry includes a consumer count indicator (not shown), which is indicative of a number of consumer instructions dependent on the output of the loop instruction. Each entry further includes an RS tag count indicator (not shown), which indicates a number of times that a consumer instruction of the loop instruction corresponding to the entry has executed. As loop instructions of the reservation station segments 124(0-X), 126(0-Y), and 128(0-Z) are executed, the dataflow monitor 142 receives one or more operand source RS tags (not shown) from the reservation station segments 124(0-X), 126(0-Y), and 128(0-Z), as indicated by arrows 168, 170, and 172. Each operand source RS tag identifies the one of the reservation station segments 124(0-X), 126(0-Y), and 128(0-Z) that is associated with a “producer” loop instruction generating an operand used by the executed loop instruction. The dataflow monitor 142 increments the RS tag count indicator for the “producer” loop instruction corresponding to each operand source RS tag to indicate that a consumer instruction of the “producer” loop instruction has executed.
The dataflow monitor 142 may then evaluate the entries to determine whether all consumer instructions for each loop instruction have executed by comparing the consumer count indicator for each loop instruction to the corresponding RS tag count indicator. If the consumer count indicator and the RS tag count indicator are equal, the dataflow monitor 142 may conclude that all consumer instructions for the loop instruction have executed. The dataflow monitor 142 may then reset the RS tag count indicator for the loop instruction to zero (0), and issue an instruction execution credit to the one of the reservation station segments 124(0-X), 126(0-Y), and 128(0-Z) that holds the loop instruction. In this manner, the loop instruction may not be permitted to execute again until all of its consumer instructions have executed. This may enable lower-overhead management of dataflow execution of the loop instructions by, e.g., not requiring additional buffer storage space to track different operand values for different loop iterations. Elements of the entries stored by the dataflow monitor 142 are discussed in greater detail below with respect to
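The counting and comparison performed by the dataflow monitor can be illustrated with a minimal software sketch. The class and method names (`Entry`, `DataflowMonitor`, `on_execute`, `evaluate`) and the tag strings are hypothetical. The sketch also assumes that entries with a consumer count of zero are skipped during evaluation, a case the description above does not address.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    consumer_count: int     # number of consumers of this loop instruction
    rs_tag_count: int = 0   # consumer executions observed since the last credit


class DataflowMonitor:
    """Hypothetical model of the per-instruction entries and credit issue."""

    def __init__(self, consumer_counts):
        self.entries = {tag: Entry(n) for tag, n in consumer_counts.items()}
        self.credits = {tag: 0 for tag in consumer_counts}

    def on_execute(self, operand_source_tags):
        """A loop instruction executed; count it against each of its producers."""
        for tag in operand_source_tags:
            self.entries[tag].rs_tag_count += 1

    def evaluate(self):
        """Issue a credit to every producer whose consumers have all executed."""
        issued = []
        for tag, entry in self.entries.items():
            # Assumption: producers with no consumers are handled elsewhere.
            if entry.consumer_count > 0 and entry.rs_tag_count == entry.consumer_count:
                entry.rs_tag_count = 0      # reset for the next round
                self.credits[tag] += 1      # credit the producer's segment
                issued.append(tag)
        return issued


# Assumed example: producer "p" has two consumers, producer "q" has one.
monitor = DataflowMonitor({"p": 2, "q": 1})
monitor.on_execute(["p", "q"])      # a consumer of both p and q executes
first_round = monitor.evaluate()    # only q has seen all of its consumers
monitor.on_execute(["p"])           # the second consumer of p executes
second_round = monitor.evaluate()
```

Because a credit is returned only when the two counts match, a producer in this model cannot run further ahead than its slowest consumer.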
Aspects of the dataflow monitor 142, the stream RSB 104, the compute RSB 110, and/or the load RSB 116 may employ different techniques for detecting the completion of a loop iteration. In some aspects, an RSB (i.e., one of the stream RSB 104, the compute RSB 110, and the load RSB 116) may maintain a count of instructions that have executed during a loop iteration I. When the count of instructions executed for the loop iteration I becomes equal to a number of instructions in the RSB, the RSB communicates an end loop iteration I status (not shown) to the dataflow monitor 142. Once the dataflow monitor 142 has received an end loop iteration I status from all RSBs, the dataflow monitor 142 knows that all instructions for the loop iteration I have finished execution. The dataflow monitor 142 may then issue an additional instruction execution credit 162.
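As a rough illustration of the per-RSB counting approach, the following sketch models each RSB as a counter that reports an end-of-iteration status, and a tracker that waits for a status from every RSB. The names `RSB`, `IterationTracker`, `record_execution`, and `end_of_iteration` are hypothetical, and the sketch assumes only one loop iteration is counted at a time.

```python
class RSB:
    """Hypothetical model of a reservation station block's execution count."""

    def __init__(self, num_instructions):
        self.num_instructions = num_instructions
        self.executed = 0

    def record_execution(self):
        """Return True when every instruction in this RSB has executed once."""
        self.executed += 1
        if self.executed == self.num_instructions:
            self.executed = 0   # start counting the next iteration
            return True         # corresponds to the end loop iteration status
        return False


class IterationTracker:
    """Hypothetical monitor-side bookkeeping of end-of-iteration statuses."""

    def __init__(self, rsb_names):
        self.all_rsbs = frozenset(rsb_names)
        self.pending = set(rsb_names)

    def end_of_iteration(self, name):
        """Return True once every RSB has reported for the current iteration."""
        self.pending.discard(name)
        if not self.pending:
            self.pending = set(self.all_rsbs)
            return True         # an additional credit may now be issued
        return False


# Assumed example: a stream RSB with two instructions, a compute RSB with one.
rsbs = {"stream": RSB(2), "compute": RSB(1)}
tracker = IterationTracker(rsbs)
events = []
for name in ["stream", "compute", "stream"]:
    if rsbs[name].record_execution():
        events.append((name, tracker.end_of_iteration(name)))
```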
Some aspects may provide that each reservation station segment 124(0-X), 126(0-Y), and 128(0-Z) includes an end flag (not shown) that signifies whether the resident instruction is a “leaf” instruction in a dataflow ordering of the instructions (i.e., an instruction whose output no other instruction depends on). When all end flag instructions have executed, a loop iteration has completed. Accordingly, each resident instruction broadcasts its end flag upon execution. The dataflow monitor 142 maintains a count of the number of end flag instruction executions for a particular loop iteration I, and the total number of end flag instructions within the loop iteration I. Once the number of end flag instruction executions for the loop iteration I becomes equal to the total number of end flag instructions, the dataflow monitor 142 may conclude that all instructions for the loop iteration I have completed execution. The dataflow monitor 142 may then issue an additional instruction execution credit 162.
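The end flag approach can be sketched in the same spirit. The `EndFlagTracker` class and its `on_execute` method are hypothetical names. The sketch assumes that each broadcast carries the iteration number alongside the end flag, so that counts for different iterations are kept apart.

```python
from collections import Counter


class EndFlagTracker:
    """Hypothetical model of counting end flag executions per loop iteration."""

    def __init__(self, total_end_flag_instructions):
        self.total = total_end_flag_instructions
        self.seen = Counter()   # iteration number -> end flag executions

    def on_execute(self, iteration, end_flag):
        """Return True when the given iteration has completed."""
        if not end_flag:
            return False        # non-leaf instructions do not affect the count
        self.seen[iteration] += 1
        if self.seen[iteration] == self.total:
            del self.seen[iteration]
            return True         # all leaf instructions of this iteration ran
        return False


# Assumed example: the loop body has two leaf instructions.
tracker = EndFlagTracker(total_end_flag_instructions=2)
results = [
    tracker.on_execute(0, end_flag=False),  # a non-leaf instruction
    tracker.on_execute(0, end_flag=True),   # first leaf of iteration 0
    tracker.on_execute(1, end_flag=True),   # a leaf of iteration 1 runs early
    tracker.on_execute(0, end_flag=True),   # second leaf of iteration 0
]
```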
The reservation station segment 200 of
The reservation station segment 200 also provides storage for data that may be required by the loop instruction 206 to execute. In the example of
Similarly, to store data associated with the second operand, the reservation station segment 200 provides an operand source RS tag 220 and an operand buffer 214(1). The operand buffer 214(1) includes one or more operand buffer entries 222(0)-222(N), and a corresponding one or more operand ready flags 224(0)-224(N). The operand source RS tag 220, the operand buffer entries 222(0)-222(N), and the operand ready flags 224(0)-224(N) may function in a manner corresponding to the functionality of the operand source RS tag 212, the operand buffer entries 216(0)-216(N), and the operand ready flags 218(0)-218(N), respectively.
The reservation station segment 200 also includes an iteration counter 226. The iteration counter 226 may be set to an initial value of zero (0), and may be subsequently incremented with each execution of the loop instruction 206. A current value of the iteration counter 226 may be provided by the reservation station segment 200 when the loop instruction 206 is provided for dataflow execution. In this manner, the current value of the iteration counter 226 may be used by subsequently-executing consumer instructions to determine the loop iteration in which the loop instruction 206 executed.
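A minimal sketch of the iteration counter follows. The class name and method are hypothetical, and the sketch assumes the counter value is read before it is incremented, an ordering the description above leaves open.

```python
class IterationTaggedSegment:
    """Hypothetical model of a segment that tags each issue with an iteration."""

    def __init__(self, instruction):
        self.instruction = instruction
        self.iteration_counter = 0   # initial value of zero

    def provide_for_execution(self):
        """Return the instruction with the iteration in which it executes."""
        tagged = (self.instruction, self.iteration_counter)
        self.iteration_counter += 1  # incremented with each execution
        return tagged


seg = IterationTaggedSegment("add")
first_issue = seg.provide_for_execution()
second_issue = seg.provide_for_execution()
```

A consumer receiving the tagged result could compare the tag against its own counter to tell which loop iteration produced the operand.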
The reservation station segment 200 additionally includes an instruction execution credit indicator 228, which stores an instruction execution (“instr ex”) credit 230 distributed to the reservation station segment 200 by the dataflow monitor 142 of
In
In the example of
To illustrate how the reservation station circuit 102 of
At time interval 0, the dataflow monitor 142 of the reservation station circuit 102 distributes an initial instruction execution credit, such as the initial instruction execution credit 160 of
Because input data for the resident stream instructions of the RSS 300, the RSS 302, and the RSS 304 is readily available, the resident stream instructions effectively have no data dependencies. Therefore, the resident stream instructions associated with the RSS 300, the RSS 302, and the RSS 304 are eligible for dataflow execution. In the example of
At time interval 3, both operands for the resident multiply instruction of the RSS 306 have been received, and thus the resident multiply instruction is eligible for dataflow execution. The resident stream instruction for the RSS 304 is also eligible for dataflow execution, having an instruction execution credit greater than zero (0) and no effective data dependencies. In this example, the RSS 306 provides its resident multiply instruction to a functional unit, such as the functional unit 112 of
At time interval 4, the dataflow monitor 142 determines that the consumer count indicator for the RSS 300 (which has a value of 1, as seen in
At time interval 5, either of the resident stream instructions associated with the RSS 300 and the RSS 304 is eligible for dataflow execution. In the example of
At time interval 7, the dataflow monitor 142 determines that the consumer count indicator for the RSS 302 (which has a value of 2, as seen in
At time interval 8, the resident stream instructions associated with the RSS 300, the RSS 302, and the RSS 304 and the resident add instruction associated with the RSS 318 are each eligible for execution. In the example of
Finally, at time interval 11, the resident add instruction associated with the RSS 318 is the only instruction with an instruction execution credit greater than zero (0). As a result, while input data may be available to the resident instructions of the RSS 300, the RSS 302, the RSS 306, the RSS 308, and/or the RSS 318, none of the resident instructions may be executed again until additional credits are distributed by the dataflow monitor 142. This allows the resident instruction of the RSS 318 to “catch up” by providing time to consume the data produced by its producer instructions. Thus, at time interval 11, the RSS 318 provides its resident add instruction to the functional unit 112 for execution, and decrements its instruction execution credit to zero (0). The operand RS tags for the RSS 318 (i.e., the RS tags for the RSS 306 and the RSS 308) will also be received by the dataflow monitor 142, which increments the RS tag count indicators for the RSS 306 and the RSS 308 to one (1).
In some aspects, upon execution of the resident add instruction of the RSS 318, the dataflow monitor 142 may detect the end flag 324 of the RSS 318, and may determine that one iteration of the loop has completed. Accordingly, at time interval 11, the dataflow monitor 142 may distribute an additional instruction execution credit to each of the RSS 300, the RSS 302, the RSS 304, the RSS 306, the RSS 308, and the RSS 318 (not shown). In this case, distribution of the additional instruction execution credit would have the effect of incrementing the instruction execution credit associated with each RSS 300, 302, 304, 306, 308, and 318 to one (1). Dataflow execution of the resident instructions of the RSS 300, the RSS 302, the RSS 304, the RSS 306, the RSS 308, and the RSS 318 would then continue on in this manner.
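The credit-gated walk-through above can be approximated with a small simulation. Everything in the sketch is an assumption made for illustration: the dependency graph in `GRAPH` is hypothetical and is not taken from the figures, one instruction executes per step, the first eligible instruction in listing order is chosen, and a leaf instruction (one with no consumers) is simply given a new credit as soon as it executes rather than through an end-of-iteration distribution.

```python
# Hypothetical graph: s0, s1, s2 are stream instructions with no in-loop
# producers; mul consumes s0 and s1; sub consumes s1 and s2; add consumes
# mul and sub and is the only leaf instruction.
GRAPH = {
    "s0": (), "s1": (), "s2": (),
    "mul": ("s0", "s1"),
    "sub": ("s1", "s2"),
    "add": ("mul", "sub"),
}


def simulate(graph, steps):
    """Run a credit-gated schedule; return the order and the worst run-ahead."""
    edges = [(src, dst) for dst, srcs in graph.items() for src in srcs]
    consumers = {t: 0 for t in graph}
    for src, _ in edges:
        consumers[src] += 1
    credits = {t: 1 for t in graph}     # initial instruction execution credit
    tag_count = {t: 0 for t in graph}   # RS tag count per producer
    executed = {t: 0 for t in graph}    # executions so far per instruction
    order, max_lead = [], 0
    for _ in range(steps):
        # Eligible: holds a credit and each producer has run ahead of it.
        eligible = [t for t in graph
                    if credits[t] > 0
                    and all(executed[s] > executed[t] for s in graph[t])]
        if not eligible:
            break
        t = eligible[0]
        credits[t] -= 1
        executed[t] += 1
        order.append(t)
        for s in graph[t]:              # operand source RS tags reach the monitor
            tag_count[s] += 1
            if tag_count[s] == consumers[s]:
                tag_count[s] = 0
                credits[s] += 1         # all consumers of s have executed
        if consumers[t] == 0:
            credits[t] += 1             # assumption: leaf is re-credited at once
        max_lead = max(max_lead,
                       max(executed[s] - executed[d] for s, d in edges))
    return order, max_lead


order, max_lead = simulate(GRAPH, 9)
```

Under these assumptions, no producer is ever more than one execution ahead of any of its consumers, which is why a single operand slot per source suffices in the model.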
To illustrate exemplary operations for providing lower-overhead management of loop instructions in the exemplary OOP 100 of
In
After the loop instruction 206 is provided for dataflow execution, the reservation station segment 200 may decrement the instruction execution credit 230 of the loop instruction 206 (block 606). The dataflow monitor 142 may then receive one or more operand source RS tags 212, 220 for the loop instruction 206 (block 608). The dataflow monitor 142 next may increment an RS tag count indicator 416 for one or more entries 402-412 indicated by the one or more operand source RS tags 212, 220 (block 610). Processing then resumes at block 612 of
Referring now to
Providing lower-overhead management of dataflow execution of loop instructions by OOPs, and related circuits, methods, and computer-readable media, according to aspects disclosed herein may be provided in or integrated into any processor-based device. Examples, without limitation, include a set top box, an entertainment unit, a navigation device, a communications device, a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a computer, a portable computer, a desktop computer, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a digital video player, a video player, a digital video disc (DVD) player, and a portable digital video player.
In this regard,
Other master and slave devices can be connected to the system bus 708. As illustrated in
The CPU(s) 702 may also be configured to access the display controller(s) 722 over the system bus 708 to control information sent to one or more displays 726. The display controller(s) 722 sends information to the display(s) 726 to be displayed via one or more video processors 728, which process the information to be displayed into a format suitable for the display(s) 726. The display(s) 726 can include any type of display, including but not limited to a cathode ray tube (CRT), a liquid crystal display (LCD), a plasma display, etc.
Those of skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the aspects disclosed herein may be implemented as electronic hardware, instructions stored in memory or in another computer-readable medium and executed by a processor or other processing device, or combinations of both. The master and slave devices described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip, as examples. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends upon the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The aspects disclosed herein may be embodied in hardware and in instructions that are stored in hardware, and may reside, for example, in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
It is also noted that the operational steps described in any of the exemplary aspects herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary aspects may be combined. It is to be understood that the operational steps illustrated in the flow chart diagrams may be subject to numerous different modifications as will be readily apparent to one of skill in the art. Those of skill in the art will also understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The present application claims priority to U.S. Provisional Patent Application Ser. No. 62/135,738 filed on Mar. 20, 2015 and entitled “PROVIDING LOWER-OVERHEAD MANAGEMENT OF DATAFLOW EXECUTION OF LOOP INSTRUCTIONS BY OUT-OF-ORDER PROCESSORS (OOPS), AND RELATED CIRCUITS, METHODS, AND COMPUTER-READABLE MEDIA,” the contents of which are incorporated herein by reference in their entirety.
Number | Date | Country
---|---|---
62135738 | Mar 2015 | US