The present disclosure relates generally to compilers, and more particularly, to methods and apparatus to pre-execute instructions on a single thread.
In an effort to improve and optimize performance of processor systems, many different pre-fetching techniques (i.e., anticipating the need for data input requests) are used to remove or “hide” latency (i.e., delay) in processor systems. In particular, pre-fetch algorithms (i.e., pre-execution or pre-computation) are used to pre-fetch data for cache misses associated with data addresses that are difficult to predict at compile time. That is, a compiler first identifies the instructions needed to generate the data addresses of the cache misses, and then speculatively pre-executes those instructions. In most pre-fetch algorithms, pre-execution of instructions is performed on separate threads (i.e., multi-threaded) while normal execution is performed on the main thread. A thread is the state information needed to serve a particular service request. For example, a thread is created when a program initiates an input/output (I/O) request such as reading a file or writing to a printer. The data kept as part of the thread allows a processor to reenter the program at the proper place when the I/O operation is completed. Although most pre-fetch approaches are particularly well suited for multi-thread processor systems, they may not be suitable for single-thread processor systems.
Although the following discloses example systems including, among other components, software or firmware executed on hardware, it should be noted that such systems are merely illustrative and should not be considered as limiting. For example, it is contemplated that any or all of the disclosed hardware, software, and/or firmware components could be embodied exclusively in hardware, exclusively in software, exclusively in firmware or in some combination of hardware, software, and/or firmware.
The processor system 100 illustrated in
As is conventional, the memory controller 112 performs functions that enable the processor 120 to access and communicate with a main memory 130 including a volatile memory 132 and a non-volatile memory 134 via a bus 140. The volatile memory 132 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM), and/or any other type of random access memory device. The non-volatile memory 134 may be implemented using flash memory, Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), and/or any other desired type of memory device.
The processor system 100 also includes an interface circuit 150 that is coupled to the bus 140. The interface circuit 150 may be implemented using any type of well known interface standard such as an Ethernet interface, a universal serial bus (USB), a third generation input/output interface (3GIO) interface, and/or any other suitable type of interface.
One or more input devices 160 are connected to the interface circuit 150. The input device(s) 160 permit a user to enter data and commands into the processor 120. For example, the input device(s) 160 may be implemented by a keyboard, a mouse, a touch-sensitive display, a track pad, a track ball, an isopoint, and/or a voice recognition system.
One or more output devices 170 are also connected to the interface circuit 150. For example, the output device(s) 170 may be implemented by display devices (e.g., a light emitting diode (LED) display, a liquid crystal display (LCD), or a cathode ray tube (CRT) display), a printer, and/or speakers. The interface circuit 150, thus, typically includes, among other things, a graphics driver card.
The processor system 100 also includes one or more mass storage devices 180 configured to store software and data. Examples of such mass storage device(s) 180 include floppy disks and drives, hard disk drives, compact disks and drives, and digital versatile disks (DVD) and drives.
The interface circuit 150 also includes a communication device such as a modem or a network interface card to facilitate exchange of data with external computers via a network. The communication link between the processor system 100 and the network may be any type of network connection such as an Ethernet connection, a digital subscriber line (DSL), a telephone line, a cellular telephone system, a coaxial cable, etc.
Access to the input device(s) 160, the output device(s) 170, the mass storage device(s) 180 and/or the network is typically controlled by the I/O controller 114 in a conventional manner. In particular, the I/O controller 114 performs functions that enable the processor 120 to communicate with the input device(s) 160, the output device(s) 170, the mass storage device(s) 180 and/or the network via the bus 140 and the interface circuit 150.
While the components shown in
In the example of
The original code 210 (e.g., described in detail below and shown as 400 in
The instruction identifier 220 is configured to identify one or more instructions associated with a latency condition(s) in the original code 210. That is, the instruction identifier 220 identifies one or more instructions associated with the latency condition(s), such as instructions associated with cache misses (i.e., requests by code to read from memory that cannot be satisfied from the cache 270, e.g., one shown as 122 in
Referring back to
After a latency instruction has been identified, the slice identifier 230 is configured to identify a slice (i.e., a collection) of instructions associated with the latency instruction. In particular, the slice of instructions includes one or more instructions configured to generate a data address associated with the latency instruction. The data address may be stored in a register and/or any other data structure that passes data from one or more instructions and/or programs to another. Because the data address associated with the latency instruction depends on the instructions in the slice, a group of one or more instructions is identified as the slice.
In general and as described in detail below, the slice identifier 230 starts with identifying an innermost loop associated with the latency instruction. While the methods and apparatus disclosed herein are particularly well suited to identify the innermost loop, persons of ordinary skill in the art will appreciate that the teachings of the disclosure may be applied to identify an outer loop associated with the latency instruction as well.
Within the innermost loop, the slice identifier 230 identifies a base register (i.e., the register of the first instruction of the slice), and tracks backward to identify other registers associated with the base register until it identifies a register that holds an induction variable (e.g., i=i+1), a recurrent load (e.g., p=p→next), or a loop-invariant register. An induction variable increments or decrements by a constant every time the variable changes value. A recurrent load, in contrast, produces a data address that is consumed by future instances of that load itself; recurrent loads are typically used as induction variables in loops. As noted above, the slice identifier 230 also stops tracking for other registers when it identifies an instruction associated with a register that is loop invariant (i.e., constant) within the loop.
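The backward-tracking step described above can be sketched as follows. This is an illustrative model only, not code from the disclosure: the dictionary-based instruction representation, the operation names "induction" and "recurrent_load", and the function name identify_slice are all hypothetical.

```python
def identify_slice(loop_body, base_reg, loop_invariant_regs):
    """Track backward from base_reg, collecting the instructions that
    produce its value, until an induction variable, a recurrent load,
    or a loop-invariant register terminates the search (a sketch)."""
    slice_instrs = []
    worklist = [base_reg]
    seen = set()
    while worklist:
        reg = worklist.pop()
        if reg in seen or reg in loop_invariant_regs:
            continue  # loop-invariant registers end the backward search
        seen.add(reg)
        for instr in reversed(loop_body):
            if instr["dest"] == reg:
                slice_instrs.append(instr)
                # i = i + 1 or p = p->next terminates the slice
                if instr["op"] not in ("induction", "recurrent_load"):
                    worklist.extend(instr["srcs"])
                break
    return list(reversed(slice_instrs))
```

For a loop body in which R40 is computed from a load through R30, which in turn is addressed by an induction register R20, the sketch would return the three address-generating instructions in program order, stopping at the induction variable.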
The slice of instructions may be pre-executed a number of iterations ahead to compensate for stall cycles associated with the cache. That is, the induction variable or the recurrent load of the loop may be adjusted to include a pre-execution distance so that the slice of instructions is pre-executed ahead of the main loop. As an example, for a latency instruction associated with a cache having two stall cycles, the induction variable or the recurrent load may be set so that the slice of instructions is pre-executed two iterations ahead. The pre-execution distance may be pre-set and/or calculated to compensate for the stall cycles.
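One plausible way to calculate such a pre-execution distance, assuming a simple model in which the distance must cover the cache-miss latency measured in loop iterations, is sketched below; the function name and the cost model are assumptions, not taken from the disclosure.

```python
import math

def pre_execution_distance(miss_latency_cycles, loop_body_cycles):
    """Number of iterations to run the slice ahead of the main loop so
    the pre-fetched line arrives before it is needed (assumed model:
    distance covers the miss latency, rounded up to whole iterations)."""
    return math.ceil(miss_latency_cycles / loop_body_cycles)
```

Under this model, a 200-cycle miss hidden inside a 40-cycle loop body would call for a distance of five iterations, matching the example distance used later in this description.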
The slot identifier 240 is configured to identify computation resources available to pre-execute the slice of instructions responsible for the latency. In particular, the slot identifier 240 identifies one or more instruction slots within the original code 210 where code configured to execute the slice of instructions (i.e., pre-execution code) may be inserted, as described in detail below. For example, the original code 210 may include “no ops” (i.e., instructions that specify no operation), which serve as placeholders that may be replaced by the pre-execution code. Alternatively, the original code 210 may include instruction slots in dynamic form (e.g., stalled cycles) rather than in static form such as explicit “no ops.” The compiler 260 is configured to identify the instruction slots in dynamic form within the original code 210.
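The static-slot case (explicit "no ops" replaced by pre-execution instructions) can be sketched as a simple substitution pass; the list-of-strings instruction encoding and the function name are illustrative assumptions.

```python
def insert_in_slots(code, pre_exec_code):
    """Replace explicit no-op placeholder slots with pre-execution
    instructions, in order, leaving all original instructions and any
    surplus no-ops untouched (a sketch of the static-slot case)."""
    out, pending = [], list(pre_exec_code)
    for instr in code:
        if instr == "nop" and pending:
            out.append(pending.pop(0))  # fill the placeholder slot
        else:
            out.append(instr)
    return out
```

The dynamic-slot case (stalled cycles) would instead require the compiler's pipeline model to find issue slots, which is beyond this sketch.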
The code generator 250 is configured to generate the pre-execution code, the goal of which is to reduce latency associated with cache misses. In particular, the pre-execution code may include instructions that utilize different registers than the original code 210 to avoid corrupting register values (e.g., data addresses) in registers associated with the original code 210. Based on whether the result of a load instruction in the slice is required to continue the pre-execution, a speculative load (e.g., ld.s) or a pre-fetch (e.g., lfetch) instruction corresponding to that load instruction may be generated in the pre-execution code as described in detail below. In general, the pre-execution code produced by the code generator 250 is inserted into the instruction slots identified by the slot identifier 240 so that the compiler 260 may pre-execute the latency instruction on a single thread.
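The choice between a speculative load and a plain pre-fetch reduces to a dependence test: if later slice instructions consume the load's result, the value must actually be produced (ld.s); otherwise only the cache fill matters (lfetch). A minimal sketch, with an assumed representation in which slice_uses is the set of registers read by subsequent slice instructions:

```python
def choose_pre_execution_op(load_dest, slice_uses):
    """Pick the pre-execution form of a slice load: a speculative load
    (ld.s) when its result feeds later slice instructions, else a plain
    pre-fetch (lfetch) that only warms the cache (a sketch)."""
    return "ld.s" if load_dest in slice_uses else "lfetch"
```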
In the example of
As noted above, the original set of code 300 includes a plurality of no ops 305, 315, 325, and 335. The no ops serve as placeholders within the original set of code 300 where the pre-execution code (i.e., code configured to execute the slice of instructions) may be inserted. In the example of
The code generator 250 generates either a speculative load (i.e., ld.s) or a pre-fetch (i.e., lfetch) corresponding to each load instruction based on whether the load result of that load instruction is required to continue the pre-execution of the latency instruction 330. For example, instruction 430 (i.e., lfetch [R41]) is generated as a pre-fetch instruction to correspond to the load instruction 330 (i.e., ld [R40]) because the value of register R41 is not dependent on the load result of the instruction 430 (i.e., the data address associated with register R41 is simply loaded). In another example, instruction 410 (i.e., R31=ld.s [R21]) is generated as a speculative load instruction to correspond to the load instruction 310 (i.e., R30=ld [R20]) because the load result of the load instruction 410 (i.e., register R31) is required to continue the pre-execution. That is, the value of register R31 is required to determine the value of register R41 in the instruction 420 (i.e., instruction 420 is dependent on instruction 410).
Further, the induction variable or the recurrent load includes a pre-execution distance (i.e., a number of iterations) to avoid the cache miss latency of the load instruction 330. Accordingly, the value of register R41 is determined before it is needed. In instruction 440 (i.e., R21=R20+8*5), for example, the pre-execution distance is five. That is, the induction step of eight is multiplied by five so that the pre-execution code (i.e., code to execute instructions 410, 420, 430, and 440) runs five iterations ahead of when the value of register R41 is needed. As a result, the compiler 260 may pre-fetch data associated with cache misses on a single thread.
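The distance arithmetic in instruction 440 can be checked with a small model: advancing the pre-execution address register by the stride times the distance makes it coincide with the address the main loop will use five iterations later. The function names and the starting address are assumptions for illustration.

```python
STRIDE = 8    # induction step of R20 in the example (R20 = R20 + 8)
DISTANCE = 5  # pre-execution distance, in iterations

def main_address(r20_start, iteration):
    # Address the main loop's load uses at a given iteration.
    return r20_start + STRIDE * iteration

def pre_exec_address(r20_current):
    # R21 = R20 + 8*5: the address the pre-execution code touches now,
    # which the main loop will need five iterations from now.
    return r20_current + STRIDE * DISTANCE
```

So the lfetch issued through R21 at iteration i warms the line that the main loop's ld [R40] path will reference at iteration i+5.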
Machine readable instructions that may be executed by the processor system 100 (e.g., via the processor 120) are illustrated in
In the example of
The processor 120 also identifies one or more instructions configured to generate a data address associated with the latency instruction (i.e., a slice of instructions) (block 520). In the slice of instructions, the processor 120 includes instructions within a loop associated with the latency instruction until an instruction associated with an induction variable (e.g., i=i+1) or a recurrent load (e.g., p=p→next) is identified. Alternatively, the processor 120 includes instructions from within the loop until an instruction associated with a loop invariant register (i.e., a register that is constant within the loop) is identified.
The processor 120 then identifies at least one instruction slot within the loop to insert code configured to execute the slice of instructions (i.e., pre-execution code) (block 530). For example, the processor 120 may identify no ops within the loop and replace the no ops with the pre-execution code. The processor 120 generates the pre-execution code within the at least one instruction slot (block 540). In particular, the processor 120 generates code to include instructions with different registers so that register values (e.g., data addresses) in registers associated with the original set of code are not corrupted. Further, a speculative load (e.g., ld.s) or a pre-fetch (e.g., lfetch) instruction corresponding to a load instruction may be generated based on whether the load result of that load instruction in the slice is required to continue the pre-execution. Thus, the processor 120 may pre-fetch the data address associated with the latency instruction on a single thread.
Although certain example methods, apparatus, and articles of manufacture have been described herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus, and articles of manufacture fairly falling within the scope of the appended claims either literally or under the doctrine of equivalents.