Embodiments of this invention relate to the field of processors and, in particular, to the scheduling of instructions in a processor.
Advances in microprocessor technology helped pave the way for the development of network processors (NPs), which are designed specifically to meet the requirements of next-generation network equipment. In order to address the unique challenges of network processing at high speeds, i.e., where inter-arrival times between packets may be less than a single memory access latency, modern network processors generally have asynchronous (non-blocking) memory access operations, so that other computation work can be overlapped with the latency of the memory accesses.
For instance, in the Intel® IXA NP family of network processors (IXP), every memory access instruction is non-blocking and is associated with an event signal; once the memory access is completed, the associated signal is asserted by the hardware. That is, when a memory access instruction is issued, other instructions following it can continue to run while the memory access is in flight, until a wait instruction (for the associated signal) blocks the execution. When the associated signal is asserted, the wait instruction clears the signal and returns to execution. Consequently, all the instructions between the memory access instruction and the wait instruction can be overlapped with the latency of the memory access, as illustrated in
Instructions that depend on the completion of the particular memory access, however, should not be executed until the associated signal is asserted, and cannot be overlapped with the latency of the memory access. For instance, an instruction that uses the result of a load instruction has to wait for the completion of the load, as illustrated in
Therefore, in order to increase the overlap of the latency, the memory access instructions and their dependent instructions should be scheduled as far apart as possible. Some conventional scheduling technologies to accomplish this include list scheduling, super-block scheduling and trace scheduling.
The present invention is illustrated by way of example and not intended to be limited by the figures of the accompanying drawings.
a illustrates an asynchronous memory access operation.
b illustrates an event signal and overlap of latency.
a illustrates a load instruction and its dependent instruction.
b illustrates a store instruction and its dependent instruction.
a illustrates one embodiment of an example program.
b illustrates one embodiment of a transformation of the program illustrated in
c illustrates one embodiment of properties for program correctness.
a illustrates one embodiment of an example of a broken property when a scheduler sinks a consume s across a depend s.
b illustrates one embodiment of an example of a broken property when the scheduler sinks a consume s across a produce s.
In the following description, numerous specific details are set forth such as examples of specific systems, techniques, components, etc. in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that these specific details need not be employed to practice the present invention. In other instances, well known components or methods have not been described in detail in order to avoid unnecessarily obscuring the present invention.
Embodiments of the present invention include various steps, which will be described below. The steps of the present invention may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware and software.
Embodiments of the present invention may be provided as a computer program product, or software, that may include a machine-readable medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to embodiments of the present invention. A machine-readable medium includes any mechanism for storing or transmitting information in a form (e.g., software, processing application) readable by a machine (e.g., a computer). The machine-readable medium may include, but is not limited to, magnetic storage medium (e.g., floppy diskette); optical storage medium (e.g., CD-ROM); magneto-optical storage medium; read only memory (ROM); random access memory (RAM); erasable programmable memory (e.g., EPROM and EEPROM); flash memory; electrical, optical, acoustical or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.); or other type of medium suitable for storing electronic instructions.
In one embodiment, instructions in a computer program may be categorized into four classes for signal scheduling as follows: produce signal (s) instruction, consume s instruction, depend s instruction, and ignore instruction. The produce s instruction may be composed of an instruction that generates the signal s, such as a memory access instruction with signal s. Another instruction, send_signal, may be used to generate the signal as well. The consume s instruction may be composed of a wait instruction that consumes the signal s; that is, it waits for the signal s and clears the signal once it is asserted. The depend s instruction may be composed of an instruction that depends on the completion of a memory access and, therefore, on the associated signal s. The ignore instruction may be composed of an instruction that does not use or depend on signals and is ignored in the signal scheduling.
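The four-way classification above can be sketched as follows. This is a minimal illustration only: the `Instr` record, its field names, and the opcode strings other than `send_signal` are hypothetical and are not taken from any particular instruction set.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    PRODUCE = "produce"
    CONSUME = "consume"
    DEPEND = "depend"
    IGNORE = "ignore"


@dataclass(frozen=True)
class Instr:
    # Hypothetical instruction record used only for this sketch.
    opcode: str
    signal: Optional[str] = None          # signal generated or awaited
    uses_result_of: Optional[str] = None  # signal of the access this instruction depends on


# Assumed opcodes that generate a signal (memory accesses and send_signal).
PRODUCERS = {"load", "store", "send_signal"}


def classify(instr: Instr, s: str) -> Kind:
    """Classify one instruction with respect to signal s."""
    if instr.opcode in PRODUCERS and instr.signal == s:
        return Kind.PRODUCE
    if instr.opcode == "wait" and instr.signal == s:
        return Kind.CONSUME
    if instr.uses_result_of == s:
        return Kind.DEPEND
    return Kind.IGNORE
```

Note that the classification is per signal: an instruction that produces a different signal is treated as an ignore instruction when scheduling s.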
A method and apparatus for globally scheduling program instructions based on trace information is described. In one embodiment, a compiler selects a trace (a sequence of basic blocks) in a program, for example, either based on heuristics or actual profiling information, and schedules consume s instructions in the trace as if in a basic block. In addition, compensation codes may be used in the off-trace codes, so as to ensure the correctness of the program.
Although the access operations are discussed herein at times with particular reference to a memory access, such is only for ease of discussion purposes. It should be noted that in alternative embodiments, other types of access operations may be performed, for example, I/O access operations such as I/O reads and writes.
a illustrates an example program, where the selected trace is shown in bold lines. For scheduling, the instructions in the example program 300 of
c illustrates one embodiment of properties for program correctness. In one embodiment, a program may be guaranteed to be correct (in terms of the hardware properties of the event signal) if and only if the following properties hold. In any path from a consume s instruction to a consume s instruction, there is a produce s instruction (property 391). Once a signal s is consumed, it is automatically cleared by the hardware. Therefore, the signal has to be produced before it can be consumed again.
In any path from a produce s instruction to a produce s instruction, there is a consume s instruction (property 392). Once a signal is asserted by the hardware, it remains so until it is cleared. Therefore, to avoid ambiguity, the signal has to be consumed before it can be produced again.
In any path from a produce s instruction that is a memory access instruction to a depend s instruction, there is a consume s instruction (property 393). This is to guarantee that the dependent instructions are issued after the completion of the memory accesses.
In any path from the source of the program to a consume s instruction, there is a produce s instruction (property 394). A consume s instruction blocks the execution until the signal s is asserted by the hardware. Therefore, the signal has to be produced before it can ever be consumed. In addition, if an artificial consume s instruction is inserted at the beginning of a program, this is simply a special form of property 391.
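The four properties can be checked for one signal with a simple graph search, as in the sketch below. The representation is an assumption made for illustration: `succ` maps each node of a control flow graph to its successors, `kind` maps each node to one of the strings "produce", "consume", "depend" or "ignore", and the function names are hypothetical.

```python
def reaches_unblocked(succ, kind, frontier, target, blocker):
    """True if a target node is reachable from frontier without crossing a blocker."""
    stack, seen = list(frontier), set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        if kind[n] == blocker:
            continue              # this path is covered; do not look past it
        if kind[n] == target:
            return True           # found a path with no blocker on it
        stack.extend(succ[n])
    return False


def check_properties(succ, kind, entry):
    """Return {property number: holds?} for a single signal s."""
    def holds(start, target, blocker):
        return not any(
            reaches_unblocked(succ, kind, succ[n], target, blocker)
            for n in succ if kind[n] == start
        )

    return {
        391: holds("consume", "consume", "produce"),
        392: holds("produce", "produce", "consume"),
        393: holds("produce", "depend", "consume"),
        # Property 394 is property 391 with the program source acting as a consume.
        394: not reaches_unblocked(succ, kind, [entry], "consume", "produce"),
    }
```

For a straight-line program produce, ignore, consume, depend, all four entries are true; if the depend is placed before the consume, only property 393 fails.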
In step 410, consume s instructions (e.g., a wait instruction) are scheduled as late as possible in the trace, so long as the above four properties 391-394 in the given trace are satisfied. It is apparent that a consume s instruction cannot sink across a depend s instruction or a produce s instruction in the trace during the scheduling, as illustrated in
Therefore, the scheduler sinks the consume s instruction along the trace, until it reaches a depend s instruction or a produce s instruction. If there are no such instructions in the trace, the consume s instruction is moved to the end of the trace. For instance, the example program 301 of
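The sinking rule described above can be sketched on a trace that has been flattened into a list. This is an illustration under stated assumptions, not the scheduler itself: each instruction is modeled as a hypothetical `(kind, signal)` tuple, and only signal constraints are considered (ordinary data dependences are ignored).

```python
def sink_consumes(trace):
    """Move each consume s as late as possible along a linear trace.

    A consume s stops immediately before the next produce s or depend s;
    if there is none, it moves to the end of the trace.
    """
    out = list(trace)
    # Walk backward so that moving one consume never shifts the
    # positions of the instructions still to be visited.
    for i in range(len(out) - 1, -1, -1):
        kind, sig = out[i]
        if kind != "consume":
            continue
        j = i + 1
        while j < len(out) and not (
            out[j][0] in ("produce", "depend") and out[j][1] == sig
        ):
            j += 1
        # After the pop, the blocking instruction sits at index j - 1,
        # so inserting there places the consume directly before it.
        out.insert(j - 1, out.pop(i))
    return out
```

Instructions on other signals do not block the move, which is why the comparison checks both the class and the signal name.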
In this embodiment, it is guaranteed that the above four properties 391-394 are satisfied in the trace after the first step 410 of
GEN[n]={s|instruction n is a produce s instruction}
KILL[n]={s|instruction n is a consume s or depend s instruction}
After the reaching information for each signal s is computed, steps 720 and 730 introduce a consume s instruction immediately before any produce s or depend s instruction which signal s may reach, so as to satisfy properties 392 and 393. As those two properties are already satisfied in the given trace, extra consume s instructions are only needed in the off-trace codes.
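A minimal sketch of the reaching computation and of the placement decision follows. It assumes a forward analysis that merges predecessors by union (suggested by the phrase "may reach") and the conventional transfer function OUT[n] = GEN[n] ∪ (IN[n] − KILL[n]); the graph encoding and function names are hypothetical.

```python
def reaching_in(succ, instr):
    """Forward 'may reach' analysis; returns IN[n] as a set of signals.

    succ:  node -> list of successor nodes
    instr: node -> (kind, signal), kind in produce/consume/depend/ignore
    """
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for m in targets:
            pred[m].append(n)
    gen = {n: {s} if k == "produce" else set() for n, (k, s) in instr.items()}
    kill = {n: {s} if k in ("consume", "depend") else set()
            for n, (k, s) in instr.items()}
    IN = {n: set() for n in succ}
    OUT = {n: set(gen[n]) for n in succ}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for n in succ:
            new_in = set().union(*(OUT[p] for p in pred[n]))
            new_out = gen[n] | (new_in - kill[n])
            if new_in != IN[n] or new_out != OUT[n]:
                IN[n], OUT[n] = new_in, new_out
                changed = True
    return IN


def needs_consume_before(succ, instr):
    """Nodes (produce s or depend s) that signal s may reach unconsumed."""
    IN = reaching_in(succ, instr)
    return [n for n, (k, s) in instr.items()
            if k in ("produce", "depend") and s in IN[n]]
```

In a graph where a consume s has been sunk below a branch, a depend s on the off-trace side of that branch is reported, which matches the observation that extra consume s instructions are needed only in the off-trace code.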
In step 740, the anticipation information for each signal s is computed using a backward conjunctive dataflow analysis. For each instruction n, the dataflow equations are as follows:
GEN[n]={s|instruction n is a consume s instruction}
KILL[n]={s|instruction n is a produce s or depend s instruction}
After the anticipation information for each signal s is computed, step 750 deletes any consume s instructions immediately after which signal s is anticipated. Hence, all the redundant consume s instructions are eliminated from the program.
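The backward conjunctive analysis and the deletion test can be sketched as below. The sketch assumes the conventional form IN[n] = GEN[n] ∪ (OUT[n] − KILL[n]) with OUT[n] taken as the intersection over successors, an empty OUT at exit nodes, and initialization to the full signal set; the graph encoding and function names are hypothetical.

```python
def anticipated_out(succ, instr):
    """Backward 'must' analysis; returns OUT[n], the signals anticipated after n."""
    signals = {s for _, s in instr.values() if s is not None}
    gen = {n: {s} if k == "consume" else set() for n, (k, s) in instr.items()}
    kill = {n: {s} if k in ("produce", "depend") else set()
            for n, (k, s) in instr.items()}
    # Start from the full set so that intersection converges downward.
    IN = {n: set(signals) for n in succ}
    OUT = {n: set(signals) if succ[n] else set() for n in succ}
    changed = True
    while changed:
        changed = False
        for n in succ:
            if succ[n]:
                new_out = set.intersection(*(IN[m] for m in succ[n]))
            else:
                new_out = set()         # nothing is anticipated at program exit
            new_in = gen[n] | (new_out - kill[n])
            if new_in != IN[n] or new_out != OUT[n]:
                IN[n], OUT[n] = new_in, new_out
                changed = True
    return OUT


def redundant_consumes(succ, instr):
    """Consume s nodes immediately after which s is anticipated on every path."""
    OUT = anticipated_out(succ, instr)
    return [n for n, (k, s) in instr.items() if k == "consume" and s in OUT[n]]
```

When a consume s is followed on every outgoing path by another consume s before any produce s or depend s, the earlier one is reported as redundant, since the later ones already perform the wait.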
For instance, after step 750, the example program 601 in
Once such a path T is found, in step 930, the method tries to find an edge (c3, c4) in the path T such that (1) any path from a produce s instruction to an edge tail node (c3) contains a consume s instruction, and (2) any path from the edge header node (c4) to a produce s instruction contains a consume s instruction.
It can be shown as follows that such an edge (c3, c4) exists in the program, as long as properties 391 and 392 are satisfied in the program:
Assume that, for path T=(c1, n1, n2, . . . , nk, c2), there is no such edge.
Properties 392 and 393 are satisfied before step 930. In this step 930, additional produce s instructions are only inserted by splitting such an edge in step 940. Hence, it is guaranteed that the properties 392 and 393 are always satisfied in step 930, and step 930 can always find such an edge.
The method keeps searching, in step 920, for a path in the program from one consume s instruction (c1) to another consume s instruction (c2) that does not pass any produce s instruction. If no such paths are found, it is guaranteed that the properties 391 and 394 are satisfied. No more compensation codes are required, and step 950 simply removes the artificial consume s instruction previously inserted in step 910. For instance, the example program 801 in
Compiler 1110 may be coupled to a memory 1120 used to store the object code 1115 generated by the compiler. In one embodiment, memory 1120 may be a FLASH memory. Alternatively, other types of memories may be used, for example, a random access memory (RAM) or read only memory (ROM). The object code 1115 that is stored on memory 1120 may be loaded into processing device 1130. Processing device 1130 may execute instructions based on the object code 1115 loaded thereon from memory 1120.
Processing device 1130 may include one or more processors. In one embodiment, for example, processing device 1130 may be a network processor having multiple processors including a core unit and multiple microengines. In one particular embodiment, processing device 1130 may be one of the network processors in the Intel® IXA NP family of network processors. Alternatively, processing device 1130 may be another type of network processor.
In another embodiment, processing device 1130 may represent another type of processing device such as a general purpose processor (e.g., central processing unit (CPU), microprocessor), a special purpose processor (e.g., a digital signal processor (DSP)), an application specific integrated circuit (ASIC), or another type of processing device.
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.