The embodiments of the invention relate generally to compilers and, more specifically, relate to elimination of redundant store-load instructions of a stack-based language by a compiler.
Stack-based languages are used as general and special-purpose programming languages. They are popular as intermediate languages for compilers, and they are popular as machine-independent executable program representations. Examples of stack-based languages include, but are not limited to, Forth, PostScript, Java bytecode, and Microsoft Intermediate Language (MSIL).
On a register-based machine, a pair of store and load instructions that access the same local variable that does not live out of the current basic block of code may be removed if the pair does not violate any data dependencies. A basic block includes a segment of machine code with a single control flow entry and a single exit. A data dependency occurs when an instruction depends on the results of a previous instruction. On a stack-based machine, the same pair of store and load instructions may not be so easily removed. Without considering the instruction order, the simple elimination of store and load instructions would affect the stack balance because the order of instructions running on a stack-based machine is implicit to the data dependence of data elements in the stack.
In the case of a stack-based machine, translation from register-based code to stack-based code might produce many such redundant store and load instructions. To make the code smaller and more efficient to translate or execute, these redundant store and load instructions should be removed as much as possible. However, as noted above, arbitrary elimination of such redundant store-load pairs is not an option for stack-based languages.
The invention will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the invention. The drawings, however, should not be taken to limit the invention to the specific embodiments, but are for explanation and understanding only.
A method and apparatus to eliminate redundant store-load instructions based on stack location insensitive sequences are described. Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
In the following description, numerous details are set forth. It will be apparent, however, to one skilled in the art, that the embodiments of the invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the invention.
Embodiments of the invention introduce a framework to remove redundant store and load instructions in order to optimize the code of a stack-based language. Embodiments of the invention are based on stack-based code, rather than register-based code, and may be applied on any stack-based code optimization.
In order to remove the redundant store-load instructions, embodiments of the invention split code sequences into pieces, and then reorder those pieces while keeping the stack balance unchanged and not violating any data dependencies within the sequence. The concept of a Stack Location Insensitive Sequence (SLIS) is introduced in order to perform the redundant store-load instruction optimization of embodiments of the invention. Through detecting SLISs, redundant store-load pairs can be located in order to reduce code size and improve system performance. First, a SLIS is described.
Stack Location Insensitive Sequence (SLIS)
An instruction running on a stack machine might change the stack state by pushing an element onto or popping an element off of the stack. The term StackVariation is defined herein as the total stack depth variation caused by a segment of sequential instructions. For example, StackVariation (STLOC)=−1, and StackVariation (LDLOC)=1. STLOC stands for a store to a local variable, and LDLOC stands for a load from a local variable.
Within a single basic block of code, a segment of sequential instructions, S, is defined as a SLIS if it meets the following two requirements:
For example, in
In order to perform embodiments of the invention, SLISs within an instruction sequence should be identified. Referring to
At processing block 310, the StackVariation of X should be determined. For an instruction, X, StackVariation (X)=PUSH (X)−POP (X), where PUSH (X) is defined as the number of elements X pushes onto a stack, and POP (X) is defined as the number of elements X pops off of a stack. For example, for the instruction “add” in MSIL, PUSH (add)=1 and POP (add)=2. As a result, StackVariation (add)=−1. At processing block 320, X is added into instruction sequence S.
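The StackVariation computation described above can be sketched as follows. This is a minimal illustration only: the STACK_EFFECT table, the helper names, and the simplified mnemonics are assumptions made for the example and cover just a handful of MSIL-style opcodes.

```python
# Hypothetical (POP, PUSH) counts for a few MSIL-style opcodes.
# The table and all names below are assumptions made for illustration.
STACK_EFFECT = {
    "ldloc": (0, 1),  # load a local variable onto the stack
    "stloc": (1, 0),  # store the top of the stack into a local variable
    "ldc":   (0, 1),  # load a constant
    "add":   (2, 1),
    "mul":   (2, 1),
    "dup":   (1, 2),
    "pop":   (1, 0),
}

def opcode(instr):
    """Strip the operand suffix, e.g. 'ldloc.1' -> 'ldloc'."""
    return instr.lower().split(".")[0]

def pop_count(instr):
    return STACK_EFFECT[opcode(instr)][0]

def push_count(instr):
    return STACK_EFFECT[opcode(instr)][1]

def stack_variation(instrs):
    """Total stack depth variation of one instruction or a sequence."""
    if isinstance(instrs, str):
        instrs = [instrs]
    return sum(push_count(i) - pop_count(i) for i in instrs)

print(stack_variation("add"))                                       # -1
print(stack_variation(["ldloc.0", "ldloc.1", "add", "stloc.2"]))    # 0
```

The second call shows a sequence whose pushes and pops cancel, so its total StackVariation is zero.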
At decision block 330, a search forward is performed to observe the next instruction, Y, for the current SLIS candidate instruction sequence, S. The current StackVariation (S) is checked to determine if there are enough stack elements to equal POP (Y). If the stack elements meet the requirement of POP (Y), in other words, StackVariation (S)==POP (Y), then a SLIS has been found as shown in processing block 340.
At decision block 350, it is determined whether POP (Y)>StackVariation (S). If POP (Y)>StackVariation (S), then the method continues to processing block 360, where it is determined whether the previous instruction before instruction sequence S is within the instruction scope to be checked. If it is, then the method returns to processing block 320, where that previous instruction is added to instruction sequence S. If the previous instruction is not within the scope to be checked, then the process ends without a SLIS found.
If, at decision block 350, POP (Y)<StackVariation (S), then Y is put into S at processing block 370. Then, at decision block 380, it is determined whether the next instruction past S is within the instruction scope to be checked. If it is, then the method returns to decision block 330 and the next instruction, now instruction Y, is checked to determine whether POP (Y)==StackVariation (S). If the next instruction is not within the scope to be checked, then the process ends without a SLIS found.
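One literal reading of the search in blocks 310 through 380 may be sketched as follows. The function name find_slis, the EFFECT table, the scope bounds lo and hi, and the return convention (the index of the first instruction of S and the index of Y) are all assumptions of this sketch and are not taken from the description.

```python
# Hypothetical (POP, PUSH) table; names and mnemonics are assumptions.
EFFECT = {"ldloc": (0, 1), "stloc": (1, 0), "ldc": (0, 1),
          "add": (2, 1), "mul": (2, 1), "pop": (1, 0)}

def pops(instr):
    return EFFECT[instr.lower().split(".")[0]][0]

def variation(instrs):
    total = 0
    for instr in instrs:
        pop_n, push_n = EFFECT[instr.lower().split(".")[0]]
        total += push_n - pop_n
    return total

def find_slis(code, start, lo=0, hi=None):
    """Grow S = code[first:last + 1] around code[start] until the next
    instruction Y pops exactly StackVariation(S) elements.
    Returns (first, index_of_Y), or None if the scope [lo, hi) is exhausted."""
    hi = len(code) if hi is None else hi
    first = last = start
    while last + 1 < hi:                      # Y must be within scope
        y = last + 1
        sv = variation(code[first:last + 1])
        need = pops(code[y])
        if need == sv:                        # block 330 -> block 340
            return first, y
        if need > sv:                         # block 350 -> block 360
            if first - 1 < lo:
                return None                   # nothing left to prepend
            first -= 1                        # prepend previous instruction
        else:                                 # block 370: absorb Y into S
            last = y
    return None

code = ["ldloc.0", "ldloc.1", "add", "stloc.2", "ldloc.2"]
print(find_slis(code, 2))   # (0, 3)
```

Starting from the add instruction, the search extends S backward twice and then stops at the store. When Y is a store such as STLOC, which pops one element and pushes none, S followed by Y has a total StackVariation of zero.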
Significantly, a SLIS has the property of being able to be arbitrarily moved upward or downward within a basic block as long as it does not violate any data dependencies in the basic block. This is because a SLIS keeps the stack state unchanged before and after it is executed.
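The movability property can be checked on a toy example. The following sketch assumes a tiny hypothetical interpreter (run) with simplified mnemonics such as ldc.2 for loading a constant; neither is part of the description. It executes two stack-neutral sequences that have no data dependence on each other, in both orders, and compares the final machine state.

```python
def run(code, n_locals=4):
    """Execute simplified stack instructions; return (stack, locals)."""
    stack, local = [], [0] * n_locals
    for instr in code:
        op, _, arg = instr.partition(".")
        if op == "ldc":
            stack.append(int(arg))
        elif op == "ldloc":
            stack.append(local[int(arg)])
        elif op == "stloc":
            local[int(arg)] = stack.pop()
        elif op == "add":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        else:
            raise ValueError("unsupported instruction: " + instr)
    return stack, local

# Each sequence leaves the stack depth unchanged and touches its own local.
slis_a = ["ldc.2", "ldc.3", "add", "stloc.0"]
slis_b = ["ldc.7", "stloc.1"]
tail = ["ldloc.0", "ldloc.1", "add"]

original = slis_a + slis_b + tail
moved = slis_b + slis_a + tail      # slis_b moved upward past slis_a

print(run(original))                # ([12], [5, 7, 0, 0])
print(run(original) == run(moved))  # True
```

If slis_b instead read local 0, the two sequences would have a data dependence and the exchange would no longer be safe, which is why the dependence check accompanies the stack-balance property.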
Method to Remove Redundant Load/Store Instructions Based on SLIS
As a SLIS has the property of being able to be moved up or down without affecting stack balance, it plays a key role in optimizing stack-based code. One embodiment of a method to eliminate redundant store-load instructions by utilizing SLISs is presented below.
At processing block 410, the data dependencies within a sequence of instructions are determined. Then, at processing block 420, a redundant store-load instruction pair within the sequence of instructions is identified. At processing block 430, one or more stack location insensitive sequences (SLISs) between the redundant store-load instruction pair are identified that encompass one or more instructions that have a data dependency with an instruction prior to the store instruction of the store-load instruction pair.
At processing block 440, the one or more SLISs of the instruction sequence that includes the redundant store-load pair are reordered, based on the data dependencies, so that the two instructions of the redundant store-load pair are immediately adjacent to each other. Finally, at processing block 450, the redundant store-load pair is removed from the instruction sequence.
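The final removal step of block 450 may be sketched as below for a sequence in which the store and the load are already adjacent. The function and helper names are hypothetical. The sketch also assumes two safety checks before deleting a pair: the local variable must not be live out of the basic block, and it must not be read again before it is next written.

```python
def parse(instr):
    """Split 'stloc.1' into ('stloc', '1'); operand is '' if absent."""
    op, _, var = instr.lower().partition(".")
    return op, var

def read_before_write(code, start, var):
    """True if local `var` is loaded at or after `start` before being stored."""
    for instr in code[start:]:
        op, v = parse(instr)
        if v == var and op == "ldloc":
            return True
        if v == var and op == "stloc":
            return False
    return False

def remove_adjacent_store_load(code, live_out=frozenset()):
    """Drop 'stloc.n' immediately followed by 'ldloc.n' when local n is dead."""
    out, i = [], 0
    while i < len(code):
        op, var = parse(code[i])
        if (op == "stloc" and i + 1 < len(code)
                and parse(code[i + 1]) == ("ldloc", var)
                and var not in live_out
                and not read_before_write(code, i + 2, var)):
            i += 2          # the value simply stays on the stack
            continue
        out.append(code[i])
        i += 1
    return out

print(remove_adjacent_store_load(
    ["ldc.5", "stloc.1", "ldloc.1", "ldloc.0", "add"]))
# ['ldc.5', 'ldloc.0', 'add']
```

Removing the adjacent pair leaves the stack unchanged because the store pops the value that the load would immediately push back.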
The following description and figures describe a more detailed embodiment of the method 400 presented above. All of the descriptions below are within the scope of a basic block and MSIL bytecode is used to represent stack-based instructions. Of course, it should be understood that embodiments of the invention are not limited to such an implementation.
Firstly, the following symbols are defined,
One embodiment of the invention utilizes a predicate, StoreLoadPair (X, Y). It is true if:
This predicate implies that a store-load pair for which it returns ‘true’ can be optimized and has the fewest instructions between the store instruction and the load instruction. Those candidate redundant store-load pairs that make StoreLoadPair (X, Y) true should be identified and removed.
For example, assume the following instruction sequence pattern for a given StoreLoadPair (St, Ld): Z St {[Xi] [Yi], i=1 . . . n} Ld
The square bracket enclosing Xi means that Xi may or may not exist. The same applies to Yi. For example, there may be a sequence, Z St X1 Y1 Y2 Ld, in which X2 does not exist.
In the sequence pattern above, if there are some Xs and Ys between St and Ld, and the instruction sequence satisfies the following rules:
Rule (G) indicates Xi does not have data dependence with Yj if j<i, and simultaneously rule (D) indicates Yj is a SLIS. Therefore, the instructions in Yj may exchange position with the instructions in Xi, which does not change the stack balance. The transformation in (1) illustrates that all of Ys can be moved to the position that is subsequent to all Xs. According to the definition of the predicate StoreLoadPair, rules (A) and (F) indicate ZSt does not have data dependence with {Xi, i=1 . . . n}, and simultaneously rule (B) indicates ZSt is a SLIS. Therefore, ZSt can exchange position with {Xi, i=1 . . . n} without affecting the stack balance as seen in transformation (2).
Similarly, rule (D) indicates {Yi, i=1 . . . n} is also a SLIS, and rule (A) guarantees St does not have data dependences with the Ys. Therefore, the Ys may be moved to a position prior to St, as seen in transformation (3). Finally, the sequential St Ld pair may be eliminated safely, as seen in transformation (4). Notice that rules (C) and (E) are not applied in any of the transformations mentioned above, because they were used for detecting the instruction sequence pattern before the transformation.
In summation, embodiments of the invention may employ a method such as that described below:
(1): Analyze data dependences. A data dependence matrix may be constructed for all of the memory access instructions within a current basic block. The data dependences that could be analyzed include Read-Write (RW), Write-Read (WR), and Write-Write (WW). One skilled in the art will appreciate that any popular data dependence analysis method can be applied to determine the data dependencies within an instruction sequence at this point.
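As a minimal sketch of step (1), the following builds a dependence matrix for local-variable accesses only. The names access and dependence_matrix are hypothetical, and the labeling convention is an assumption: entry [i][j] for an earlier instruction i and a later instruction j is "WR" when i writes a local that j reads, "RW" when i reads a local that j writes, and "WW" when both write it.

```python
def access(instr):
    """Classify a local-variable access as ('R', var) or ('W', var)."""
    op, _, var = instr.lower().partition(".")
    if op == "ldloc":
        return "R", var
    if op == "stloc":
        return "W", var
    return None                      # not a memory access in this sketch

def dependence_matrix(code):
    """dep[i][j] is 'RW', 'WR', 'WW', or None for each pair with i < j."""
    n = len(code)
    dep = [[None] * n for _ in range(n)]
    for i in range(n):
        a = access(code[i])
        if a is None:
            continue
        for j in range(i + 1, n):
            b = access(code[j])
            if b is None or a[1] != b[1]:
                continue             # different locals never conflict here
            if a[0] == "R" and b[0] == "R":
                continue             # two reads impose no ordering
            dep[i][j] = a[0] + b[0]
    return dep

code = ["stloc.0", "ldloc.1", "ldloc.0", "stloc.1", "stloc.0"]
dep = dependence_matrix(code)
print(dep[0][2], dep[1][3], dep[0][4])   # WR RW WW
```

A compiler that already maintains dependence information from earlier passes could supply this matrix directly instead of recomputing it.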
(2): Determine all Store-Load Pairs. Each instruction within the current basic block is examined one by one. If a store instruction is found, then go forward to find a load instruction that could make the predicate StoreLoadPair for that store-load instruction pair true. For example, without considering data dependences, in the instruction sequence { . . . STLOC.1 . . . LDLOC.1 . . . STLOC.1 . . . LDLOC.1 . . . }, the pair including the first STLOC.1 and the first LDLOC.1 should be found first, rather than the pair with the first STLOC.1 and the second LDLOC.1. Then, a store-load pair list including all of the discovered store-load pair candidates within a basic block is constructed.
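The candidate search of step (2) may be sketched as follows, ignoring data dependences and the conditions of the StoreLoadPair predicate. The function name store_load_pairs is hypothetical, and stopping the forward scan at an intervening store to the same local variable is an assumption of this sketch.

```python
def store_load_pairs(code):
    """Pair each store with the nearest following load of the same local.
    Returns a list of (store_index, load_index) candidates."""
    pairs = []
    for i, instr in enumerate(code):
        op, _, var = instr.lower().partition(".")
        if op != "stloc":
            continue
        for j in range(i + 1, len(code)):
            op2, _, var2 = code[j].lower().partition(".")
            if var2 != var:
                continue             # unrelated instruction, keep scanning
            if op2 == "ldloc":
                pairs.append((i, j)) # nearest load wins
                break
            if op2 == "stloc":
                break                # local is overwritten before any load
    return pairs

code = ["STLOC.1", "add", "LDLOC.1", "STLOC.1", "LDLOC.0", "LDLOC.1"]
print(store_load_pairs(code))        # [(0, 2), (3, 5)]
```

As in the example in the text, the first STLOC.1 is paired with the first LDLOC.1 rather than the second.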
(3): Split instruction sequence into SLISs. A store-load pair that has not been analyzed yet is selected from the StoreLoadPair list constructed in (2). Then, an instruction sequence, Z, ending with a STLOC instruction is located, which makes {Z, STLOC} a SLIS. Finally, the instructions that are in between the store instruction and the load instruction are analyzed to identify other SLISs that enclose the instructions that have data dependencies with Z.
For example, as shown in
(4): Instruction reordering. All of the SLISs identified in (3) should be able to be moved while keeping the relative sequence unchanged. As shown in
After an instruction sequence has been reordered, any store-load pair that has not been analyzed yet may be analyzed by returning to (3) to continue splitting instruction sequences into SLISs for a new store-load pair. Once instruction reordering is finished, the method may be repeated from the beginning to find more store-load pairs that were, in the prior analysis, not considered removable.
Instructions (4) and (10) of basic block 720 are a candidate redundant store-load pair that may be removed. To remove this pair, the instructions in basic block 720 should be split into several parts according to the method described above, and the SLISs within the instruction sequence should be identified. Then, the instructions of basic block 720 may be reordered and the redundant store-load pair removed.
Alternative embodiments of the invention may utilize other optimizations based on identifying a SLIS. For example, an optimization to reduce code size by generating more duplicate (DUP) instructions through utilization of SLISs may be employed. Embodiments of the invention may also be applied as a redundant store-load optimization to any stack-based code product. For example, redundant store-load optimization of floating point stack code may be performed. One skilled in the art will appreciate the range of options for utilizing a SLIS to optimize code.
Embodiments of the invention may be implemented in a variety of compilers. For instance, implementation is possible in a static compiler or a just-in-time (JIT) compiler.
The time complexity of embodiments of the invention may be lower than that of other optimization methods. Ordinarily, data dependence information is kept up to date by other optimizations. Therefore, the optimization presented here need not compute it again and may utilize that data dependence information directly. Furthermore, the overhead penalty from repeated reordering is avoided by detecting the SLISs. The instruction pattern for each store-load pair is detected once by examining the instructions between a store and load instruction pair, thereby avoiding the repeated reordering.
Processor bus 1112, also known as the host bus or the front side bus, may be used to couple the processors 1102-1106 with the system interface 1114. Processor bus 1112 may include a control bus 1132, an address bus 1134, and a data bus 1136. The control bus 1132, the address bus 1134, and the data bus 1136 may be multidrop bi-directional buses, e.g., connected to three or more bus agents, as opposed to a point-to-point bus, which may be connected only between two bus agents.
System interface 1114 (or chipset) may be connected to the processor bus 1112 to interface other components of the system 1100 with the processor bus 1112. For example, system interface 1114 may include a memory controller 1118 for interfacing a main memory 1116 with the processor bus 1112. The main memory 1116 typically includes one or more memory cards and a control circuit (not shown). System interface 1114 may also include an input/output (I/O) interface 1120 to interface one or more I/O bridges or I/O devices with the processor bus 1112. For example, as illustrated, the I/O interface 1120 may interface an I/O bridge 1124 with the processor bus 1112. I/O bridge 1124 may operate as a bus bridge to interface between the system interface 1114 and an I/O bus 1126. One or more I/O controllers and/or I/O devices may be connected with the I/O bus 1126, such as I/O controller 1128 and I/O device 1130, as illustrated. I/O bus 1126 may include a peripheral component interconnect (PCI) bus or other type of I/O bus.
System 1100 may include a dynamic storage device, referred to as main memory 1116, or a random access memory (RAM) or other devices coupled to the processor bus 1112 for storing information and instructions to be executed by the processors 1102-1106. Main memory 1116 may also be used for storing temporary variables or other intermediate information during execution of instructions by the processors 1102-1106. System 1100 may include a read only memory (ROM) and/or other static storage device coupled to the processor bus 1112 for storing static information and instructions for the processors 1102-1106.
Main memory 1116 or dynamic storage device may include a magnetic disk or an optical disc for storing information and instructions. I/O device 1130 may include a display device (not shown), such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to an end user. For example, graphical and/or textual indications of installation status, time remaining in the trial period, and other information may be presented to the prospective purchaser on the display device. I/O device 1130 may also include an input device (not shown), such as an alphanumeric input device, including alphanumeric and other keys for communicating information and/or command selections to the processors 1102-1106. Another type of user input device includes cursor control, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to the processors 1102-1106 and for controlling cursor movement on the display device.
System 1100 may also include a communication device (not shown), such as a modem, a network interface card, or other well-known interface devices, such as those used for coupling to Ethernet, token ring, or other types of physical attachment for purposes of providing a communication link to support a local or wide area network, for example. Stated differently, the system 1100 may be coupled with a number of clients and/or servers via a conventional network infrastructure, such as a company's Intranet and/or the Internet, for example.
It is appreciated that a lesser or more equipped system than the example described above may be desirable for certain implementations. Therefore, the configuration of system 1100 may vary from implementation to implementation depending upon numerous factors, such as price constraints, performance requirements, technological improvements, and/or other circumstances.
It should be noted that, while the embodiments described herein may be performed under the control of a programmed processor, such as processors 1102-1106, in alternative embodiments, the embodiments may be fully or partially implemented by any programmable or hardcoded logic, such as field programmable gate arrays (FPGAs), transistor-transistor logic (TTL), or application specific integrated circuits (ASICs). Additionally, the embodiments of the invention may be performed by any combination of programmed general-purpose computer components and/or custom hardware components. Therefore, nothing disclosed herein should be construed as limiting the various embodiments of the invention to a particular embodiment wherein the recited embodiments may be performed by a specific combination of hardware components.
In the above description, numerous specific details such as logic implementations, opcodes, resource partitioning, resource sharing, and resource duplication implementations, types and interrelationships of system components, and logic partitioning/integration choices may be set forth in order to provide a more thorough understanding of various embodiments of the invention. It will be appreciated, however, by one skilled in the art that the embodiments of the invention may be practiced without such specific details, based on the disclosure provided. In other instances, control structures, gate level circuits and full software instruction sequences have not been shown in detail in order not to obscure the invention. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation.
The various embodiments of the invention set forth above may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor or a machine or logic circuits programmed with the instructions to perform the various embodiments. Alternatively, the various embodiments may be performed by a combination of hardware and software.
Various embodiments of the invention may be provided as a computer program product, which may include a machine-readable medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process according to various embodiments of the invention. The machine-readable medium may include, but is not limited to, floppy diskette, optical disk, compact disk-read-only memory (CD-ROM), magneto-optical disk, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical card, flash memory, or another type of media/machine-readable medium suitable for storing electronic instructions. Moreover, various embodiments of the invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
Similarly, it should be appreciated that in the foregoing description, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Whereas many alterations and modifications of the invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that any particular embodiment shown and described by way of illustration is in no way intended to be considered limiting. Therefore, references to details of various embodiments are not intended to limit the scope of the claims, which in themselves recite only those features regarded as the invention.