The present invention relates to the field of computers, communications, and integrated circuits.
The most advanced processors use multi-issue technology to improve performance. The front end of a multi-issue processor can provide multiple instructions to the processor core in one clock cycle. The multi-issue front end contains an instruction memory with sufficient bandwidth to provide a plurality of instructions in one clock cycle, and the instruction pointer (IP) can be advanced past a plurality of instructions at a time. The front end of the multi-issue processor can effectively handle fixed-length instructions, but the situation is complicated when handling variable-length instructions. A good solution is to convert the variable-length instructions into fixed-length micro-operations (μOps), which the processor front end then issues for execution. The number of μOps obtained by the conversion can differ from the number of instructions, since the length of the instructions varies. It is therefore difficult to establish a simple and clear relationship between an instruction address (IP) and a μOp address.
The above problem makes it difficult to locate the μOp address corresponding to a program entry point. For example, for the branch target of a branch instruction, the processor gives the instruction address (IP) rather than the μOp address. The prior-art solution is to align the address of the μOp corresponding to the program entry point to the boundary of the cache block which stores the μOp, rather than aligning a 2^n address with the block boundary. The μOps are stored into a μOp cache to be sent by the processor front end to the processor core for execution. The L1 cache 11 is used to store instructions, and its corresponding tag unit 10 is used to store the tag portion of the instruction address. The instruction converter 12 is used to convert instructions into micro-operations (μOps). A micro-operation cache (μOp cache) 14 is used to store the converted μOps, and the corresponding tag unit 13 is used to store the instruction tag and the offset, as well as the byte length of the instruction corresponding to the μOps stored in the μOp cache 14. Level 1 tag unit 10, L1 cache 11, tag unit 13, and μOp cache 14 are all addressed by the index portion of the instruction address. The processor core 28 produces an instruction address 18, and also a branch instruction address 47 which addresses the branch target buffer (BTB) 27. BTB 27 outputs a branch judgment signal 15 to control the selector 25. When the branch prediction signal 15 from BTB 27 is ‘0’ (which means no branching), selector 25 chooses instruction address 18; when the signal is ‘1’ (which means branching), selector 25 chooses the branch target instruction address 17 from the output of BTB 27. The instruction address 19 output by selector 25 is then sent to the tag unit 10, L1 cache 11, tag unit 13, and μOp cache 14.
According to the index part of instruction address 19, a set of contents is selected from both tag unit 13 and μOp cache 14. The tag portion and the offset of instruction address 19 are matched against the tag portion and the offset stored in all the ways of the content set read from tag unit 13. If there is a match, the output hit signal 16 controls the selector 26 to choose the plurality of μOps in the corresponding way of the content set output by the μOp cache 14. If no match succeeds, the output hit signal 16 controls the selector 26 to select the output of the instruction converter 12, which waits for the instruction address 19 to match in the level 1 tag unit 10; the plural instructions read from the L1 cache are converted into a plural number of μOps, stored in the μOp cache 14, and at the same time output by selector 26 to the processor core 28 for execution. The instruction address and instruction length corresponding to those μOps are also stored in the μOp tag unit 13. The byte length of the instruction, which corresponds to the plural μOps stored in the ways hit by tag unit 13, is also sent to processor core 28 via bus 29, thus allowing the instruction address adder to add the byte length and the original instruction address to obtain the address of the next instruction. In some microprocessors, the instruction address generator and BTB are combined into a separate branch unit, but the principle is the same as above, and therefore no further explanation is given.
The disadvantage of the above technique is that each instruction block in the L1 cache may correspond to a plurality of program entry points, and each program entry point occupies one way of the tag unit 13 and the μOp cache 14, so that the contents of the tag unit 13 and the μOp cache 14 become too fragmented. For example, suppose the tag corresponding to an instruction block containing 16 instructions is ‘T’, where the instructions starting at bytes ‘3’, ‘6’, ‘8’, ‘11’ and ‘15’ are all program entry points. The instruction block occupies only one way of the tag unit 10 to store the tag ‘T’ and only one way of the L1 cache 11 to store the corresponding instructions. However, the μOps obtained from the conversion of this instruction block occupy 5 ways in tag unit 13, respectively storing the tags and offsets ‘T3’, ‘T6’, ‘T8’, ‘T11’ and ‘T15’ (the locations of these 5 ways in tag unit 13 may be discontinuous). The complete sequences of μOps are stored into the corresponding 5 ways of the μOp cache 14, each starting from its program entry point and extending up to the full capacity of its way. If the μOps of an instruction cannot fit in the remaining capacity of a μOp block in a way, another way must be allocated. This caching organization causes duplication of the μOp tags in the tag unit 13, which creates a dilemma: a larger μOp cache 14 block size causes more duplication, reducing the effective capacity, while a smaller μOp cache block size causes severe fragmentation. These shortcomings mean that current processors using the above technology have a μOp cache capacity that is small relative to the L1 cache and that contains duplication, further reducing the effective capacity and resulting in a cache miss rate greater than about 20%. The μOp cache's high miss rate, the high latency of instruction conversion when a miss occurs, and the repeated conversion of the same instructions all contribute to the high power consumption and inefficiency of this type of processor. The same is true for other cache organizations such as the trace cache and the block cache.
This application discloses a method and system which directly solve one or more of the above problems, as well as other problems.
The present invention provides a multi-issue processor system comprising a front-end module and a back-end module, wherein the said front-end module further comprises: an instruction converter for converting instructions into μOps and generating mapping relationships between instruction addresses and μOp addresses; an L1 cache, used to store the converted μOps and to send a plurality of μOps to the back-end module for execution based on the instruction address sent by the back-end module; a tag unit, used to store the tag portion of the instruction address corresponding to the μOps in the L1 cache; and a mapping unit consisting of a storage unit and a logical operation unit, wherein the storage unit stores the mapping relationship between the μOp addresses in the L1 cache and the addresses of the instructions corresponding to those μOps, and the logical operation unit converts instruction addresses into μOp addresses or converts μOp addresses into instruction addresses according to the mapping relationship. The back-end module includes at least one processor core for executing the μOps sent by the front-end module and producing the next instruction address, which is sent to the front-end module.
The present invention also discloses a multi-issue processor method, wherein the following method is performed in the front-end module: converting instructions into μOps and generating a mapping relationship between the instruction addresses and the μOp addresses; storing the converted μOps in the L1 cache and outputting a plurality of μOps to the back-end module according to the instruction address sent from the back-end module; storing the tag portion of the instruction address corresponding to the μOps in the L1 cache; storing a mapping relationship between the addresses of the μOps in the L1 cache and the addresses of the instructions corresponding to those μOps; and converting the instruction addresses into μOp addresses or converting the μOp addresses into instruction addresses according to the mapping relationship. The back-end module executes the plurality of μOps sent by the front-end module and sends the next instruction address to the front-end module based on the execution result.
The present invention also provides a multi-issue processor system comprising a front-end module and a back-end module, wherein the back-end module includes at least one processor core for executing a plurality of instructions sent by the front-end module and generating the next instruction address for the front-end module. The front-end module further comprises: an L1 cache for storing instructions and outputting a plurality of instructions to the back-end module for execution according to the instruction address sent from the back-end module; a tag unit for storing the tag portion of the instruction address corresponding to the instructions in the L1 cache; an L2 cache for storing all instructions stored in the L1 cache, the branch target instructions of all branch instructions in the L1 cache, and the sequential next instruction block of each instruction block in the L1 cache; a scanner for examining instructions filled from the L2 cache to the L1 cache, or instructions converted as described above, extracting the corresponding instruction information and calculating the branch target addresses of the branch instructions; and a track table for storing the location information of all the instructions in the L1 cache, the branch target location information of the branch instructions, and the location information of the sequential next instruction block of each L1 instruction block. The said location information of the branch target or of the sequential next block is the location information of the corresponding instruction in the L1 cache if the branch target or the sequential next block is already stored in the L1 cache, and is the location information of the corresponding instruction stored in the L2 cache if the branch target is not yet stored in the L1 cache.
The present invention also provides a multi-issue processor method, wherein the back-end module sends the next instruction address to the front-end module by executing a plurality of instructions sent by the front-end module, and wherein the front-end module performs: storing instructions in the L1 cache and outputting a plurality of instructions to the back-end module for execution based on the instruction address sent from the back-end module; storing the tag portion of the instruction address corresponding to the instructions in the L1 cache; storing in the L2 cache all instructions stored in the L1 cache, the branch target instructions of all branch instructions in the L1 cache, and the sequential next instruction block of each instruction block in the L1 cache; scanning the instructions filled from the L2 cache to the L1 cache, or instructions converted by instruction conversion, extracting the corresponding instruction information and calculating the branch target addresses of the branch instructions; and storing in a track table the location information of all the instructions in the L1 cache, the branch target location information of the branch instructions, and the location information of the sequential next instruction block of each L1 instruction block. The said location information of the branch target or of the sequential next block is the location information of the corresponding instruction in the L1 cache if the branch target or the sequential next block is already stored in the L1 cache, and is the location information of the corresponding instruction stored in the L2 cache if the branch target is not yet stored in the L1 cache.
Other aspects of the invention may be understood and appreciated by those skilled in the art from the description, claims and drawings of the present invention.
The system and method of the present invention may provide a basic solution for the cache structure used by a variable-length-instruction multi-issue processor system. In a traditional variable-length-instruction processor, the address relationship between the instructions and the μOps is difficult to determine, and the number of μOps obtained by converting instruction blocks of a fixed byte length varies, resulting in low memory efficiency and a low hit rate in the cache system. According to the invention, the system and method establish a mapping relationship between the instruction addresses and the micro-operation addresses, so that an instruction address can be directly converted into a μOp address according to the mapping relationship and the required μOps read out of the cache accordingly, thus improving cache efficiency and hit rate.
The system and method of the present invention can also fill the instruction cache before the processor executes an instruction to avoid or sufficiently hide cache misses.
The system and the method of the invention also provide a branch instruction selection technique based on branch prediction bits, which avoids accessing the branch target buffer used in traditional branch prediction, thus not only saving hardware but also improving branch prediction efficiency.
In addition, the system and method of the present invention also provide a branch processing technique without performance loss: the system and method eliminate the branch penalty without employing branch prediction.
Other advantages and applications of the present invention will be apparent to those skilled in the art.
…μOp branches provided by both the instruction read buffer and the L1 cache simultaneously;
…μOps to a processor core at the same time;
…μOps to a processor core at the same time.
Reference will now be made in detail to exemplary embodiments of the invention, which are illustrated in the accompanying drawings. Features and merits of the present invention will be understood more clearly from the description and the claims. It should be noted that all the accompanying drawings use very simplified forms and imprecise proportions, only for the purpose of conveniently and clearly explaining the embodiments of this disclosure.
It is noted that, in order to clearly illustrate the contents of the present disclosure, multiple embodiments are provided to further interpret different implementations of this disclosure; these embodiments are illustrative rather than an exhaustive list of all possible implementations. In addition, for the sake of simplicity, contents mentioned in previous embodiments are often omitted in later embodiments. Therefore, the contents that are not mentioned in a later embodiment can be referred to in the previous embodiments.
Although this disclosure may be expanded using various forms of modifications and alterations, the specification also lists a number of specific embodiments to explain in detail. It should be understood that the purpose of the inventor is not to limit the disclosure to the specific embodiments described herein. On the contrary, the purpose of the inventor is to protect all the improvements, equivalent conversions, and modifications based on the spirit or scope defined by the claims of this disclosure. The same reference numbers may be used throughout the drawings to refer to the same or like parts.
In addition, some embodiments have been simplified in the present specification in order to provide a clearer picture of the technical solution of the present invention. It is to be understood that alterations to the structure, delays, clock cycle differences, and internal connections of these embodiments within the framework of the technical solution of the present invention are intended to be within the scope of the appended claims.
The method and system in this disclosure use an L1 cache aligned to 2^n address boundaries to store μOps, thereby avoiding the fragmentation and duplicate-storage dilemmas inherent in a μOp cache or other similar caches aligned with program entry points. Referring to
In this example, a block in the L1 cache corresponds to a block in the L2 cache; that is, an L1 cache block can accommodate all the μOps converted from all the instructions stored in a block of the L2 cache. In variable-length-instruction processor systems, an instruction often crosses the boundary of an instruction block, that is, the front and rear parts of the instruction are located in two instruction blocks. In this case, the latter part of the instruction that crosses the block boundary is classified as belonging to the instruction block that contains the first part of the instruction. Thus, all the μOps corresponding to instructions that cross an instruction block boundary are stored in the L1 cache block corresponding to the instruction block in which the first part of the instruction is located, and the first μOp in each L1 cache block corresponds to the first instruction of the corresponding L2 cache block. The index of the instruction pointer 19 (IP) is used to select a set from the L1 cache 24, the tag of the instruction address 19 is matched against the corresponding ways in the set, and the address mapper 23 converts the offset 51 of the instruction pointer 19 to the μOp offset address BNY 57 to select the corresponding plurality of μOps starting from BNY in the matched way. If the L1 cache match success signal 16 indicates "match success", the selector 26 selects the plural μOps output from the L1 cache 24. If the L1 cache match success signal 16 indicates "match unsuccessful", the L2 cache 21 is accessed according to the instruction pointer 19 in the usual way; that is, a set is selected according to the index of instruction pointer 19 and the tag of instruction address 19 is matched against the corresponding tags of the set, so that the desired instruction block is found in L2 cache 21. The instruction block output by the L2 cache 21 is converted to μOps by the instruction converter 12 and stored in L1 cache 24 while simultaneously being sent to the processor core 28 via selector 26. In this process, once the instruction converter 12 determines that the last instruction in the sub-block crosses the block boundary, it calculates the address of the next instruction block by adding the byte length of the instruction block to the current instruction block address, and sends the next block address to the level 2 tag unit 20 and the L2 cache 21 to acquire the corresponding L2 cache block and convert the latter half of the instruction that crosses the block boundary, so that it can convert all the instructions in the original L2 cache block to micro-operations, store them in L1 cache 24, and send them to the processor core 28 for execution. The L1 cache 24 supports reading consecutive μOps starting from any offset address within one block, which can be implemented by reading a whole μOp block from L1 cache 24 according to a block address and using a selector network or a shifter to select several consecutive μOps which begin at the address of BNY 57 and have a length specified by the read width 65. Alternatively, a fixed number of consecutive μOps starting from the BNY 57 can be sent from the L1 cache 24 at each clock cycle, and the read width 65 can be sent to the processor core 28 to determine which of those μOps are effective.
The address mapper 23 includes a memory unit and a logical operation unit. The rows of the memory unit in 23 correspond to the μOp blocks in the L1 cache 24 and are addressed by the index and tag of the instruction address 19 in the same manner as described above. Each row of the address mapper 23 stores the correspondence between the instructions in an instruction block in the L2 cache and the μOps in the corresponding μOp block in the L1 cache; for example: the fourth byte in the L2 cache sub-block is the start byte of an instruction and corresponds to the second μOp in the corresponding L1 cache block. In this embodiment, the instruction converter 12 records the start points of the instructions and of the corresponding μOps of each instruction. This recorded information is sent to the address mapper 23 via bus 59 and stored in the memory unit row corresponding to the L1 cache block that stores those μOps.
In addition, since the number of μOps corresponding to each variable-length instruction sub-block may differ, L1 cache memory space could be wasted if the L1 cache block size were determined according to the maximum possible number of μOps. In this case, it is possible to appropriately reduce the size of the μOp block and increase the number of μOp blocks, and to add a corresponding entry 39 for each μOp block to record the address information of the other μOp blocks which correspond to the same variable-length instruction block. Please refer to the following examples for the specific construction and operation.
In this embodiment, the instruction converter 12 marks the starting μOps of the instructions and the μOps corresponding to the branch instructions as ‘1’, and stores these records in the same order into the buffer 43 via bus 42. The counter 45 in the instruction converter 12 starts to count at the same time; its initial default value is the capacity of the L1 cache block, and each time a μOp is produced and stored in the buffer, the counter value is decremented by ‘1’. When all the instructions in the L2 instruction block (including instructions extending into the next instruction block but starting in the present L2 instruction block) have been converted to μOps, the instruction converter 12 sends all μOps in the buffer 43 to the L1 cache 24 via the bus 48. The μOps are stored, most significant bit (right) aligned, in an L1 cache block decided by the cache replacement logic in L1 cache 24. The corresponding tag portion of the instruction address is also saved into the entry in L1 tag unit 22 corresponding to the way and set of this L1 cache block. At the same time, the records corresponding to the instruction start addresses in the buffer 43 in converter 12 are stored in the row of address mapper 23 corresponding to that L1 cache block: the μOp start point record and the branch point record in the buffer are stored into entries 33 and 34 in address mapper 23 separately via bus 59, most significant bit (right) aligned; the value in counter 45 is also stored in the entry 37 of that row via bus 59, and the offset of the entry point is stored into entry 38 of that row via bus 59 as well.
Referring to
The output of the source array 54 is sent to the target array 55 for further processing. The target array 55 is also composed of selectors, each column of which is controlled directly by a bit of the target correspondence (in this case, entry 33). When a bit is ‘0’, each selector in the column controlled by that bit selects input B, i.e., the input of the same row on its left; when a bit is ‘1’, each selector in the column controlled by that bit selects input A, i.e., the input of the next row on its left. For the B inputs of the selectors in the leftmost column of the target array 55, all of the inputs are connected to the output of source array 54, except the bottom row, which takes ‘0’ as input; the B inputs of the selectors in the bottom row and the A inputs of the top row are all ‘0’s. The outputs of the bottom row of selectors are sent to encoder 56. Each time a ‘1’ bit from a row in source array 54 passes a column controlled by a bit of entry 33 whose value is ‘1’, that bit shifts down one row. When it is output from the bottom of target array 55, the position of that ‘1’ bit is the position, within the L1 cache block, of the μOp corresponding to the entry-point instruction. That location information is encoded by the encoder 56 into a binary-valued μOp block offset BNY and sent out via bus 57.
The offset address translation module 50 essentially detects the corresponding relationship between the ‘1’ values in the two entries. Therefore, the result is the same whether the number of ‘1’s before an address in the first entry is counted in order from the least significant bit to the most significant bit, or in reverse order from the most significant bit to the least significant bit, to obtain the count that is then mapped to an address in the second entry. In this case, mask 53 sets the bit corresponding to the address sent from bus 51, and all subsequent bits, to ‘1’. In the following examples, the sequential conversion is used as an illustration for ease of understanding.
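The counting relationship that the mask 53, source array 54, target array 55, and encoder 56 implement in hardware can be illustrated in software. Below is a minimal sketch, assuming entry 31 is a bit vector with a ‘1’ at the start byte of every instruction in the instruction block and entry 33 is a bit vector with a ‘1’ at the first μOp of every instruction in the μOp block; the function name, the list encoding, and the example values are illustrative only, not the hardware implementation.

```python
def ip_offset_to_bny(entry31, entry33, ip_offset):
    """Map an instruction-block byte offset (IP offset) to a uOp block
    offset (BNY) by pairing the k-th '1' of entry 31 with the k-th '1'
    of entry 33, as the shifting-array hardware does."""
    # Ordinal (1-based) of the instruction that starts at, or closest
    # before, ip_offset: the count of '1's up to and including that byte.
    k = sum(entry31[: ip_offset + 1])
    # Position of the k-th '1' in entry 33: the BNY of that
    # instruction's first uOp.
    seen = 0
    for bny, bit in enumerate(entry33):
        seen += bit
        if seen == k:
            return bny
    raise ValueError("no uOp recorded for this IP offset")

# Illustrative example: an 8-byte block with instructions starting at
# bytes 0, 3, 6, whose uOps start at slots 0, 2, 3 of the uOp block.
entry31 = [1, 0, 0, 1, 0, 0, 1, 0]
entry33 = [1, 0, 1, 1, 0]
assert ip_offset_to_bny(entry31, entry33, 3) == 2  # entry point at byte 3
```

Pairing the k-th ‘1’ of one entry with the k-th ‘1’ of the other is exactly the correspondence the arrays establish; as noted above, either bit order works as long as both entries are scanned consistently.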
The logical operation unit of the address mapper 23 generates the μOp offset address 57, the read width 65 (i.e., the number of μOps read at that time), and the instruction byte length 29 corresponding to these μOps. The μOp offset address 57 and the read width 65 control the L1 cache 24 to read a number of successive μOps starting from the BNY on the μOp offset address bus 57, the number being determined by the read width 65. Bus 29 provides the processor core 28 with the byte length of the instructions corresponding to the μOps read at this time so that it can calculate the instruction address 18 for the next clock cycle.
Different architectures may have different read width requirements. Some architectures allow the same number of instructions to be provided to the processor core every clock cycle, with no other conditional restrictions; the read width 65 can then be a fixed constant. However, some architectures require that the μOps corresponding to one instruction must be sent to the processor core together in a single clock cycle (hereinafter referred to as the "first condition"). Some architectures require that the μOps corresponding to a branch instruction must be the last μOps sent to the processor core in a single cycle (hereinafter referred to as the "second condition"). Certain architectures require both the first and the second conditions.
The leading 1 detector examines the shift result from the highest bit of the address (address ‘4’) to the lowest bit of the address (address ‘0’) (i.e., from right to left in this case) and outputs the address corresponding to the first ‘1’. Here, the bit corresponding to address ‘4’ contains the first ‘1’, so the leading 1 detector outputs ‘4’, indicating that the maximum read width satisfying the first condition can reach ‘4’. The priority encoder 63 also includes a second leading 1 detector, which outputs the address corresponding to the first ‘1’ by examining the 4 bits from the left of the shift result of entry 34 (i.e., ‘0010’) from the lowest bit (address ‘0’) to the highest bit (address ‘3’) (i.e., from left to right in this case). The output address is the address of the first branch μOp after the entry point. After that comes the second detection step, which examines the shift result of entry 33 (‘10111’) from the first branch μOp address (‘2’) to the highest bit of the address (‘4’) (i.e., from left to right in this case) and outputs the address of the first ‘1’. The output address in this example is ‘3’, which indicates that the maximum read width is ‘3’ when the second condition is satisfied. The second detection step of the second condition is provided because a branch instruction may correspond to a single μOp or to a plurality of μOps. If a branch instruction in the architecture can only correspond to one μOp, a ‘0’ can be appended to the left of the shift result of entry 34 to make it ‘00010’; the address of the first ‘1’ in that result is then detected from the lowest bit (‘0’) to the highest bit (‘4’) (i.e., from left to right in this case) and output directly (‘3’ in this example) without the need for the second detection step. Other cases are similar; for example, if each branch instruction in the architecture is always translated into two μOps, it is only necessary to append two ‘0’ bits to the left of the shift result of entry 34, detect the first ‘1’ from left to right, and output the corresponding address. The priority encoder 62 outputs the smaller of the outputs of the leading 1 detector and the second leading 1 detector as the actual read width. Therefore, the read width 65 in this example is ‘3’, which is used together with the BNY 57 value ‘2’ to control the L1 cache 24 to read the 3 selected μOps in one clock cycle (the corresponding BNYs are ‘2’, ‘3’, and ‘4’). The μOps are then output by selector 26 to processor core 28 for execution. Different architectures may have different requirements for the read width, such as unrestricted, satisfying the first condition, satisfying the second condition, or satisfying both conditions. The above-mentioned read width generator can meet all four requirements, and other requirements can be met following the same basic principles. Depending on the conditions, the above read width generator can be trimmed, up to being eliminated entirely with reading done at a fixed width. The embodiments disclosed in this specification are illustrated assuming the first condition must be met, and certain embodiments require meeting both the first and the second condition.
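The behavior of the two leading 1 detectors and the final minimum selection can be sketched in software as follows, assuming bit vectors indexed from address ‘0’ upward as in the example above (‘10111’ for the shifted entry 33, and the shifted entry 34 ‘0010’ padded to five bits); the names and the list encoding are illustrative, not the circuit itself.

```python
def read_width(entry33_shifted, entry34_shifted, max_width):
    """Sketch of the read width generator.  entry33_shifted marks, for
    each uOp slot starting at the current BNY, whether a new instruction
    starts there; entry34_shifted marks branch uOps.  Returns how many
    uOps may be issued this cycle under both conditions."""
    # First condition: uOps of one instruction travel together.  Scan
    # from the highest address down for the first '1'; a width w is
    # whole-instruction aligned exactly when slot w starts an instruction.
    w1 = max_width
    for w in range(max_width, 0, -1):
        if entry33_shifted[w]:
            w1 = w
            break
    # Second condition: a branch uOp must be last.  Find the first
    # branch, then the next instruction start after it (a branch may
    # correspond to one or several uOps).
    w2 = max_width
    for b in range(max_width):
        if entry34_shifted[b]:
            w2 = next((i for i in range(b + 1, max_width + 1)
                       if entry33_shifted[i]), max_width)
            break
    return min(w1, w2)

# The worked example above: shifted entry 33 = '10111',
# shifted entry 34 = '0010' padded to '00100'; the result is 3.
assert read_width([1, 0, 1, 1, 1], [0, 0, 1, 0, 0], 4) == 3
```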
The adder 67, the down conversion module 50, and the subtractor 68 can convert the μOp read width in BNY form back into the number of bytes of the corresponding instructions. In this case, the adder 67 adds the BNY 57 value ‘2’ to the read width ‘3’, and the result ‘5’ is sent to the decoder 52 in the down conversion module 50.
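The down conversion is the mirror image of the mapping sketched earlier: the μOp address produced by the adder 67 (the first μOp not read this cycle) is mapped back to an instruction byte offset, and the subtractor 68 then subtracts the current IP offset to obtain the byte length for the core's IP adder. A minimal sketch under the same illustrative bit-vector encoding:

```python
def bny_to_ip_offset(entry31, entry33, uop_addr):
    """Down conversion: map a uOp address (current BNY + read width,
    i.e. the first uOp NOT read) back to the byte offset of the next
    instruction to execute, by pairing '1's in the opposite direction."""
    k = sum(entry33[: uop_addr + 1])   # instructions begun up to that slot
    seen = 0
    for byte_off, bit in enumerate(entry31):
        seen += bit
        if seen == k:
            return byte_off
    raise ValueError("uOp address beyond the recorded block")

# Subtractor 68: byte length consumed this cycle, for the core's IP adder.
entry31 = [1, 0, 0, 1, 0, 0, 1, 0]    # same illustrative block as before
entry33 = [1, 0, 1, 1, 0]
byte_len = bny_to_ip_offset(entry31, entry33, 0 + 2) - 0
assert byte_len == 3                   # two uOps covered a 3-byte instruction
```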
The processor core 28 pre-decodes the received μOps and determines that the μOp with BNY ‘4’ (corresponding to the instruction at instruction address offset ‘9’) is a branch μOp, and the branch instruction address is sent via bus 47 to the branch target buffer 27 to find a match. If the value of the matching branch prediction signal 15 indicates that the branch is not taken, the signal controls selector 25 to select the instruction address 18 output from the processor core 28 as the new instruction address 19. This instruction address is obtained by adding the byte increment ‘7’ to the original instruction address offset ‘4’, so the tag part and the index part of the instruction address are the same as before, but the value of the offset 51 is hexadecimal ‘B’. The index value of the new instruction address still points to the same row as before in the tag unit 22. Based on the matching result of the tag and offset parts of the new instruction address, the entries in the address mapper 23 (entries 31, 32, 33, 34, 37, 38 and 39) corresponding to the matched items in that row are found, and the contents of those entries are read out. The IP offset on bus 19 is processed according to the method described above: the IP tag and index parts are used to read the corresponding row of the storage unit 30 in the block address mapper 23. If the IP offset 51 value is less than the pointer in the entry 38, it indicates that the μOps corresponding to that instruction are not stored in the L1 cache yet. At this time, the system sends the IP via bus 19 to the L2 tag 20 to be matched, and reads the L2 instruction block from the L2 cache 21 (the system can also do the L2 cache matching simultaneously with the L1 cache matching, rather than starting the L2 cache matching only after an L1 cache miss). The value of the above-mentioned entry 37 is sent to the counter 45 in the instruction converter 12, and the value of the entry 38 is sent to the instruction converter 12, where it is decremented by ‘1’ in the instruction translation module 41 and saved to the boundary register. The instruction translation module then translates instructions into μOps from the entry point until the IP offset in the instruction block equals the value in the boundary register. The μOps produced by the conversion are executed by the processor core and stored in the buffer 43.
If an instruction block is entered from the previous instruction block in the order of instruction execution, the entry point can be calculated from the information of the last instruction in the previous instruction block. The starting offset and the instruction length of the last instruction of the previous instruction block are known to the instruction translation module 41. The number of bytes the last instruction occupies in the present instruction block is its instruction length minus (instruction block capacity − starting offset of the last instruction), from which the starting address of the first instruction (the sequential entry point) of this instruction block is known. For example, if the instruction block has 8 bytes, the offset address of the last instruction of the previous instruction block is ‘5’ and the instruction length is ‘4’, then (4−(8−5))=1, so ‘1’ is the sequential entry point of this instruction block: the last instruction of the previous instruction block occupies bytes 5, 6 and 7 of the previous instruction block and byte ‘0’ of this instruction block, and therefore the first instruction of this instruction block starts at byte ‘1’. If the instruction block does not have a corresponding L1 cache block, an L1 cache block is allocated by the L1 cache replacement logic. All the instructions starting from the sequential entry point of the present instruction block are converted into μOps and saved into the L1 cache block, and the rows in the level 1 tag 22 and the address mapper 23 are created as above. If the instruction block already has a corresponding L1 cache block, then, as in the branch entry point example above, the sequential entry point must be compared with the entry 38. If the sequential entry point address is less than the value of the entry 38, the instructions from the sequential entry point up to the address in the entry 38 are translated, and the partial conversion result is stored in that L1 cache block in the L1 cache 24 and in the corresponding row of the storage unit 30 in the address mapper 23. A flag entry 32 can be added to each row of the storage unit 30. When the entry 32 is ‘1’, it indicates that the L1 cache block already contains all the μOps of the instructions whose starting points lie between the sequential entry point and the last byte of the corresponding instruction block, and the entry 37 points to the first valid μOp in that L1 cache block, which corresponds to the sequential entry point. In this case, when entering an L1 cache block, it is only necessary to check whether the corresponding entry 32 is ‘1’. If the entry 32 is ‘1’, then: when a branch enters this L1 cache block, there is no need to compare the IP offset of the branch target with the entry 37, since the IP offset must be greater than or equal to the value in the entry 37; when entering this cache block sequentially, the value of the entry 37 can be used directly as the entry point, and the instruction translation module 41 is not required to assist in calculating the entry point.
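The entry point arithmetic above is simple enough to state as a one-line formula. A minimal sketch (the function name and the clamping of the no-spill case to ‘0’ are illustrative assumptions):

```python
def sequential_entry_point(last_start, last_len, block_size):
    """Entry point of the next instruction block when entered
    sequentially: how far the previous block's last instruction spills
    into this block (0 if it ends exactly on the boundary)."""
    spill = last_len - (block_size - last_start)
    return max(spill, 0)

# The example above: 8-byte blocks, last instruction at offset 5 with
# length 4, so the next block's first instruction starts at byte 1.
assert sequential_entry_point(5, 4, 8) == 1
```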
Depending on the needs of the processor core 28, the caching system may also provide the instruction address offset or the instruction address byte increment for branch instructions. In this case, the instruction address offset is the offset ‘9’ obtained by converting the sum ‘4’ of the μOp address ‘2’ and the number of μOps ‘2’; the instruction address byte increment is obtained by subtracting the current instruction address offset ‘4’ from the instruction address offset ‘9’ of the branch instruction (the BNY of the branch μOp indicated by entry 34 can be converted by the down conversion module 50 just as in the above embodiment), and the result is ‘5’. An entry can also be set up to record the IP offset addresses of the branch instructions, in the same manner as the entry 34. The caching system, particularly the address mapper 23, which contains the complete mapping relationship between instructions and μOps, can satisfy all the requirements of the processor core 28 for instruction or μOp access.
The buffer system is indicated above by the dashed line.
In this embodiment, the μOps output by the converter are sent to processor core 28 for execution and are also stored into an L1 cache block selected by the replacement logic in L1 cache 24. The organization and addressing mode of the block address mapping module 81 are similar to those of the L2 cache 21. Each row in the block address mapping module 81 corresponds to an L2 instruction block in the L2 cache 21, with four entries per row; each entry corresponds to an L2 cache sub-block. Each entry has a valid bit and stores the block number BN1X of the L1 cache block that contains the μOps converted from the instructions in the corresponding L2 cache sub-block. When the L2 tag 20 is accessed by the IP on the bus 19, the matched set number (i.e., index) and way number, together with the sub-block address, are used to read out the entry in block address mapping module 81, put the valid signal of that entry on bus 16, and put its BN1X on bus 82. If the entry is valid, the storage unit 30 in the block offset mapping module 83 is read directly using the L1 cache block number BN1X on bus 82, and the IP offset on bus 51 is mapped to an L1 cache block offset BNY 57 in the manner described above. If the entry is invalid, the selector 26 chooses the μOps translated by converter 12 directly for execution by the processor core 28, and the block number BN1X of that L1 cache block is stored into the invalid entry in the block address mapping module 81 and the entry is set to valid.
In this way, the L1 tag 22 can be omitted by simply sending the IP on the bus 19 to the L2 tag 20 to be matched. If the μOps corresponding to the IP already exist in the L1 cache 24 (that is, the entry addressed by the IP in the block address mapping module 81, output on the bus 16, is valid), the cache system provides the processor core 28 directly with the μOps in the L1 cache 24. If the corresponding μOps are not in the L1 cache 24, the cache system immediately outputs the corresponding instructions from the L2 cache and starts conversion, so the cost of a cache miss is effectively reduced. This cache organization can also be used for deeper memory hierarchies. Take a three-level cache as an example: the instructions are stored in the L3 cache; the instruction converter is located between the L2 cache and the L3 cache; the μOps are stored in the L2 cache and the L1 cache. The IP address is sent to the L3 block address mapper after the L3 tag matches. The L3 block address mapper contains entries corresponding to each L3 cache sub-block, each containing the block number of its corresponding L2 cache block. The L3 block address mapper also contains entries corresponding to each L2 cache sub-block, each containing the block number of its corresponding L1 cache block. The offset mapping module corresponds to the L1 cache; it stores the corresponding relationship between the μOps in each L1 cache block and the corresponding instruction sub-blocks, and it also contains the mapping logic. In this way, even on an L1 cache miss there is no need for a long-latency instruction conversion. The essence of this cache organization is that there is a correspondence between the cache blocks (sub-blocks) of different levels of the cache hierarchy: at the lowest level of the hierarchy, the IP is mapped into the corresponding higher-level block address BNX, and at the higher level, the in-block offset of the IP is mapped into the μOp block offset BNY to address the higher-level cache.
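The essence just described, per-sub-block correspondence at the lower level plus offset mapping at the higher level, can be sketched as follows. The dictionaries standing in for the tag units and the three-way result are illustrative assumptions, not the hardware design:

```python
class BlockAddressMapper:
    """Sketch of block address mapping module 81: one valid entry (BN1X)
    per L2 cache sub-block, found via the L2 tag match."""
    def __init__(self):
        self.l2_tags = {}   # IP block address -> BN2X (stand-in for tag unit 20)
        self.bn1x_of = {}   # BN2X -> BN1X of the converted L1 uOp block

    def lookup(self, ip_block, ip_offset, offset_mapper):
        bn2x = self.l2_tags.get(ip_block)      # L2 tag match
        if bn2x is None:
            return ("l2_miss", None)           # fetch from lower-level memory
        bn1x = self.bn1x_of.get(bn2x)          # entry in module 81
        if bn1x is None:
            return ("convert", bn2x)           # run the instruction converter now
        bny = offset_mapper(bn1x, ip_offset)   # block offset mapping module 83
        return ("hit", (bn1x, bny))            # complete L1 cache address BN1
```

The three return cases mirror the text: a valid entry lets the μOps be read directly without an L1 tag unit, and an invalid entry triggers immediate conversion from the L2 cache, reducing the miss cost.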
In the embodiment of
The existing branch target buffer (BTB) is addressed by an IP address. An entry of the BTB contains the branch prediction, the branch destination address, and/or the branch target instruction, where the branch destination address is also recorded as an IP address. In this example, the entries in the branch target buffer 27 record the branch destination in BN format; the cache system reads μOps according to the read width 65 generated from the BNY and sends the μOps to the processor core for execution. To fill in an entry of BTB 27, the branch target address on bus 19 is mapped into a BN-format branch target by the block address mapping module 81 and the block offset mapping module 83, and the BN-format branch target is stored in the entry of BTB 27 pointed to by the branch instruction address 47 generated by the processor core. The branch destination address recorded in a branch target buffer entry can also be a combination of formats, in which the block address can be in IP format, i.e., the higher bits of the IP excluding the offset (tag, index and L2 sub-block index); or an L2 block number (BN2X), comprising the L2 way number, index and L2 sub-block index; or in L1 block number BN1X format. These address formats are either mapped by the block address mapping module 81 or can directly address the L1 cache 24. The block offset address can either be an IP offset, which must be mapped to an L1 cache block offset BNY by the block offset mapping module 83, or directly a BNY. The branch destination address in a branch target buffer 27 entry may be any combination of the above block address formats and block offset address formats. For more memory levels, the block address formats can be obtained by analogy.
An entry of the branch target buffer 27 whose branch destination is recorded as a BN1X or BN2X address may cause an error after cache block replacement; that is, the L1 cache block pointed to by the branch destination address BN1X in the BTB record has been replaced and is no longer the branch target cache block. This problem can be solved with a Correlation Table (CT), in which each row corresponds to an L1 cache block. Each row has a remapping entry which stores the lower-level cache block address (such as the BN2X or the IP block address), while the other entries store the BTB addresses of the BTB entries whose branch target is the cache block corresponding to that row (i.e., the addresses of the branch instructions). When an L1 cache block is created, its corresponding lower-level block address is recorded in the remapping entry of the corresponding row of the CT. When an entry whose branch target is that L1 cache block is recorded in the branch target buffer 27, the BTB address (branch instruction address) of that record is recorded in the other entries of the CT row corresponding to that L1 cache block. When an L1 cache block is replaced, the system checks the CT row corresponding to that block and uses the lower-level memory block address in the remapping entry to replace the L1 cache block address BN1X in the BTB entries recorded by the other entries.
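The bookkeeping of the correlation table can be sketched in a few lines; the class shape, the list-per-row encoding, and the dict standing in for the BTB are illustrative assumptions:

```python
class CorrelationTable:
    """Sketch of the correlation table (CT): one row per L1 cache block,
    holding a remapping entry (the lower-level block address, e.g. BN2X)
    plus the BTB addresses of all entries whose target is this block."""
    def __init__(self, n_blocks):
        self.remap = [None] * n_blocks              # BN2X / IP block address
        self.users = [[] for _ in range(n_blocks)]  # BTB addresses (branch IPs)

    def on_block_fill(self, bn1x, lower_addr):      # L1 cache block created
        self.remap[bn1x] = lower_addr
        self.users[bn1x].clear()

    def on_btb_fill(self, bn1x, btb_addr):          # a BTB entry targets bn1x
        self.users[bn1x].append(btb_addr)

    def on_block_replace(self, bn1x, btb):          # L1 cache block evicted
        for addr in self.users[bn1x]:
            # Rewrite the stale BN1X target back to the lower-level block
            # address, so the BTB entry stays correct after replacement.
            btb[addr] = self.remap[bn1x]
        self.users[bn1x].clear()
```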
Some small modifications can be made to the processor core 28, the structure of the instruction converter 12, and the addressing mode of the branch target buffer 27, so that the block offset mapping module 83 can be simplified and the processor system made more efficient. The processor core's maintenance of a correct IP has three purposes with respect to the memory hierarchy: first, it provides the next block offset address within the same memory (cache) block based on the exact block offset address; second, it provides the sequential next block address based on the exact block address; third, it allows the direct branch target address to be calculated from the exact block address and the exact block offset address. Here, the block address refers to the higher part of the IP address excluding the block offset address. As for indirect branch instructions, they do not require an accurate IP, because the information for calculating the branch target address (base address register number and branch offset) is already contained in the instruction, without need of the instruction address information. The first purpose of the IP has already been implemented by the block offset mapping module 83. If the requirement for the exact block offset address in the third purpose can be eliminated, then the system only needs to maintain an accurate IP block address and the exact L1 cache block offset BNY, avoiding the remapping from BNY to offset.
The instruction converter 12 is slightly modified to achieve the above purpose. The instruction translation module 41 in the instruction converter 12 can add the block offset address of the instruction itself to the branch offset contained in the instruction when converting a direct branch instruction, and use the sum as the branch offset contained in the converted μOps. When the processor core executes a direct branch μOp modified by this method, an accurate branch target can be obtained by adding the IP block address of the branch μOp to the modified branch offset in the μOp. Thus, the need for an accurate in-block instruction offset (IP offset) is eliminated. The processor core in this structure only needs to store the correct IP block address, so the down conversion module 50 and the subtractor 68 in the block offset mapping module 83 can be omitted. The processor core still maintains an IP address adder for generating the indirect branch target address and the sequential next block address. When the processor core 28 executes an indirect branch μOp, the base address is read from the register file according to the register file address in the μOp and added to the branch offset in the instruction to obtain the branch target address, which is sent via the bus 18. When the processor core 28 executes a direct branch μOp, the branch target address is obtained by adding the stored exact IP block address to the modified branch offset in the instruction, and is sent via bus 18. The controller 69 in the block offset mapping module 83 sends a block-change signal to the processor core 28 when it is necessary to execute the next L1 cache block (when the output of the adder 67 exceeds the L1 cache block boundary). Under the control of that signal, the processor core 28 causes its IP address adder to add ‘1’ at the lowest bit of the stored exact IP block address and sets the block offset address (IP offset) to all ‘0’s, sending the result via bus 18. The controller 69 in the block offset mapping module 83, as described above, only causes the selector 63 to select the IP offset mapped by the up-conversion module 50, or to select the value of the entry 37.
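The offset folding performed by the instruction translation module 41 can be illustrated in two lines; the function names and address values are illustrative, and the block base is assumed to be the IP with its offset bits zeroed:

```python
def convert_direct_branch(insn_block_offset, branch_offset):
    """At conversion time: fold the branch instruction's own in-block
    offset into the offset encoded in the converted uOp."""
    return insn_block_offset + branch_offset

def direct_branch_target(ip_block_base, adjusted_offset):
    """At execution time: the core needs only the exact IP block
    address; no per-instruction offset has to be maintained."""
    return ip_block_base + adjusted_offset

# A branch at byte 9 of its block with encoded offset +6: the target is
# block base + 15, computed without knowing the branch's own offset.
assert direct_branch_target(0x1000, convert_direct_branch(9, 6)) == 0x100F
```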
Since the processor core does not save the exact instruction block offset address, the addressing mode of the branch target buffer 27 must also be changed accordingly. The writing and reading of entries of the branch target buffer 27 can be addressed using the IP block address and the μOp block offset address BNY. The exact BNY can be saved by the processor core and updated according to the read width 65 generated in the block offset mapping module 83, or updated at an entry point with the entry point BNY. When the processor examines an instruction and judges it to be a branch instruction, it uses the corresponding IP block address and the μOp block offset address BNY, via the bus 47, to access the branch target buffer 27 and read the corresponding branch prediction value and the branch destination address or branch target instruction. It is also possible for the block offset mapping module 83 to read the branch μOp entry 34 in the storage unit 30 to determine the BNY address of the branch instruction, i.e., to access the branch target buffer 27 via the bus 47 using the exact IP block address stored in the processor core. The IP block address can also be replaced with a BN1X or BN2X address, etc., merged with the BNY as the BTB address, provided that the same BTB format is guaranteed for both filling and reading. The advantage of doing this is that, because the BN1X block address is shorter than the IP block address, it occupies less storage space. However, the BN1X or BN2X block addresses corresponding to consecutive IP addresses are not necessarily contiguous, so each time the IP block address is updated, the L2 tag 20 and the block address mapping module 81 must be accessed via bus 19 to obtain the corresponding BN1X block address, etc. Only part of the IP address is saved in this architecture.
Further, two memory entries can be added for each L1 cache block to store the block addresses BN1X of the previous (P) and next (N) L1 cache blocks in sequential order. The entries may actually be placed in a separate memory, in the block offset mapping module 83, in the CT, or even in the L1 cache 24. When the next instruction block is converted via the sequential entry point, the corresponding L1 cache block number BN1X is written to the N entry of the present block, and the BN1X of the present block is written to the P entry of the next L1 cache block. Thus, when the controller 69 in the block offset mapping module 83 signals a block change, the N entry can directly provide the L1 cache block address of the sequential next block.
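The P and N entries effectively link the sequential L1 cache blocks into a doubly linked list. A minimal sketch (the class and method names are illustrative):

```python
class SequentialLinks:
    """Sketch of the per-block P (previous) and N (next) entries: once
    the sequential next block has been converted, a block change needs
    no tag lookup at all."""
    def __init__(self, n_blocks):
        self.prev = [None] * n_blocks
        self.next = [None] * n_blocks

    def link(self, bn1x, next_bn1x):   # set when the next block is converted
        self.next[bn1x] = next_bn1x
        self.prev[next_bn1x] = bn1x

    def on_block_change(self, bn1x):   # controller 69 signals end of block
        return self.next[bn1x]         # None means fall back to tag matching
```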
A data structure named a track table can be used in place of the BTB to further improve the processor system. The track table not only stores the branch instruction information, but also contains the information of the sequentially executed instructions. A μOp track table entry stores not only the μOp type field 71, but also the BNX field 72 and the BNY field 73. Since it corresponds to the L1 cache 24, the track table 70 is filled from right to left beginning from the entry in the track table 70 where BNY is ‘3’. There are invalid entries at the low BNY positions, which are shown shaded, such as K0 and M0.
Only the fields 72 and 73 are shown in the track table 70.
The blank entries in the track table 70 correspond to non-branch μOps, and the remaining entries correspond to branch μOps. These entries also hold the L1 cache address (BN) of the branch target (micro-operation) of the corresponding branch μOp. For a non-branch μOp entry on a track, the next μOp to be executed can only be the μOp represented by the entry to its right on the same track. For the last entry on a track, the next μOp to be executed can only be the first valid μOp in the L1 cache block pointed to by the content of the end entry on the track. For a branch μOp entry on a track, the next μOp to be executed may be the μOp represented by the entry to its right, or it may be the μOp pointed to by the BN in the entry, the selection depending on the branch decision. Thus, the track table 70 contains the complete program control flow information for all the μOps stored in the L1 cache 24.
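The three control-flow rules just listed can be written down directly. Below is a minimal sketch; the entry fields follow the text (type field 71, BNX field 72, BNY field 73), while the ‘end’ type tag, the nested indexing, and the function name are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TrackEntry:
    """One track table entry: uOp type (field 71) plus the branch
    target's block address BNX (72) and block offset BNY (73)."""
    uop_type: str      # e.g. 'non-branch', 'cond-branch', 'end'
    bnx: int = None    # L1 cache block of the target / next block
    bny: int = None    # uOp offset of the target / entry point

def next_uop(track, bnx, bny, taken):
    """Control flow encoded by a track: fall through to the right,
    jump to the next block at the end entry, or take a branch's BN."""
    entry = track[bnx][bny]
    if entry.uop_type == 'end':                  # last entry on the track
        return (entry.bnx, entry.bny)            # first valid uOp of next block
    if entry.uop_type != 'non-branch' and taken:
        return (entry.bnx, entry.bny)            # branch target BN
    return (bnx, bny + 1)                        # the entry to the right
```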
A μOp is accordingly read from the L1 cache 24. When the pipeline is halted in the processor core 28, the update of the register 86 in the tracer is halted by the pipeline stop signal 92, causing the caching system to stop providing new μOps to the processor core 28.
Returning to
The controller 87 also compares the BNY on the read pointer 88 with the SBNY on the track table output bus 89 when the compressed track table of format 74 is used as the track table 80 in this embodiment.
In the present embodiment, the L2 tag unit 20, the block address mapping module 81, and the L2 cache 21 correspond to one another: a single address can select the corresponding rows of all three, where the L2 cache 21 stores the instructions. The track table 80, the storage unit 30 in the block offset address mapper 93, the correlation table 104, and the L1 cache 24 correspond to one another, and a single address can select the corresponding rows of all four. In the address format of this example, the field 72 is the μOp block address BN1X and the field 73 is the μOp block offset address BNY, the same as in the previous embodiments.
Returning to the operation: the L1 cache replacement logic provides a replaceable L1 cache block into which the instructions, converted into μOps, are to be saved. The system uses the BN2X to address the L2 cache 21, reads the corresponding L2 sub-block and sends it to the address conversion scanner 102 via bus 40; the memory address IP on bus 19 is also sent to scanner 102 via bus 101. The scanner 102 starts from the byte pointed to by the offset field 108 of the IP address, translates the L2 instruction sub-block into μOps, and sends the result out via bus 46. At this time, the controller 87 controls the selector 26 to choose the μOps on bus 46 for execution by the processor core 98. The scanner also decodes the operation code of each converted instruction. If the instruction is a branch instruction, the μOp type 71 is generated from the type of the branch instruction, a track entry is allocated for it, and the entry is saved in the temporary track in buffer 43 from left to right according to the order of the instructions in the instruction block. The scanner 102 does not allocate entries for non-branch instructions, thereby achieving the compression of the track.
When the instruction type is a direct branch, the scanner 102 also adds the fields 105, 106, 107 of the IP address sent via the bus 101, the in-block IP offset of the branch instruction itself (i.e., the address of the branch instruction itself), and the branch offset contained in the instruction, to calculate the branch target instruction address of that direct branch instruction. The branch target address is sent via the bus 103, the selector 95, and the bus 19 to the L2 tag unit 20 to be matched. If there is no match, the instruction block containing the branch target is read from the lower-level memory and stored into L2 cache 21, and the tag field 105 of the branch target address on bus 19 is stored into L2 tag unit 20. If the tag is matched, the matched way number 109 and the fields 106, 107, 108 on the bus 19 form an L2 cache address BN2, and the BN2 is stored into buffer 43 of the scanner 102: the L2 cache block address BN2X formed by the fields 109, 106, 107 is stored in field 72, and the instruction block offset field 108 is stored into field 73. The block offset address BNY corresponding to the μOp of the branch instruction is stored in the SBNY field 75. In this way, all the fields of a track table entry except the branch prediction field 76 are generated with the L2 tag 20 at the same time as the scanner 102 converts the instructions.
If the instruction type is an indirect branch, the scanner 102 generates the μOp type field 71 and the SBNY field 75 for its corresponding track table entry, but does not calculate its branch target and does not fill its fields 72 and 73. The scanner converts instructions and extracts information in this way up to the last instruction of the instruction block. The scanner 102 calculates the L2 cache sub-block address BN2X of the sequential next sub-block by adding ‘1’ to the BN2X address of the present sub-block. However, if this calculation produces a carry across the boundary of fields 107 and 106 (i.e., crosses the boundary of the L2 instruction block), it must add ‘1’ to the IP sub-block address (fields 105, 106, 107) to calculate the IP address of the sequential next sub-block, and send it to the L2 tag unit 20 via the bus 103 to be matched into a BN2X address. If the last instruction extends into the next instruction sub-block, the scanner 102 uses the BN2X address of the next sub-block described above to read the next sub-block from L2 cache 21 so as to convert the last instruction of this block completely, and extracts the information to save in buffer 43. After that, it creates an end tracking point entry after the last (rightmost) existing entry of the temporary track in buffer 43, saves ‘4’ into the SBNY field 75, saves ‘unconditional branch’ into the type field 71, saves the next block address BN2X described above into the block address field 72, and saves the starting byte address of the first instruction of the next instruction block into the block offset address field 73.
At the same time as the instruction conversion operation described above, the system addresses a row in the correlation table (CT) 104 using the block address BN1X of the replaceable L1 cache block described above, and uses the L2 cache block address BN2X stored in the remapping entry of that row to replace the BN1X in the tracks of the track table 80 identified by the addresses stored in the other entries of that CT 104 row; that is, branch entries pointing to the L1 cache block being replaced are changed to point to its corresponding L2 instruction sub-block. The system also invalidates the entry addressed by the BN2X in the above-mentioned remapping entry in the block address mapping module 81, so that the replaced L1 cache block is disconnected from its original corresponding L2 instruction sub-block. That is, every mapping relationship involving the replaced L1 cache block is removed, so that the replacement of the L1 cache block will not lead to tracking errors. The system then stores the L2 cache block address of the converted instruction sub-block into the remapping entry of that row of the CT 104 and invalidates the other entries of the row. After that, the μOps temporarily stored in the buffer 43 in the instruction conversion scanner 102 are stored into the L1 cache block pointed to by the above-mentioned BN1X, aligned by the highest bit; the track temporarily stored in the buffer 43 is also stored into the track of the track table 80 pointed to by that BN1X, aligned by the highest bit; and the table entries 31, 33 and so on stored in the buffer 43 are stored into the row of the storage unit 30 in the block offset address mapper 93 designated by that BN1X, as described in the previous embodiment.
The read pointer 88 output by the tracer addresses the L1 cache 24 to read the μOps for execution by the processor core 98, and also addresses the track table 80 via the bus 89 to read out the entry (which corresponds to the instruction itself read from the L1 cache 24, or to the first branch instruction after it). The controller 87 decodes the type field 71 on the bus 89. If the address type is an L2 cache address BN2, the controller 87 controls the selector 95 to select the address on bus 89, directly addresses the block address mapping module 81 by the L2 cache block address BN2X of that BN2 via bus 19, and reads the entry via bus 82 without needing to match in the L2 tag unit 20. If the entry read from the bus 82 is ‘invalid’, it means that the L2 cache instruction sub-block addressed by the block number BN2X of that BN2 has not been converted to μOps and stored into the L1 cache 24. At this time, the system uses the BN2X on bus 19 to address the L2 tag unit 20 and reads out the corresponding tag, which, along with the index 106 on bus 19, the L2 sub-block number 107, and the block offset 108, is composed into a complete IP address. The IP address is sent to the instruction conversion scanner 102 via bus 101; the system also uses that BN2X to address the L2 cache 21 to read out the corresponding L2 cache instruction sub-block and sends it to scanner 102 via bus 40. The scanner 102 then converts (as described above) the instructions in the instruction block into μOps and sends them via the bus 46 and the selector 26 to the processor core 98 for execution; the scanner 102 also stores the μOps and the information obtained by the extraction, the calculation, and the matching in the conversion process into the buffer 43. The L1 cache replacement logic provides a replaceable L1 cache block number BN1X. After the instruction block conversion is complete, the scanner 102 stores (as described above) the μOps in the buffer 43 into the L1 cache block addressed by that BN1X in the L1 cache 24. The scanner 102 also stores the other information in the buffer 43 into the row of the storage unit 30 in the block offset address mapper 93 pointed to by the BN1X, and updates the row pointed to by the BN1X in the correlation table 104. The scanner 102 also stores the BN1X value into the block address mapping module 81 as described above and validates that entry. Thereafter, when the entry in the block address mapping module 81 addressed by the BN2X output by the track table 80 on bus 19 is ‘valid’, the entry on the bus 82 is ‘valid’. The system then addresses the storage unit 30 in the block offset address mapper 93 by the BN1X on bus 82, and reads out the entries 31 and 33 in the row selected by that BN1X. According to the mapping relationship of entries 31 and 33, the offset address conversion module in the block offset address mapper 93 maps the block offset 108 on bus 19 into the corresponding μOp offset BNY 73 and outputs it via bus 57. The BN1X on bus 82 is merged with the BNY on bus 57 to become an L1 cache address BN1. The system then replaces the BN2 in the track table 80 entry with this BN1 and sets the type field 71 to the BN1 format. The system may also bypass the BN1X directly to the bus 89 for use by the controller 87 and the tracer.
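A minimal sketch of the BN2-to-BN1 conversion just described follows; the dictionary-shaped tables standing in for the block address mapping module 81 and the offset mapping rows (entries 31/33) are illustrative assumptions.

```python
# Sketch of the BN2 -> BN1 conversion above; table shapes and names are
# illustrative assumptions, not the disclosure's storage layout.

mapping_81 = {}        # BN2X -> BN1X, valid only after conversion to uOps
offset_rows_93 = {}    # BN1X -> {instruction block offset: uOp offset BNY}

def bn2_to_bn1(bn2x, block_offset):
    bn1x = mapping_81.get(bn2x)
    if bn1x is None:
        # 'invalid' entry on bus 82: the sub-block has not been converted;
        # scanner 102 must read L2, translate, and fill L1 and the tables.
        return None
    bny = offset_rows_93[bn1x][block_offset]   # offset mapping via bus 57
    return (bn1x, bny)                         # BN1 replaces BN2 in the track

# Example: L2 sub-block 0x12 was converted into L1 block 5; instruction
# offsets 0, 3, 7 start at uOp offsets 0, 1, 3 respectively.
mapping_81[0x12] = 5
offset_rows_93[5] = {0: 0, 3: 1, 7: 3}
print(bn2_to_bn1(0x12, 7))   # (5, 3)
print(bn2_to_bn1(0x34, 0))   # None: triggers conversion by scanner 102
```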
The controller 87 controls the operation of the tracer according to the branch prediction 76 on the bus 89. There are two registers in the tracer that hold both possible paths of a branch μOp at the same time, so that execution can return to the other path when the prediction is wrong: the register 96 stores the address of the fall-through μOp of the branch μOp, and the register 86 stores the address of the target μOp. The storage unit 30 in the block offset address mapper 93 reads the entries 31 and 33 via bus 82 when it needs to map an L2 cache address BN2 to an L1 cache address BN1 as described above. At other times, it reads the entry 33 addressed by the BN1X in read pointer 88 to provide the first condition (or the entry 33 can be designed with dual read ports so that the two uses do not interfere with each other). The number of μOps to be read can be controlled by the read width of the second condition as described above, using the contents of the entry 34. This number can also be obtained by subtracting the value of the read pointer 88 from the branch μOp address SBNY in the field 75 of the track table entry and adding ‘1’ to the result. If the result is less than or equal to the maximum read width, the result is the read width; if the result is greater than the maximum read width, the maximum read width is the read width. The present embodiment assumes that the read width is controlled by the second condition, i.e. the μOps at and after the branch point are read in different cycles; the block offset address BNY in read pointer 88 controls the shifter 61 to shift the entry 33, and the read width 65 is generated (so that the valid μOps correspond to complete instructions) by priority encoder 63. If there is no requirement for the first condition, the read width 65 can be a fixed number of instructions read at a time. The read pointer 88 provides the L1 cache 24 with the starting address, and the read width 65 provides the L1 cache 24 with the number of μOps read in one cycle. The adder 94 adds the BNY value on read pointer 88 to the value of read width 65. The output of the adder 94 is used as the new BNY and is combined with the BN1X value on read pointer 88 to form a BN1, which is output on bus 99.
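The read-width rule stated above reduces to a saturating difference; a minimal sketch, with MAX_WIDTH standing in for the maximum read width of one cycle:

```python
# Minimal sketch of the read-width rule above; MAX_WIDTH is an assumed
# stand-in for the maximum read width.

MAX_WIDTH = 4

def read_width(bny, sbny):
    """uOps issued this cycle, up to and including the branch uOp at SBNY."""
    span = sbny - bny + 1                  # (SBNY - read pointer) + 1
    return span if span <= MAX_WIDTH else MAX_WIDTH

assert read_width(2, 3) == 2               # branch uOp reached this cycle
assert read_width(0, 9) == 4               # branch far away: full width
```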
The controller 87 compares the BNY value on bus 99 with the SBNY value on the bus 89. If the BNY is less than SBNY, the controller 87 controls the selector 90 to select the value on the bus 99 and saves it into the register 96; the controller 87 also controls the selector 85 to select the BN1 address (fields 72 and 73) on the bus 89 to be stored in the register 86 (or stores it only when the value on the bus 89 changes). The controller 87 then controls the selector 97 to select the output of register 96 as the next read pointer. If the BNY on the bus 99 is equal to the SBNY on the bus 89, which means the branch μOp corresponding to the track table entry output via the bus 89 is read in this cycle, then the controller 87 controls the system according to the prediction value 76 on the bus 89. If the branch prediction value 76 is ‘not branch’, the controller 87 controls the L1 cache 24 to transmit the μOps to the processor core 98 according to the read width 65, but, according to the SBNY field 75 on bus 89, it sets the flag bits of the μOps whose BNY addresses are greater than the branch point corresponding to that SBNY. Each μOp sent from the L1 cache 24 to the processor core 98 in the present embodiment has a flag bit. In the corresponding figure, μOp 111 is a branch μOp, and the μOp segment 112 comprises the fall-through μOps of the branch operation; the μOp 113 is the branch target μOp, and the μOp segment 114 comprises the fall-through μOps of the branch target operation.
The adder 94 continues to add the BNY of the read pointer 88 and the read width 65. The sum, together with the BN1X on read pointer 88, is sent via bus 99 and stored into register 96 as the read pointer 88 for the next cycle, which controls the L1 cache 24 to send the corresponding μOps for execution by processor core 98. The above process repeats until a branch decision 91 is made and sent to controller 87.
If the judgment is ‘do not execute branching’, the controller 87 controls the processor core 98 to retire the μOps that are flagged as speculatively executed. The controller 87 also, by the method described above, saves the output 99 of the adder 94 into register 96 and controls the selector 97 to select the output of the register 96 as the next read pointer. In this way, the loop between the adder 94 and the register 96 proceeds. If the judgment is ‘execute branching’, then the controller 87 controls the processor core to abort the μOps that are flagged as speculatively executed. The controller 87 also controls the selector 97 to select the register 86 (whose content at this time is the branch target from bus 89, that is, the address of μOp 113) as the read pointer, to control the L1 cache 24 to send the branch target μOps (the number of which is determined by the read width 65 as described above). After that, the controller 87 combines the sum of the read pointer 88 and the read width 65 with the BN1X of the read pointer 88 onto bus 99 and saves it into register 96; it again controls the selector 97 to select the output of the register 96 as the next read pointer, and the loop continues in this way.
If the branch prediction value 76 is ‘execute branching’, the controller 87 saves the BN1 address on bus 99 (i.e. the address of the fall-through μOp following branch μOp 111) into register 96; controls the L1 cache 24 to send branch μOp 111 and the μOps before it without setting their flag bits; and flags the branch target μOps sent thereafter (the μOp 113 and the μOp segment 114) as ‘speculate execution’. At the same time the controller 87 controls the selector 85 to select the output 99 of the adder 94 and saves it to register 86. In the next cycle, the controller 87 controls the selector 97 to select the output of the register 86 as the read pointer 88 to access the track table 80 and the L1 cache 24. The loop between the adder 94 and the register 86 is maintained until the processor core 98 executes the said μOps and generates the branch judgment 91 to send to the controller 87.
In this embodiment, the end track point of a track is recorded as the non-conditional branch type. When the BNY on the output 99 of the adder 94 is equal to or greater than the SBNY in the field 75 on the bus 89, the controller 87 controls the L1 cache 24 to send the μOps beginning at the address of the read pointer 88 and ending at the last μOp of this L1 cache block to the processor core 98 for execution. In the next cycle, the controller 87 controls the selector 97 to select the output of the register 86 as the read pointer 88 and does not set the flag bits of the μOps sent in this cycle; it stores the output 99 of the adder 94 into register 96 and saves the BN1 address on bus 89 into register 86. In the cycle after next, the controller 87 controls the selector 97 to select the output of the register 96 as the read pointer 88. In this way, the loop between the adder 94 and the register 96 continues.
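A behavioral sketch of the tracer's next-pointer choice may help; it collapses selectors 90, 85, and 97 into a single function, so it approximates the control flow rather than the circuit, and all names are illustrative.

```python
# Behavioral sketch of the tracer's next-read-pointer selection; a
# simplification with illustrative names, not the circuit itself.

MAX_WIDTH = 4

def next_read_pointer(bn1x, bny, sbny, target_bn1, prediction_taken):
    width = min(sbny - bny + 1, MAX_WIDTH)      # second-condition read width
    fall_through = (bn1x, bny + width)          # adder 94 output on bus 99
    if bny + width - 1 < sbny:
        return fall_through                     # branch uOp not reached yet
    # Branch uOp issued this cycle: follow the prediction 76; the other
    # path is kept in register 96/86 for recovery on a wrong judgment 91.
    return target_bn1 if prediction_taken else fall_through

print(next_read_pointer(5, 0, 6, (9, 4), True))    # (5, 4): still sequential
print(next_read_pointer(5, 4, 6, (9, 4), True))    # (9, 4): jump to target
print(next_read_pointer(5, 4, 6, (9, 4), False))   # (5, 7): fall through
```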
When the controller 87 decodes the type field 71 on the bus 89 and judges the entry to be the indirect branch type, it controls the cache system to provide the processor core 98 with μOps as described above until the μOp corresponding to the said indirect branch entry arrives. The controller 87 then controls the cache system to suspend sending μOps to the processor core 98. The processor core executes the indirect branch μOp, uses the register number of the μOp to read the base address in the register file, and adds the base address and the branch offset in the μOp to obtain the branch target address. The IP of that branch target is sent to the L2 tag unit 20 to be matched via bus 18, selector 95, and bus 19. The matching procedure and subsequent operations are as described above. The BN1 address obtained by the matching is bypassed to the bus 89, and the controller 87 saves that BN1 into register 86. In the next cycle, execution proceeds according to the branch judgment 91 sent by the processor core 98, or according to the processor architecture (the indirect branches of some architectures are always unconditional). The execution is the same as when the branch prediction is ‘speculate execution’ described above, except that it does not need to set the flag bits of the μOps and does not need to wait for the branch judgment 91 generated by the processor core 98 to confirm the accuracy of the prediction.
The BN obtained by the IP address mapping of the said indirect branch target can be stored into the said indirect branch entry of the track table, promoting its instruction type to the indirect-direct type. The next time the controller 87 reads that entry, it treats it as the direct branch type, executing it by the branch prediction method, i.e. setting the flag bits of the μOps as ‘speculate execution’. When the processor core executes that indirect branch μOp, it sends out the branch target IP address via bus 18. The address is mapped into a BN1 address by the L2 tag unit and so on as described above, and that BN1 is compared with the BN1 output by the track table. If they are identical, the controller retires all μOps that are marked ‘speculate execution’ and continues to execute forward; if they are different, all μOps that are marked ‘speculate execution’ are aborted, the BN1 obtained by the IP address mapping is saved into that indirect-direct entry in the track table and bypassed to bus 89, and the controller 87 saves that BN1 into register 86 and controls the selector 97 to select the output of the register 86 as the read pointer 88 to access the L1 cache 24, providing the processor core 98 with the μOps starting from the correct indirect branch target. Alternatively, the BN1 in the indirect-direct entry can be remapped into the corresponding IP address, and that remapped IP address compared with the IP address calculated by the processor core 98 while the processor core 98 executes the indirect branch μOp. The remapping process reads the entries 31 and 33 in storage unit 30 by the BN1X address in the BN1 and uses the method of the down conversion module 50 as in the embodiment described above.
When the BNY on bus 99 is equal to the SBNY on bus 89, if the branch prediction value 76 on the bus 89 is ‘predict to branch’, then the selector 85 selects the branch target address BN1 on the bus 89 to store into register 86, updating the read pointer 88 to control the L1 cache 24 to send the branch target μOps (μOp 113 and the μOps of segment 114) to the processor core 98 for execution. Those μOps are all flagged with the same newly allocated flag value ‘1’; at the same time, the address on the bus 99 (which is the address of the μOps that fall through the branch μOp), the branch prediction value 76 on bus 89, and the new flag ‘1’ are saved into the entry pointed to by the write pointer in the FIFO 136. When the BNY on the bus 99 is equal to the SBNY on the bus 89, if the branch prediction value 76 on the bus 89 is ‘predict not branch’, then the selector 85 selects the fall-through μOp address on bus 99 to save into register 86, updating the read pointer 88 to control the L1 cache 24 to send the fall-through μOps of the branch μOp to the processor core 98 for execution. Those μOps are also flagged with the same newly allocated flag value; at the same time, the branch target μOp address on bus 89, the branch prediction value 76 on bus 89, and the new flag value are saved into the entry pointed to by the write pointer in the FIFO 136. In short, the μOp address that is not selected by the branch prediction is stored in the FIFO 136 together with the corresponding branch prediction value and flag value. At other times, when the BNY on the bus 99 is not equal to the SBNY on the bus 89, the selector 85 selects the output 99 of the adder 94 to update the read pointer 88, controlling the L1 cache 24 to send the sequential μOps to the processor core 98 for execution. These μOps use the flag value that was allocated the last time the BNY on the bus 99 equaled the SBNY on the bus 89.
When the processor core 98 generates the branch judgment, it reads out the entry in the FIFO 136 pointed to by the internal read pointer. The branch prediction 76 inside the entry is then compared with the branch judgment 91. If they are identical, that is, the branch prediction was right, the processor core 98 executes, writes back, and commits all of the μOps flagged by the flag value in the said entry read from FIFO 136; the comparison result controls the selector 137 to select the output of the selector 85, so that the tracer continues updating the read pointer 88 according to its present state and sending μOps to processor core 98 for execution. The internal read pointer of the FIFO 136 then advances to the next entry.
If the comparison result differs, the branch prediction was wrong, so the result controls the selector 137 to select the L1 cache address BN1 output from the FIFO 136 entry to save into the register 86; the address of the path that was not selected by the branch prediction thus updates the read pointer 88, which sends the μOps to the processor core 98 for execution. All of the μOps in the processor core that are flagged by the flag in the entry output by the FIFO 136 and by the flags in the following entries are aborted; this can be done by reading all the entries (from the read pointer to the write pointer) in FIFO 136 and aborting all the μOps in the processor core that are flagged by the flags of those entries. After that, at the next branch point, the read pointer 88 is updated according to the value on bus 89 selected by the selector 85 under the branch prediction 76; and the flag value allocated to that branch, the path address that is not selected by the branch prediction 76, and the value of the branch prediction 76 are stored into FIFO 136. The above loop makes the processor core 98 execute μOps according to the branch prediction 76. When the processor core 98 generates the branch judgment 91, the branch judgment 91 is compared with the corresponding branch prediction 76 stored in the FIFO 136. If they are not identical, the operations that were executed under the prediction are aborted, and execution returns to the path that was not selected by the branch prediction. The other operations are the same as in the embodiment described above.
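The FIFO 136 scheme can be summarized in a few lines of Python; the entry layout and function names below are assumptions made for illustration.

```python
# Sketch of the FIFO 136 recovery scheme: the path NOT chosen by the
# prediction is queued with the prediction and a flag value; the real
# branch judgment later checks the entry. Structure is an assumption.

from collections import deque

fifo_136 = deque()

def at_branch(pred_taken, target_bn1, fall_through_bn1, flag):
    """Follow the prediction; queue the other path for possible recovery."""
    chosen, other = ((target_bn1, fall_through_bn1) if pred_taken
                     else (fall_through_bn1, target_bn1))
    fifo_136.append({'other': other, 'pred': pred_taken, 'flag': flag})
    return chosen                                  # next read pointer 88

def on_judgment(actual_taken):
    entry = fifo_136.popleft()
    if entry['pred'] == actual_taken:
        return ('commit', entry['flag'], None)     # prediction was right
    # Prediction wrong: abort uOps carrying this flag (and younger ones)
    # and restart from the path not selected by the prediction.
    return ('abort', entry['flag'], entry['other'])

ptr = at_branch(True, (9, 4), (5, 3), flag=1)
print(ptr, on_judgment(False))   # (9, 4) ('abort', 1, (5, 3))
```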
An L1 cache with dual ports, which can be addressed simultaneously by the fall-through (FT) address of a branch μOp and by the branch target (TG) address provided by the tracer and the track table, can provide the processor core with the fall-through μOps tagged FT and the branch target μOps tagged TG at the same time for execution. After the processor makes the judgment on the branch μOp, it can discard one of the FT and TG sets of μOps according to the judgment, and select the address of the other set to continue execution, with the tracer addressing the track table and the L1 cache. Since sequential μOps mostly reside in the same L1 cache block, the same function as the dual-port L1 cache can be implemented by an instruction read buffer (IRB) that stores at least one L1 cache block, replacing one of the read ports of the L1 cache to provide the FT μOps, together with a single-port L1 cache providing the TG μOps.
The instruction read buffer 120 thus enables the cache system to provide the processor core with the μOps of both paths of a branch at the same time. In this embodiment, the L2 tag unit 20, the block address mapping module 81, the L2 cache 21, the instruction conversion scanner 102, the block offset address mapper 93, the correlation table 104, the track table 80, the L1 cache 24, and the processor core 98 are identical to those of the embodiment described above. The controller 87 decodes the μOp type on the output 89 of the track table 80 to control the operation of the cache system, and compares the SBNY on the bus 89 with the BNY on bus 99 to determine the branch operation time point. The selector 121, under the control of the controller 87, selects the read pointer 88 or the pointer 127 as the address 133 to address the track table 80, with the default selection being pointer 88. The processing of indirect branch μOps is the same as in the embodiment described above.
The instructions are stored into L2 cache 21, and the address tags are stored into L2 tag unit 20. The instructions are translated into μOps and stored into L1 cache 24. The control flow information in the instructions is extracted and stored into track table 80. The block address mapping module 81, the block offset address mapper 93, and the CT 104 operate as in the embodiment described above. The L1 cache block containing the μOp being executed in the processor core 98 is stored into IRB 120 and is addressed by the BNY in read pointer 88 each cycle. The plural μOps obtained, up to the maximum read width, are sent to the processor core 98 via bus 118; and the read width generator in the block offset row 122, according to the information in its entry 33 and the BNY on read pointer 88, generates the read width 139 to mark the valid μOps. The processor core 98 ignores the invalid μOps. The read pointer 88 also addresses the track table 80 via the selector 121 and reads the entry via the bus 89. In each cycle, the controller 87 compares the SBNY on bus 89 with the SBNY stored in the controller 87 in the previous cycle; if they are not identical, the value on bus 89 has changed. The controller therefore stores the SBNY on bus 89 each cycle for the next cycle's comparison. When the controller 87 detects a change on the bus 89, it controls the selector 125 in the target tracer to select the branch target BN1 on bus 89 to store in the register 126, updating the read pointer 127. The BN1X of the read pointer 127 addresses the L1 cache 24 to provide branch target μOps to the processor core 98 via bus 48. The BN1X of the read pointer 127 also addresses and reads the entry 33 of the corresponding row in the storage unit 30 of the block offset address mapper 93. The read width generator in the block offset address mapper 93, according to the information in the entry 33 and the BNY on the read pointer 127, generates the read width 65 to mark the valid μOps. These valid μOps are marked as branch target ‘TG’. On the other hand, the controller 87 also compares the SBNY on the bus 89 with the BNY on the bus 99. If the BNY is greater than the SBNY, the controller 87 marks all the μOps sent by the IRB 120 to the processor core 98 whose block offset addresses are greater than the SBNY as ‘FT’, meaning that they are executed under the ‘fall-through’ assumption.
If the controller 87 decodes the field 71 on the bus 89 as a conditional branch, it waits for the processor core 98 to generate the branch judgment 91 to control the program flow. Before the branch judgment is made, the selector 85 in the present tracer 131 selects the output 99 of the adder 94 to store into the register 86, updating the read pointer 88, and the IRB 120 continues providing the processor core 98 with ‘FT’ instructions until the next branch point; the selector 125 in the target tracer 132 selects the output 129 of the adder 124 to store into the register 126, updating the read pointer 127, and continues providing the processor core 98 with ‘TG’ instructions until the next branch point. The processor core 98 executes the branch μOp to obtain the branch judgment 91. If the branch judgment 91 is ‘not branch’, the processor core 98 aborts all the μOps marked ‘TG’. The branch judgment 91 also controls the selector 85 to select the output 99 of the adder 94 to store into the register 86, so that the BNY in the read pointer 88 continues to point to the μOp following the said ‘FT’ μOps in the IRB 120. The block offset row 122 calculates the corresponding read width from this BNY to mark the valid μOps sent to the processor core 98 for execution. The read pointer 88 addresses the track table 80 via the selector 121 and reads the entry on the bus 89. When the controller 87 detects a change on the bus 89, it makes the selector 125 select the BN1 on bus 89 to store into register 126; the read pointer 127 addresses the L1 cache 24, the valid instructions are marked by the read width 65, and the new branch target μOps are marked ‘TG’ and sent to the processor core 98 for execution as described above.
When the branch judgment 91 is ‘branch’, the processor core 98 aborts the execution of all the μOps with the ‘FT’ flag. The branch judgment 91 also controls the selector 85 in the present tracer 131 to select the output 129 of the adder 124 in the target tracer 132 to store into the register 86, updating the read pointer 88; it stores the L1 cache block in the L1 cache 24 addressed at this time by the read pointer 127 into the IRB 120; and it stores the entry 33 addressed by the pointer 127 in the storage unit 30 of the block offset mapper 93 into the block offset row 122. The BNY of the read pointer 88 now points to the μOp that follows the said ‘TG’ μOps just stored into the IRB 120. The block offset row 122, according to that BNY, calculates the corresponding read width to mark the valid μOps sent to the processor core 98 for execution. The read pointer 88 also addresses the track table 80 via the selector 121 and reads the first branch target from the track corresponding to the L1 cache block just stored into the IRB 120. That first branch target is stored by the controller 87 into the register 126 of the target tracer and updates the read pointer 127. The read pointer 127 addresses the L1 cache 24, and the μOps corresponding to that branch target are marked ‘TG’ and sent to the processor core 98 for execution. If the controller 87 decodes the type on bus 89 and judges it to be a non-conditional branch, the controller 87 monitors the BNY value on bus 99; when it is equal to the SBNY on bus 89, the branch judgment 91 is set to ‘branch’ directly. The processor core 98 and the cache system then execute exactly as in the ‘branch’ case described above. As an optimization, the fall-through μOps of such a branch μOp can be marked invalid rather than ‘FT’, so that the processor core 98 utilizes its resources more efficiently.
When all of the branch μOps in the IRB 120 have been sent to the processor core 98 for execution, the end track point entry of the corresponding track is output by the track table 80 via bus 89. The controller 87 detects the change on the bus 89 and controls the selector 125 to select the bus 89, storing the next L1 cache block address BN1 of the end track point on bus 89 into the register 126 to update the read pointer 127. The subsequent operations are similar to those described above for the unconditional branch: the read pointer 88 addresses the IRB 120 to send out the μOps, and the IRB 120 automatically marks the output word lines that exceed the L1 cache block capacity as invalid; the read pointer 127 addresses the L1 cache 24 to send out the μOps marked ‘TG’ to the processor core 98 for execution. Therefore, the μOps before the end track point in the IRB 120 and the μOps in the next sequential L1 cache block are both sent to the processor core 98 for execution. The controller 87 monitors the BNY value on bus 99; when it is equal to the SBNY on the bus 89, the last μOp in the IRB 120 has been sent to the processor core 98 for execution in this clock cycle. If the controller 87 decodes the type on bus 89 as an unconditional branch, it sets the branch judgment 91 to ‘branch’ directly. At this time, the controller 87 controls the selector 85 in the present tracer 131 to select the output 129 of the adder 124 of the target tracer 132 to store into the register 86, updating the read pointer 88; it stores the L1 cache block addressed by the read pointer 127 in the L1 cache 24 into the IRB 120; and it stores the entry 33 addressed by the read pointer 127 in the storage unit 30 of block offset address mapper 93 into the block offset row 122. The BNY of the read pointer 88 points to the μOp after the said ‘TG’ μOps in the IRB 120. The block offset row 122 again calculates the corresponding read width from that BNY to mark the valid μOps to be sent to the processor core 98 for execution.
When the BNY value on the bus 129 output from the adder 124 in the target tracer 132 exceeds the capacity of the L1 cache block (hereinafter referred to as overflow), the μOps of the cache block sequentially following the branch target L1 cache block pointed to by the current read pointer 127 should be sent to the processor core 98 for execution in the next clock cycle. When the controller 87 judges that this BNY overflows, it controls the selector 121 to select the read pointer 127 (which at this time points to the end track point) as the address 133 to address the track table 80, which sends the next block address BN1 of the end track point via the bus 89. The controller 87 further controls the selector 125 of the target tracer 132 to select bus 89, storing this BN1 into the register 126 to update the read pointer 127. The cache system then provides the processor core 98 with the μOps of the next sequential cache block, obtained by the updated read pointer 127 addressing the L1 cache 24. The block offset address mapper 93 also reads the corresponding entry 33 in the storage unit 30 by the BN1X of the updated read pointer 127, and generates a read width 65 according to the BNY of the read pointer 127 to mark the valid μOps. The read width 65 and the BNY of read pointer 127 are added by the adder 124 to generate the BNY on bus 129 for further use.
The track table can provide a branch μOp (or instruction) address in advance, as shown in the corresponding figure.
Although the present invention uses a processor system that executes variable-length instructions as its example, the cache system and processor system of this disclosure can also be applied to a processor system that executes fixed-length instructions. In that case, the lower portion of the memory address (the IP offset) of the fixed-length instruction can be used directly as the block offset address BNY of the cache, and the block offset address mapping is not required. The IP offset of the address of the processor system that executes fixed-length instructions is then named BNY to distinguish it from the variable-length instruction address. The address format of the processor system that executes fixed-length instructions is shown in the corresponding figure.
The method described above can attach a flag to each μOp segment (the μOps up to and including a branch μOp form a μOp segment), so that the branch decision can abandon the execution of the unselected μOp segments based on branch hierarchy. This flag system dispatches a flag for each μOp segment, which represents the branch hierarchy of the segment and the branch attribute of the segment (whether this segment is the branch target μOp segment of the previous segment, or its fall-through μOp segment when not branching); in the flag system, the branch decision produced by the processor core after executing the branch instruction is also expressed in terms of the flag system's branch hierarchy and branch attribute. As a result, the speculatively executed μOp segments not selected by the branch decision are abandoned as early as possible, while the speculatively executed μOp segments selected by the branch decision are normally executed and committed. This flag system can ensure the correct commit order of out-of-order dispatched μOp segments based on the hierarchical information in the flag, while the μOp order within a μOp segment is guaranteed by the order of μOps in the segment. A hierarchical branch label system of this kind is shown in the corresponding figure.
In this flag system, the write pointer 138 attached to each μOp segment indicates the branch hierarchy of that μOp segment, and the flag 140 attached to the μOp segment stores the branch attribute of the segment at the bit position pointed to by the write pointer 138. The processor core produces the branch decision 91 (i.e. a branch attribute) together with a flag read pointer indicating the branch hierarchy of the branch to which the decision 91 belongs, and compares them with the flag of each μOp segment. Further, the flag system also expresses the branch history of the corresponding μOp segment (its position in the branch tree, expressed by the bits of the flag 140 between the flag write pointer 138 and the flag read pointer produced by the processor core), so that when the execution of one fork of a branch is aborted, the execution of the child and grandchild instruction segments of that fork is also aborted, releasing the ROB entries, reservation stations, schedulers, execution units, and other resources occupied by these μOps as soon as possible. The flag system has a history window (i.e. the bit width of the flag 140) whose length is greater than the number of outstanding branch levels in the processor, so that no flag aliasing occurs.
The flag 140 contains 3 binary bits. The left bit represents a level of branch, the middle bit represents the daughter branch at the next level, and the right bit represents the granddaughter branch at the level after next. The value of each bit is the branch attribute of a μOp segment, where ‘0’ means that the μOp segment is the fall-through μOp segment of its preceding branch μOp, and ‘1’ means that the μOp segment is the branch target μOp segment of its preceding branch μOp. The flag write pointer 138 represents the branch level of its μOp segment, and the bit pointed to by the pointer 138 stores the branch attribute of that μOp segment. The value representing the branch attribute of the μOp segment is written to the bit pointed to by the flag write pointer 138 without affecting the other bits.
For example, μOp segment 142 is the fall-through segment of μOp segment 141; the value of its attached flag 140 is ‘0xx’, where ‘x’ means the original value, and its flag write pointer 138 points to the left bit. Correspondingly, the μOp segment 146 is the branch target segment of branch μOp 141; the value of its attached flag is ‘1xx’, and its flag write pointer also points to the left bit. When all operations (including branch μOp 143) in μOp segment 142 have been sent out by the cache system with the ‘0xx’ flag, the fall-through segment 144 and the branch target segment 145 of branch μOp 143 are also sent out. The flag system generates a new flag for a μOp segment by inheriting the flag of the parent μOp segment (the segment before the branch), moving the flag write pointer right by one bit (descending one branch level), and writing the branch attribute into the bit the pointer now points to. Therefore, the flag inherited from μOp segment 142 is ‘0xx’, and the flag write pointer now points to the middle bit; by these rules the flag of the fall-through segment 144 of branch μOp 143 is ‘00x’, and the flag of branch target segment 145 is ‘01x’. In the same manner, the flag of the fall-through segment 148 of branch μOp 147 is ‘10x’, and the flag of branch target segment 149 is ‘11x’. Each operation segment sent by the cache system carries the flag of the μOp segment to which it belongs. There is a flag read pointer in the processor core; each time the processor core produces a branch decision, it compares that branch decision with the bit pointed to by the read pointer in the flag 140 of the μOps being executed in the processor core, aborting the execution of part of the μOps, and then the read pointer moves to the right by one bit.
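The flag-generation rule (inherit the parent flag, advance the write pointer, write the branch attribute) can be sketched as follows; representing unwritten bits as 'x' mirrors the example above, and the names are illustrative.

```python
# Toy model of hierarchical flag generation: a child segment inherits the
# parent flag, the write pointer moves one bit right, and only the pointed
# bit is written with the branch attribute.

FLAG_BITS = 3

def child_flag(parent_flag, parent_ptr, attribute):
    """attribute: '0' = fall-through segment, '1' = branch target segment."""
    ptr = (parent_ptr + 1) % FLAG_BITS       # write pointer moves one bit right
    flag = parent_flag.copy()
    flag[ptr] = attribute                    # only the pointed bit is written
    return flag, ptr

root = (['x', 'x', 'x'], -1)                 # before branch uOp 141
f142, p = child_flag(*root, '0')             # ['0','x','x']: fall-through of 141
f146, _ = child_flag(*root, '1')             # ['1','x','x']: target of 141
f144, _ = child_flag(f142, p, '0')           # ['0','0','x']: fall-through of 143
f145, _ = child_flag(f142, p, '1')           # ['0','1','x']: target of 143
f148, _ = child_flag(f146, p, '0')           # ['1','0','x']: fall-through of 147
f149, _ = child_flag(f146, p, '1')           # ['1','1','x']: target of 147
print(f144, f145, f148, f149)
```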
Assume that the processor core executes branch μOp 141 and obtains branch decision ‘1’, meaning the branch is taken at this point. According to the execution order, the flag read pointer generated by the processor core points to the left bit of the flags.
The processor core continues to execute the μOp segments 146, 148, and 149, which are retained by the branch decision of μOp 141. At this point, the flag read pointer moves right by one bit according to the rules and points to the middle bit of each flag. The processor core executes branch μOp 147 and obtains branch decision ‘0’, indicating no branching. This branch decision is compared with the middle bit, pointed to by the read pointer, of the flag attached to each μOp segment. The μOps whose middle flag bit is inconsistent with the branch decision, that is, all μOps of μOp segment 149 and its following μOp segments, whose corresponding flags are ‘11x’, ‘110’, and ‘111’, are aborted. All μOps of μOp segment 148 and its following μOp segments, whose corresponding flags are ‘10x’, ‘100’, and ‘101’, continue to be executed by the processor core. The cache system then makes the address read pointer point to the sequential new μOp segments following the μOp segments of μOp segment 148, and generates branch hierarchy flags for them. At this point, the write pointer of each flag points to the left bit of the flag, and the branch attribute of each new μOp segment is written to the left bit of its flag. Because the processor core has already compared the branch decision against the original left bit, the μOps have been resolved according to that bit, and the information in the original left bit is no longer useful; therefore, reusing the left bit to store the branch attribute of the new μOp segments causes no errors. The flag 140 may be viewed as a circular buffer: it is safe as long as the branch hierarchy depth of the μOps a processor core may simultaneously process is less than the branch hierarchy depth represented by the flag (here, the number of flag bits). The resulting flags, together with the μOps, are sent to the processor core for execution as described above. After executing a branch μOp, the processor core again moves the flag read pointer right by one bit, pointing to the right bit of the flag, in preparation for comparison with the next branch's judgment. By repeating the above, the cache system can continuously provide the μOps of all possible execution paths to the processor core while the branch decisions are still unknown, without branch penalty or mis-prediction penalty.
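A sketch of the decision check described above, reproducing the example flags; treating still-unresolved 'x' bits as always agreeing is an assumption of this simplified model.

```python
# Sketch of the decision check: each in-flight segment's flag bit at the
# read-pointer position is compared with the branch decision 91; segments
# that disagree are aborted.

def apply_decision(segments, read_ptr, decision):
    kept, aborted = [], []
    for name, flag in segments:
        (kept if flag[read_ptr] in (decision, 'x') else aborted).append(name)
    return kept, aborted

inflight = [('142', ['0', 'x', 'x']), ('146', ['1', 'x', 'x']),
            ('144', ['0', '0', 'x']), ('145', ['0', '1', 'x']),
            ('148', ['1', '0', 'x']), ('149', ['1', '1', 'x'])]

print(apply_decision(inflight, 0, '1'))   # abort 142/144/145 after uOp 141
print(apply_decision([s for s in inflight if s[0] in ('146', '148', '149')],
                     1, '0'))             # abort 149 after uOp 147
```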
The buses 168 are flag buses, four in total; each is the output of the flag unit 152 of one of the above four IRBs, and each is received by all four IRBs. The four flag buses 168 are also named A, B, C, and D after their driving IRBs. The four flag buses 168 A, B, C, and D output by the four IRBs, as well as the four sets of bit lines (such as bit line 118), are sent to the processor core. Accordingly, each of the four IRBs outputs a ready signal A, B, C, or D to inform the processor core to receive the flags on that buffer's flag bus 168 and the μOps on its bit lines (such as the bit lines 118). The processor core in turn sends the branch decision 91 and the flag read pointer 171 to each IRB to control their flag units 152. In the tracker that controls the L1 cache, the L1 cache address output by the adder is sent to the selector 155 of each IRB via bus 129; the controller in a ‘valid’ IRB then makes its selector select bus 129 to receive the address sent by the L1 cache tracker, saving its BN1X into register 153, with the BNY stored into the register 86 via selector 85.
Assume the read pointer 88 in the B instruction read buffer 150 points to the μOp segment where branch μOp 141 is located.
Assume the comparison result of the B comparator in comparator 154 of A IRB 150 is ‘identical’, and the status of A IRB 150 is ‘available’. That comparison result then controls the selectors 155 and 85 of A IRB 150 to select the BNY of the branch target address (μOp segment 146) on the B bus of bus 157 to store into the register 86 of A IRB 150, updating its read pointer 88; the comparison result also controls the selector 156 of A IRB 150 to select the flag and hierarchical branch pointer on the B bus of flag bus 168 to store into flag unit 152. In response to the branch matching request, the flag unit 152 moves the incoming flag write pointer right by one bit, so that it now points to the left bit, and writes ‘1’ into that left bit, making it the flag of the μOps of μOp segment 146; it then places the flag on the A bus of flag bus 168. The decoder 115 in A IRB 150 decodes the BNY on the read pointer 88 and sends the μOps of μOp segment 146 to the processor core via bit lines 118. The controller in B IRB 150 corresponds to the controller 87 shown in the embodiment described above.
If the comparison result of the B comparator in the comparator 154 in A IRB 150 is ‘identical’, but the status of A IRB 150 is ‘unavailable’, then the output of the selector 155 is temporarily stored (not shown in the figures).
The selector 85 in B buffer 150 defaults to selecting the output of adder 94 to update the register 86, so the read pointer 88 advances by the read width 135 each cycle. In the μOp segment containing the branch μOp 141, the flag write pointer 138 points to the right bit of the flag. The second condition mentioned above can be used to control the read width and thereby determine the rear boundary of the μOp segment, that is, the address of the branch μOp. The read width can be limited, for example based on the SBNY address, so that the last valid μOp among the μOps sent on the B bit lines 118 is the branch μOp; at the same time, the original flag is sent on the B bus of the flag bus 168, and the ‘ready’ signal is sent to the processor core through the B ready bus. For the sequentially next μOp segment (which starts from the μOp after the branch μOp 141, that is, μOp segment 142), after the read pointer 88 is added to the read width 135, the next read pointer points to the first μOp after the branch operation (the first μOp of μOp segment 142), and plural μOps starting from that μOp are sent. At this point, since the branch point has been crossed, the flag write pointer 138 in B buffer 150 moves right by one bit (crossing the right boundary and wrapping around to point to the left bit), and ‘0’ is written into this bit. The updated flag is sent on the B bus of flag bus 168, and a ‘ready’ signal is sent to the processor core on the B ready bus. If the branch μOp 141 is the last branch μOp of the L1 cache block, then the entry read from track row 151 addressed by the read pointer 88 of B IRB 150 is the end track point entry, and the address in this entry is put on the B bus of bus 157. The controller in buffer B identifies it as the end track point because the SBNY in the entry exceeds the L1 cache block capacity, and issues a sequential matching request to the IRBs. Each IRB compares the address on the B bus of bus 157 with the address in its register 153; here no match is found, so the cache system controls the selector 159 to select the address on the B bus of bus 157 to send to the L1 cache tracker.
Thus, each (source) IRB 150 automatically reads the entry in its track row 151 addressed by its read pointer 88 and sends it to each (target) IRB 150 for matching via the bus it drives on address bus 157. If a target IRB 150 matches and is valid, the flags arriving on the source bus of the flag bus 168 are stored into the flag unit 152 of the target IRB 150. If the said source entry is not the end track point, the flags are updated (since a branch point is crossed); if the source entry is the end track point, the flags are kept unchanged (since no branch point is crossed). The flags in the target IRB 150 are placed on the bus driven by the target IRB 150 in the flag bus 168. The BN1X of the above source entry is stored into the register 153 of the matched target IRB 150, and the BNY is saved into its register 86, so that the read pointer 88 of the matched target IRB 150 begins controlling its internal buffer 120 to send μOps. When the source IRB 150 sends a synchronization signal to the target IRB 150, the target IRB 150 sends its ‘ready’ signal to the processor core. Then the selector 85 in the target buffer 150 selects the output of adder 94, and the read pointer 88 steps forward. If the address BN1 read from the source entry is matched in no IRB 150, the selector 159 selects the bus carrying that address to send to the L1 cache, which reads the corresponding L1 cache block. If the entry is the end track point, the cache block, track, and other information read from the L1 cache and track table are stored into the source IRB 150, and the flags in the source IRB 150 are unchanged. If the entry is not the end track point, the cache block, track, and other information read from the L1 cache and the track table are stored into another IRB 150 in the ‘available’ state, and the flags from the source IRB 150 are stored into the flag unit 152 of that ‘available’ buffer 150 and updated.
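The matching step among IRBs can be sketched roughly as below; for brevity the flag is modeled as a small dictionary updated by a helper rather than by a pointer circuit inside flag unit 152, and all structure names are assumptions.

```python
# Rough sketch of target matching among the IRBs 150: a broadcast branch
# target BN1 is compared with each IRB's block number (register 153); on a
# miss an 'available' IRB is filled from the L1 cache. The flag is only
# advanced when the source entry is not an end track point.

def advance(flag, attribute='1'):
    ptr = (flag['ptr'] + 1) % 3              # write pointer one bit right
    bits = flag['bits'].copy()
    bits[ptr] = attribute                    # write the branch attribute
    return {'bits': bits, 'ptr': ptr}

def dispatch_target(irbs, bn1x, bny, source_flag, is_end_track_point):
    for irb in irbs:                         # comparator 154 in every IRB
        if irb['state'] == 'valid' and irb['bn1x'] == bn1x:
            target = irb
            break
    else:                                    # no match: fill from L1 cache 24
        target = next(i for i in irbs if i['state'] == 'available')
        target.update(bn1x=bn1x, state='valid')
    target['read_ptr'] = bny                 # register 86 of the target IRB
    target['flag'] = (source_flag if is_end_track_point
                      else advance(source_flag))

irbs = [{'state': 'valid', 'bn1x': 7, 'read_ptr': 0, 'flag': None},
        {'state': 'available', 'bn1x': None, 'read_ptr': 0, 'flag': None}]
dispatch_target(irbs, bn1x=9, bny=2,
                source_flag={'bits': ['0', 'x', 'x'], 'ptr': 0},
                is_end_track_point=False)
print(irbs[1])   # filled IRB now holds block 9 with flag bits ['0','1','x']
```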
Operating in this way, the address pointer 88 in each IRB 150 both controls its respective buffer 120 to continue providing μOps to the processor core and automatically checks the branch target addresses in the control flow information (tracks) corresponding to these μOps. The target addresses of these branches are matched among the IRBs 150; if no match is found, an L1 cache block is read from the L1 cache to update an IRB, so that μOps on all possible branch paths after a branch point whose decision has not yet been made are automatically and continuously provided to the processor core for speculative execution. The processor core executes the branch μOp to generate the branch decision, uses the branch decision to abort the μOps on the paths not selected for execution, and controls each IRB to abandon the address pointers on the non-selected branch paths. Please refer to the following embodiment.
Assume the processor core executes the branch μOp 141 while A IRB is sending the μOps of μOp segment 148, whose flag is ‘10x’; B IRB is sending the μOps of μOp segment 144, whose flag is ‘00x’; C IRB is sending the μOps of μOp segment 149, whose flag is ‘11x’; and D IRB is sending the μOps of μOp segment 145, whose flag is ‘01x’. The processor core makes the branch decision ‘1’ and sends it to each IRB 150 via bus 91. The flag read pointer 171 selects the left bit of each flag 140 to compare with the branch decision value ‘1’ on the bus 91. The IRBs 150 whose results differ stop their operations, and their states are set to ‘available’. Therefore, B IRB 150 (μOp segment 144) and D IRB 150 (μOp segment 145) stop sending μOps, and their states are set to ‘available’. Accordingly, the processor core, following the branch decision 91, aborts the execution of the μOps of μOp segments 142, 144, and 145 that were partially executed in the processor core. A and C IRB 150 continue sending the μOps of μOp segments 148 and 149 to the processor core, continue reading the entries in their track rows 151, and send the branch target addresses in those entries to the IRBs 150 for matching. If a match is reached in B and D IRB 150, the subsequent μOp segments following segments 148 and 149 are sent to the processor core under the control of the address pointers 88 of B and D IRB 150. If no match is reached, a cache block is read from the L1 cache and stored into the ‘available’ B and D IRB 150, and its μOps are sent to the processor core under the control of the address pointers 88 of B and D IRB 150.
Subsequent branch decisions are likewise compared with the flags of the μOps being executed and with the flags in each IRB 150, to decide to abort part of the μOps in the processor core and the addresses in the trackers of part of the IRBs 150.
The following illustration is based on the corresponding figure.
The read pointer 127 addresses the L1 cache to read an entire L1 cache block, which is sent to the buffer 120 in B IRB 150 to be stored. Using the BNY in read pointer 127 as the starting address, and the read width 65 calculated from the entry 33 in the block offset mapper 93 addressed by that pointer, valid μOps are also read directly from the L1 cache 24 and sent to the processor core 128 via the dedicated cache bus 48. The processor core identifies these μOps by the flags from the B bus on the flag bus 168 of the available B IRB 150. Meanwhile, the track in the track table 80 addressed by the BN1X on the read pointer 127 is sent to B IRB 150 via bus 163 and stored in the track row 151; and the entry 33 in the block offset mapper 93 is stored into the block offset row 122 in the IRB 150 via bus 163. The BNY obtained by the adder 124 adding the BNY in the read pointer 127 and the read width 65, together with the BN1X in read pointer 127, is sent to each IRB 150 via bus 129. The selector 155 in B IRB 150 has been set by the system controller to select bus 129; therefore this BNY is selected by selector 85 and stored in register 86 of B IRB 150, and the BN1X is stored in the register 153 of B IRB 150. Thereafter, the L1 cache 24 stops sending μOps to the processor core 128, and the B IRB 150 sends μOps to the processor core 128 via its bit lines 118.
Therefore, the processor system of this embodiment aborts part of the μOps and part of the address read pointers 88 in the IRBs 150 according to the branch decision 91 and the flag read pointer 171. For the detailed operations please refer to the following embodiment.
The IRB selected by the branch decision continues sending the following μOps, and the next branch target address in its track row 151 is sent to each IRB for matching via bus 157 as described above. When the number of μOps in μOp segment 164 is far larger than the number of μOps in μOp segment 142, so that the flags in the IRBs 150 are ‘00x’, ‘01x’, and ‘1xx’ (outputting μOp segments 144, 145, and 146, with another IRB 150 in the ‘available’ state), if the read pointer 171 points to the left bit of the flag 140 in each IRB 150 (the branch decision corresponding to branch point 141) and the branch decision 91 is ‘1’, then the IRBs 150 with flags ‘00x’ and ‘01x’ (μOp segments 144 and 145) stop operation and their states are changed to ‘available’; the IRB 150 with flag ‘1xx’ (outputting μOp segment 146) continues to send the following μOps, and the next branch target address in its track row 151 is sent to each IRB 150 for matching via bus 157.
When the processor core 128 has not yet made the branch decision at a branch point, it speculatively executes the μOps on the plural paths after that branch point at the same time; the branch decision 91 then selects the execution results of one path to commit to the architecture registers and aborts the μOps on the other paths. The execution result is stored into the ROB entry assigned to the μOp, and is also sent to any reservation station entry whose operand is that result; the reservation station entry corresponding to the μOp is then released for reallocation. When the μOp is decided to be non-speculative, its ROB state is marked as ‘finished’. When the head entry or entries of ROB 182 are ‘finished’, the results in these entries are committed to the register 184, and the ROB entries are released for reallocation.
In speculative out-of-order execution, the μOps are not executed in order, but issue and commit are sequential. A processor core 98 based on branch prediction executes the single path determined by the branch prediction; that path is issued sequentially by the cache system to the processor core, and the processor core 98 stores it sequentially into the ROB. The name dependencies (WAR, WAW) between μOps are removed by register renaming; the true data dependency (RAW) is preserved by the ROB entry numbers recorded in the reservation stations according to the order in which the μOps were issued. The commit order is guaranteed by the ROB order (essentially a FIFO buffer).
The allocator 181 receives the μOps from the plural IRBs 150 through their respective word lines 118 and searches the register alias table to perform register renaming, removing the name dependencies; it also allocates an entry of ROB 182 for each μOp; at the same time, it assigns the set of μOps a controller 188 to control the allocated ROB 182 entries. The processor core 128 has a plurality of controllers 188.
The IRB 150 sends the flag 140 generated in the flag unit 152 and the flag write pointer 138 via the flag bus 168, and they are stored into the correspondingly numbered fields of the assigned controller 188; it also sends the μOp read width 65 to be stored into the field 197. The ROB entries assigned to the μOps of the μOp set are stored in the field 176 in μOp order; the field 177 stores a time stamp. Field 178 stores the reservation station entry numbers assigned to the respective μOps in field 176. The total number of ROB entries allocated is equal to the read width 65. At the same time, the IRB 150 provides a time stamp to be stored into the field 177 of the controllers 188 assigned in the same cycle.
For the true data hazard RAW, the set of μOps recorded in the field 176 of a controller 188 must be checked for hazards in μOp order; if there is a RAW hazard between the μOps, then when a reservation station entry is assigned to the μOp that reads the register, the ROB entry number of the μOp that writes the corresponding register is written into the reservation station to replace the register address. In addition, hazards between this set and the μOps issued earlier on the same branch path must be detected. There are two cases. The first is to compare the flags of the newly assigned controller 188 with the flags in the other valid controllers 188; if they are identical and the time stamps of those controllers 188 precede the time stamp 177 of the newly assigned controller 188, then RAW hazards between the μOps in those controllers 188 and the μOps in the newly assigned controller 188 must be detected. The second is to detect the valid controllers 188 whose flag write pointer 138 is at a higher branch level than the write pointer 138 of the newly assigned controller 188; the μOp set corresponding to a controller 188 at a higher branch level precedes the μOp set corresponding to the newly assigned controller 188 in execution order, so hazard detection is needed there as well. If these two checks find a RAW hazard, the ROB entry number of the μOp that writes the operand replaces the register number when the μOp that reads the operand is issued to the reservation station.
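As a sketch of this cross-set detection, assume a lower write-pointer index means a higher branch level and that each controller records (register, ROB entry) write pairs; these structures are illustrative, not the patent's field layout.

```python
# Sketch of the cross-set RAW check: an older controller 188 is relevant
# when its flag equals the new set's flag with an earlier time stamp, or
# when its write pointer sits at a higher branch level.

def older_relevant(controllers, new):
    out = []
    for c in controllers:
        same_path = c['flag'] == new['flag'] and c['stamp'] < new['stamp']
        higher_level = c['wptr'] < new['wptr']
        if c['valid'] and (same_path or higher_level):
            out.append(c)
    return out

def rename_source(producers, read_reg):
    """Replace a register source by the ROB entry of its latest writer."""
    writers = [rob for c in producers
               for reg, rob in c['writes'] if reg == read_reg]
    return ('rob', writers[-1]) if writers else ('reg', read_reg)

ctrls = [{'valid': True, 'flag': ['1', 'x', 'x'], 'wptr': 0, 'stamp': 1,
          'writes': [('r3', 17)]}]
new = {'valid': True, 'flag': ['1', '0', 'x'], 'wptr': 1, 'stamp': 2}
print(rename_source(older_relevant(ctrls, new), 'r3'))   # ('rob', 17)
```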
Each of the μOps issued to the reservation station 183 is dispatched to an execution unit when the operands it uses are valid and the execution unit it needs is available, and the execution result is returned to the ROB entry assigned to that μOp. μOps from multiple branch paths can be dispatched by the reservation station and executed by the execution units at the same time. When the buffer system of this embodiment provides the μOps for the processor core, plural controllers 188 may be assigned to μOp segments in different clock cycles; in that case, the controllers are stored into the commit FIFO in time order according to the time stamp 177 of each controller 188 (the earliest first).
When a μOp is executed in the execution unit 185, the execution result is stored in the corresponding entry in the ROB 182, the execution status bit of that entry is set to ‘completed’, and in the controller corresponding to that ROB entry, the status recorded for that entry in the field 176 is also set to ‘completed’. The controller number output by the commit FIFO points to a controller 188; the entries recorded in its field 176 whose status is ‘completed’ are committed in order to the architecture register 184, and the committed ROB entries are returned to the resource pool for use by the register alias table and the allocator 181. When the ROB entries corresponding to all valid entries in the field 176 have been committed, the controller 188 is set to ‘invalid’ and returns to the resource pool for reuse. At this point, the read address of the commit FIFO steps forward, the next entry of the commit FIFO is read out, and the controller 188 it points to starts committing its corresponding ROB entries. The flag system and the commit FIFO guarantee the sequential commitment between μOp sets, and the ROB entry order stored in the field 176 of the controller 188 guarantees the sequential commitment of the μOps within a set.
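A compact model of this two-level ordering (commit FIFO between sets, field 176 order within a set) follows; the data shapes are assumptions for illustration.

```python
# Sketch of in-order commitment: the commit FIFO yields controller numbers
# in time order; within a controller 188, ROB entries commit in the stored
# order once 'completed'.

from collections import deque

def try_commit(commit_fifo, controllers, rob):
    committed = []
    while commit_fifo:
        ctl = controllers[commit_fifo[0]]
        while ctl['entries'] and rob[ctl['entries'][0]] == 'completed':
            committed.append(ctl['entries'].pop(0))   # to architecture regs
        if ctl['entries']:
            break                    # head entry not finished: stop here
        commit_fifo.popleft()        # controller fully committed: free it
    return committed

rob = {10: 'completed', 11: 'completed', 12: 'pending'}
controllers = {0: {'entries': [10, 11]}, 1: {'entries': [12]}}
print(try_commit(deque([0, 1]), controllers, rob))    # [10, 11]
```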
Each time the processor core finishes a comparison with a branch decision, the read pointer 171 is shifted right by one bit, so that the next branch decision 91 is compared with the next bit of the flag 140 in each controller 188. When the system is reset, the read pointer 171 and the write pointers 138 in each IRB 150 are set to the same value, for example all pointing to the left bit, to synchronize the read pointer 171 with each write pointer 138. The flag system thus makes the cache system cooperate with the processor core 128 to speculatively execute the μOps on several paths through dispatch, execution, and write-back, while only the execution results of the μOps selected by the branch decision are committed in order to the architecture registers. Existing in-order or out-of-order multi-issue cores need only slight modification of their ROB to cooperate with the caching system described in this disclosure.
The scheduler 187 controls each μOp to read its operands from the physical register file 186 and send them to the execution unit; the scheduler 187 can dispatch a plurality of μOps to different execution units 185 per cycle. The result of execution by the execution unit 185 is written back to an entry in the physical register file 186, addressed by the execution result address stored in the ROB 182 entry assigned to the μOp. The scheduler 187 entry corresponding to the μOp that completes its operation is released for reallocation. When the μOp is determined to be non-speculative, the state of its ROB 182 entry is marked as ‘completed’; when the entry or entries at the head of the ROB 182 are ‘completed’, the addresses stored in these entries are committed to the register mapping table in the processor core 128, so that the architecture register addresses stored in these entries are mapped to the execution result addresses stored in the same entries, and these ROB entries are released for reallocation. This embodiment is shown in the corresponding figure.
In the out-of-order multi-issue processor system of Ops (or instructions) traces that contains different numbers of
Ops (or instructions), so that the simple sequence is not enough to guarantee the logic of the program is correctly executed and expressed. The present disclosure issues the
Ops (or instructions) in the unit of
Op (or instruction) segment that ends with single
Op (or instruction), and uses a flag (flag) system to send the branch relationship of the
Op (or instruction) segments from the issue end (IRB in this disclosure) to the commit end (ROB in this disclosure), and uses the branch decision 91 generated by the processor core to select one branch of the branch to commit to guarantee the logic of the program is correctly executed and expressed. Its operation does not affect the execution of the program between the issue and the commitment; therefore, it can work together with various execution modes such as sequential execution or out-of-order execution, various instruction set architectures such as fixed or variable-length instruction set, various implementation technologies, such as the register renaming, the reservation stations, the schedulers and so on.
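The parent-child relationship conveyed by the flag system can be pictured with a short sketch. This is a hypothetical representation only: in the disclosure a flag 140 is a bit vector written through the write pointer 138, whereas here a flag is modeled as a bit string that grows by one branch attribute per branch level:

    def child_flag(parent_flag, branch_attribute):
        # A child segment inherits its parent's branch history and appends
        # its own branch attribute ('1' = taken trace, '0' = fall-through).
        return parent_flag + branch_attribute

    root = ''                        # segment before any pending branch
    taken = child_flag(root, '1')    # target trace of the first branch
    fallthru = child_flag(root, '0') # sequential trace of the first branch
    print(child_flag(taken, '0'))    # '10': a grandchild on the taken trace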
Since the embodiment in
The existing multi-issue processor requires the cache system to store the instructions or μOps required by the processor core in an instruction buffer, such as the IRB 150 in the embodiment described above.
The read scheduler 158 in the IRB 200 is similar to the read scheduler 158 in the embodiment described above.
The tracker in the IRB 200 also varies depending on the method by which the entries are read. The IRB 200 does not itself send out a fixed number of instructions in each cycle; instead, its tracker read pointer 88 outputs a start address, and the track row 151 addressed by the read pointer 88 outputs the SBNY field 75 of the entry as the end address. The entries between the start address and the end address in the register set 201 of the IRB 200 are then accessed by the scheduler and other units. Here the tracker uses the incrementor 84 instead of the adder 94, and the input of the incrementor 84 is connected to the SBNY field 75 on the output of the track row 151. In addition, a subtractor 121 is added to compute the difference between the end address and the start address as the read width 65 for the ROB to use.
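As a worked illustration of this arithmetic (function and variable names are hypothetical), the read width 65 and the next value of the read pointer 88 can be computed as:

    def segment_bounds(read_pointer_88, sbny_75):
        read_width_65 = sbny_75 - read_pointer_88 + 1  # subtractor 121
        next_pointer = sbny_75 + 1                     # incrementor 84
        return read_width_65, next_pointer

    # A segment occupying BNY addresses 3,4,5,6: width 4, next start 7.
    assert segment_bounds(3, 6) == (4, 7)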
The allocator 211 contains an address extractor, an instruction hazard detector, and a register alias table. The allocator 211 is triggered by the ready signal from the IRB 200 and stores the corresponding flags on the flag bus 168. The address extractor reads field 202 of the IRB 200 entries between the start address and the end address, and extracts the operand architecture register addresses and the target architecture register addresses, which are sent to the instruction hazard detector for hazard detection. The instruction hazard detector also detects hazards between the target architecture register addresses of the parent instruction segment sent by the ROB 210 and the operand architecture register addresses in the IRB 200. The instruction hazard detector queries the register alias table based on the result of the detection, and the register alias table renames the operand architecture register addresses in field 202 to operand physical register addresses and stores them back into field 203 of the IRB 200 entries. The register alias table also renames the target architecture register addresses in field 202 into target physical register addresses and stores them into the ROB block 190 allocated to the instruction segment in the IRB 200. The allocator 211 records the assigned physical register resources in lists, one per ROB block, and each list also contains a flag. In the allocator 211, the flag read pointer 171 generated by the branch unit selects one bit of the flag 140 of each list and compares it with the branch decision 91 generated by the branch unit. The physical registers in the lists whose comparison result is 'different' are released. When a ROB block 190 is completely committed, the physical registers in its corresponding list are also released.
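A much-simplified software model of such a register alias table is sketched below (all names are hypothetical, and the per-ROB-block resource lists and their flags are omitted): each target architecture register receives a fresh physical register from a pool, and later operand references are renamed to it.

    class AliasTable:
        def __init__(self, num_phys):
            self.map = {}                     # architecture reg -> physical reg
            self.free = list(range(num_phys)) # physical register pool

        def rename_operand(self, arch):
            # Field 202 operand address -> field 203 physical address;
            # unmapped registers are passed through in this simplification.
            return self.map.get(arch, arch)

        def rename_target(self, arch):
            phys = self.free.pop(0)           # allocate from the pool
            self.map[arch] = phys             # later readers see this name
            return phys                       # recorded in the ROB block 190

    rat = AliasTable(8)
    p = rat.rename_target('r1')               # writer of r1 gets physical reg 0
    assert rat.rename_operand('r1') == p      # a later reader of r1 is renamed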
The issue rule may also be set to issue when the issue pointer 209 is greater than or equal to the flag write pointer 138, which allows out-of-order issue across branch levels. In this case, the right shift of the issue pointer 209 can be determined by the length of the queue or the amount of resources; for example, when the queue is shorter than a certain length, the issue pointer 209 is shifted right. The issue priority may also be determined using the branch prediction stored in field 76 of the entries in the track row 151. In this case, the bus 75 sent from the IRB 200 carries the field 76 branch prediction in addition to the SBNY. Assuming that field 76 is a single binary bit, the scheduler 212 compares the branch prediction value of field 76 with the bit of the flag 140 of each entry pointed to by the issue pointer 209, and the entries whose comparison results are 'same' are issued with priority. The last μOp in a μOp segment is the branch μOp, which means that the last μOp in the entries of the controller should be issued with the highest priority. The scheduler 212 may detect whether the SBNY address on field 75 exceeds the size of the L1 cache block, to exclude the end track point (which is not a branch μOp and does not require priority issue) when unit 207 is filled according to the start address and the end address. The read pointer 171 generated by the branch unit selects one bit of each of the valid flags 140 in the controller 199 to be compared with the branch decision 91. If the comparison result is 'same', the corresponding entry is not changed and continues to issue according to the BNY address in the entry. If the result of the comparison is 'different', the valid bit of the flag 140 in the corresponding entry is set to 'invalid'. If the valid bits in all of the sub-controllers 199 corresponding to one IRB 200 are 'invalid', the μOps stored in the controller 199 pending issue have either all been issued or all been aborted. The state of that IRB 200 is then 'available', and an L1 cache block from the L1 cache 24 and its corresponding track and so on can be written into the IRB 200. The IRB 200 is not available while at least one of the valid bits in the sub-controllers 199 within the scheduler 212 corresponding to that IRB 200 is 'valid'. That is, whether the contents of the IRB 200 can be overwritten is now determined by the controller state in the scheduler 212.
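The priority rule described above can be summarized in a small sketch (hypothetical structures: each pending entry carries its BNY, the flag bit selected by the issue pointer 209, and the field 76 prediction): entries on the predicted trace issue first, and within a segment the branch μOp, which has the largest BNY, issues first.

    def issue_order(entries):
        def priority(e):
            on_predicted_path = (e['flag_bit'] == e['prediction'])
            # Predicted-path entries first; larger BNY (the branch uOp) first.
            return (not on_predicted_path, -e['bny'])
        return sorted(entries, key=priority)

    entries = [{'bny': 4, 'flag_bit': '1', 'prediction': '1'},
               {'bny': 6, 'flag_bit': '1', 'prediction': '1'},
               {'bny': 5, 'flag_bit': '0', 'prediction': '1'}]
    print([e['bny'] for e in issue_order(entries)])   # [6, 4, 5]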
If the L1 cache block 213 is not sufficient to accommodate all of the μOps, an extra L1 cache block (such as the L1 cache block 214 in the embodiment described above) may be used to hold the remainder.
If the L1 cache is a fully associative structure, for example the L1 cache structure addressed by the mapping of the block address mapper 81 in the embodiment described above, the same method applies.
Assume that the address bus 157 carries a branch target address, that the flag bus 168 carries the flag of its source branch point together with the matching request, and that the read scheduler 158 in the D IRB 200 matches successfully as described above.
The allocator 211 is triggered by the 'ready' signal on ready bus D and, according to the address '3' on the D bus of bus 88 and the address '6' on the D bus of bus 75, reads the μOps from field 202 of the IRB 200 entries with BNY addresses 3, 4, 5, and 6. The system performs a dependency check on the operand register addresses and target register addresses of these μOps. The ROB 210 is also triggered by the 'ready' signal on ready bus D and makes each of the controllers 188 execute two operations. One is detecting the branch history of the 'unavailable' ROB blocks 190 based on the flags on the D bus of the flag bus 168. As described above, the branch history detection finds the ROB blocks with a higher branch level than the ROB block waiting to be assigned, and then sends the target register addresses in field 193 of the valid entries of the ROB blocks carrying the grandfather and father flags of the μOp segment being checked to the allocator 211 via bus 226. A dependency check is performed between said target registers and the operand register addresses in the entries with BNY addresses 3, 4, 5, and 6. The allocator 211 queries the register alias table according to the result of the dependency check and renames each architecture register address.
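The inter-segment dependency check can be sketched as a simple set intersection (hypothetical helper; the hardware performs this with comparators on the target addresses sent over bus 226): the target registers of the father and grandfather segments are compared against the operand registers of the segment being allocated.

    def detect_hazards(parent_targets, operand_regs):
        """Operand registers that must take their value from a parent
        segment's physical register instead of the committed mapping."""
        return sorted(set(parent_targets) & set(operand_regs))

    parents = ['r2', 'r7']           # field 193 targets of parent segments
    operands = ['r1', 'r2', 'r3']    # operands of entries with BNY 3..6
    print(detect_hazards(parents, operands))   # ['r2'], renamed accordingly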
Another operation executed by each controller 188 is to detect the presence of available ROB blocks 190. If there is no available ROB block 190 in the ROB 210, an 'unavailable' signal is fed back to the scheduler 212, and the scheduler 212 suspends the updating of the register 86 in the D IRB 200. If the 'U' ROB block 190 in the ROB 210 is 'available', an 'available' signal is fed back to the scheduler 212; the flags on the D bus of the flag bus 168 are stored into the flag 140 and the flag write pointer 138 of the controller 188 of the 'U' ROB block 190, the start address on the D bus of bus 88 is stored into the field 176, and the read width '4' on the D bus of bus 65 is stored into the field 197 of the controller 188, which makes only entries 0-3 in that ROB block valid. The label 'U' of the assigned ROB block 190 is sent back and stored into the field 204 of the D IRB 200.
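The allocation step can be modeled as follows (hypothetical structure; the field names follow the reference numerals of the text): an available ROB block captures the flag, the start address, and the read width, and only the entries covered by the read width become valid.

    def allocate_rob_block(rob_blocks, flag_140, start_88, width_65):
        for label, blk in rob_blocks.items():
            if blk['available']:
                blk.update(available=False, flag=flag_140,
                           field_176=start_88, field_197=width_65,
                           valid_entries=list(range(width_65)))  # entries 0..3
                return label          # 'U', stored into field 204 of the IRB
        return None                   # stall: the scheduler suspends register 86

    blocks = {'U': {'available': True}}
    assert allocate_rob_block(blocks, '10', 3, 4) == 'U'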
The allocator 211 executes the hazard detection and the register renaming in the manner described above.
The scheduler 212 receives the information that the ROB block 190 has been allocated in response to the request on ready bus D; that is, based on the start address '3' on the D bus of bus 88 and the end address '6' on the D bus of bus 198, the BNY addresses '3, 4, 5, 6' are stored into a sub-controller 199 in the D controller. The scheduler 212 then updates the register 86 in the D IRB 200, wherein the selector 85 in the D IRB selects the output of the incrementor 84 in the D IRB, so that the read pointer 88 in the D IRB becomes the SBNY value '6' on bus 75 plus '1', that is '7', which is the start address of the next instruction block. At the same time, the scheduler 212 also updates the flag unit 152 in the D IRB 200: since the read pointer crosses the branch point at BNY address '6', the flag write pointer 138 in the flag unit 152 is shifted one bit to the right, and '0' is written into the bit of the flag 140 pointed to by the write pointer 138. The new flag 140 and the new flag write pointer 138 are placed on the D bus of bus 168. The flag unit 152 also sets the ready signal D to 'ready'; based on that ready signal the allocator 211 requests the ROB 210 to allocate a ROB block 190 and reads the target register addresses in the ROB blocks with higher branch levels for hazard detection. The read pointer 88 of the D IRB 200 also reads the next entry from the track row 151, whose BN1X field 72 address and BNY field 73 address are placed on the D bus of bus 157 and sent to each IRB 200 for matching. The SBNY field 75 of this entry is placed on the D bus of bus 198 as the end address. The subtractor 121 obtains the read width 65 by subtracting the value of the read pointer 88 from the value of field 75 and adding '1'. The start address is sent via the D bus of bus 88, the end address via the D bus of bus 198, and the read width via the D bus of bus 65, to the scheduler 212, the allocator 211, and the ROB 210. Operations similar to the above are then repeated to allocate resources for the next μOp segment.
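Using the concrete values of this example, the pointer and flag updates can be traced step by step (a hypothetical representation; the two flag bits shown as already written are arbitrary placeholders):

    flag_140 = ['1', '1', None, None]   # bits written so far, left to right
    write_pointer_138 = 2               # shifted one bit to the right
    flag_140[write_pointer_138] = '0'   # '0': the fall-through branch attribute
    read_pointer_88 = 6 + 1             # SBNY '6' plus one, the next start '7'
    read_width_65 = 6 - 3 + 1           # end '6' minus start '3' plus one = 4
    print(flag_140, read_pointer_88, read_width_65)
    # ['1', '1', '0', None] 7 4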
The scheduler 212 queries the operand valid signals in field 203 of entries 3, 4, 5, and 6 in the D IRB 200 according to the BNY addresses stored in the sub-controller 199 of the D controller. The μOp in the entry with the largest BNY is dispatched first, because that entry may store the branch μOp. At this point, if all the operands in the entry with BNY '5' are valid, the scheduler 212 selects, according to the operation type in field 202 of the entry, the queue 208 of an execution unit 218 that can execute that operation type, and the IRB number 'D' and the BNY value '5' are stored into the queue (of course, the relevant register addresses, operation, execution unit, and so on can also be stored directly into the queue). When the IRB number and the BNY value reach the head of the queue 208, then according to that value, the operation type in field 202 of the entry with BNY '5' in the D IRB 200, the target physical register address in field 203, the ROB block number 'U', the BNY '5', and the flags in the sub-controller 199 are read and sent via bus 215 to the execution unit 218; the operand physical register address in field 203, the execution unit number 216, and the flags in the sub-controller 199 are also read and sent to the register file 186 via bus 196. The register file 186 reads the operand by the operand physical register address and sends it, according to the execution unit number, to the execution unit 218 via bus 217 for execution. The execution unit 218 executes the operation on the operand according to the operation type. Upon completion, the execution unit 218 stores the execution result into the register file 186 via bus 221 according to the target physical register address sent by the IRB, and sends the ROB block number 'U' and the BNY '5' to the ROB 210. The ROB 210 sends BNY '5' to the 'U' ROB block 190, where the controller 188 subtracts the start address '3' in its field 176 from '5' to get '2', so that the execution status bit 191 in entry number 2 is set to 'finished'. Field 194 of entry number 2 already stores the same target physical register address to which the operation result was written. The ROB block 190 commits via the commit FIFO in the order of the branch levels carried in the flags, as previously described. When an entry in the ROB block is committed, the addresses in fields 193 and 194 of the entry are sent to the allocator 211 via bus 126. The allocator 211 maps the architecture register address in field 193 to the physical register address in field 194 in its register alias table; that is, subsequent accesses to the architecture register recorded in field 193 access the physical register recorded in field 194. The structure may also be optimized so that the target physical register address is not stored in field 203 of the IRB 200: when the queue 208 in the scheduler 212 sends the operation type and operands via bus 215 to the execution unit 218 for execution, it also sends the number of the execution unit 218 to the physical register file 186, and sends the execution unit number of 218 along with the ROB block number 'U' and the BNY address to the reorder buffer 210 to read the target physical register address, which is sent to the physical register file 186; in the register file 186, the execution result from the execution unit 218 is matched with the physical register address from the ROB 210 according to the execution unit number of 218, and that address is used for storing the result.
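The write-back indexing and the commit-time mapping in this example reduce to a few lines (hypothetical model; field names follow the reference numerals): the controller locates the ROB entry by subtracting the start address from the returned BNY, and at commit the architecture register of field 193 is mapped to the physical register of field 194.

    start_176, bny = 3, 5
    entry = bny - start_176                   # entry number 2
    rob_block = [{'status': None, 'f193': None, 'f194': None}
                 for _ in range(4)]
    rob_block[entry].update(status='finished', f193='r1', f194='p9')
    alias_table = {}
    e = rob_block[entry]
    alias_table[e['f193']] = e['f194']        # later reads of r1 access p9
    print(entry, alias_table)                 # 2 {'r1': 'p9'}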
The branch unit 219 executes branch μOps and generates the branch decision 91. The branch unit 219 also generates a flag read pointer 171, which moves right by one bit each time a branch μOp is executed. The branch unit 219 sends the branch decision 91 and the flag read pointer 171 to the allocator 211, the scheduler 212, the ROB 210, the execution units 218, 185, etc., and the physical register file 186. The flag read pointer 171 selects one bit of each of the valid flags in each unit to compare with the branch decision 91; the operations in units 211, 218, 185, and 186 are similar to those of the embodiment described above.
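The comparison performed in each of those units follows one rule, sketched below (hypothetical structures): the bit of every valid flag selected by the read pointer 171 is compared with the branch decision 91, and a mismatch invalidates the entry so that wrong-path μOps are abandoned at their source.

    def apply_branch_decision(flag_entries, read_pointer_171, decision_91):
        for f in flag_entries:
            if f['valid'] and f['bits'][read_pointer_171] != decision_91:
                f['valid'] = False    # abort this trace and its descendants
        return flag_entries

    flags = [{'valid': True, 'bits': '10'}, {'valid': True, 'bits': '00'}]
    apply_branch_decision(flags, 0, '1')
    print([f['valid'] for f in flags])        # [True, False]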
When an unconditional branch μOp issues, its subsequent μOps do not need to be issued. The controller in the IRB 200 (similar to the controller 87 of the previous embodiment) detects the type field 71 of each entry in its track except the rightmost column (the end track point). In the case of an unconditional branch type, the register 86 in the tracker is not updated after the address of the corresponding μOp is sent via bus 198; that is, the μOps after the unconditional branch μOp are not issued, so that the μOps of other traces can use the resources in the processor. In this optimization, the branch unit 219 executes unconditional branch μOps as usual, generating a branch decision 91 of value '1' and a flag read pointer 171. In this situation, the branch of branch attribute '0' after the unconditional branch, together with its child and grandchild flags, does not exist, and the processor resources are all used on the branch of branch attribute '1' after the unconditional branch and its child and grandchild flags.
In another optimization, a flag read pointer 171 is created in each unit, and the branch unit only needs to send a stepping signal to each unit after executing a branch instruction or branch operation, making the flag read pointer of each unit move one bit to the right. All flag read, write, and issue pointers can be kept synchronized by resetting them to point to the same flag bit when the system starts.
The operation above is performed by the tracker in the IRB 200 reading the branch targets in the track row 151 and passing them to each IRB 200 via bus 157, so that the μOps are read from the cache system into the IRB registers. The IRB 200 divides the μOps into μOp segments ending with a branch μOp, and provides the start address 88 and the end address 75 of each μOp segment. The IRB 200 also generates a ready signal for each μOp segment based on the branch level and branch attribute of the μOp segment, and generates the flag 140 and the flag write pointer 138 and sends them to the allocator 211, the scheduler 212, and the ROB 210 via the flag bus 168. The allocator 211 allocates resources to the μOp segment according to the flag, the resources including the physical register file 186 and the ROB blocks 190 in the ROB 210. The scheduler 212 issues the μOps in the order of the branch levels in the flags and fetches the operands from the physical register file 186 for the execution units 185; the execution results are written into the physical register file 186, and the execution states are recorded in the ROB 210. The branch unit 219 executes the branch μOps, generates the branch decision 91 and the read pointer 171, and sends them to the allocator 211, the scheduler 212, the execution units 185, 218, etc., the physical register file 186, and the ROB 210. μOps that do not comply with the execution trace of the program are abandoned in all pipelines from the source. Finally, the ROB 210 commits the execution results of the μOps that fully comply with the program execution trace to the allocator 211. The allocator 211 maps the architecture register addresses to the physical register addresses of the execution results and completes the retirement of the μOps.
The present embodiment forms a clear address mapping relationship between instruction sets with different addressing rules, extracts the control flow information embedded in the instructions, and stores it as a control flow net. A plurality of address pointers are used to automatically pre-fetch instructions from the lower-level memory into the upper-level memory along the stored control flow net, and each address pointer can read, from a multi-read-port high-level memory, the instructions on all possible execution traces within a certain control node (branch) level following said control flow net, and send all of those instructions to the processor core for fully speculative execution. The size of the above range depends on the delay with which the processor core makes branch decisions. In this embodiment, the possible subsequent instructions/μOps of the instructions/μOps in each memory level are at least stored, or being stored, in the memory one level lower. In the high-level memory that the processor core can access directly, the address mapping between instruction sets with different addressing rules has been completed, so that memory can be addressed directly by the address pointers used internally by the processor. The present embodiment synchronizes the operations of the functional units of the processor system with a hierarchical branch flag system. The address pointer assigns a flag carrying a bounded branch history, organized by branch level, according to the branch trace and the branch attribute. Each speculatively executed instruction has its corresponding flag while it is temporarily stored or operated on in the processor core. The scheduler issues instructions in order of the branch level in the flag; it can decide the priority among the different traces at the same branch level according to the branch attribute of the instruction and its branch prediction value, and can also dispatch the branch instruction first. The branch unit executes a branch instruction and produces a branch decision with a branch level mark. The branch decision of a level is compared with the flags of each pointer and each instruction at the same branch level, so that the processor core aborts the instructions at that branch level whose branch attributes differ from the branch decision, together with the instructions in their child and grandchild branches, and continues executing the instructions at that branch level whose branch attributes are the same as the branch decision, together with the instructions in their child and grandchild branches. The resources occupied by the pointers and instructions abandoned by the branch decision are reused for the child and grandchild branches of the pointers and instructions that continue to be executed. By repeating the above, the processor system of this embodiment is capable of executing the μOps translated from instructions without stalling, hiding the branch delay and branch penalty, and its cache miss rate is also lower than that of existing processor systems employing μOp caches.
It should be understood that the various components listed in the above embodiments are for ease of description only; other components may be included, and some components may be combined or omitted. The described components may be distributed over a plurality of systems physically or virtually, and may be implemented by hardware (such as integrated circuits), by software, or by a combination of hardware and software.
It is understood by one skilled in the art that many variations of the embodiments described herein are contemplated. While the invention has been described in terms of an exemplary embodiment, it is contemplated that it may be practiced as outlined above with modifications within the spirit and scope of the appended claims.
Number | Date | Country | Kind
201510091245.4 | Feb 2015 | CN | national
This application is the U.S. National Stage of International Patent Application No. PCT/CN2016/074093, filed on Feb. 19, 2016, which claims priority to Chinese Patent Application No. 201510091245.4, filed on Feb. 20, 2015, the entire contents of all of which are hereby incorporated by reference.
Filing Document | Filing Date | Country | Kind
PCT/CN2016/074093 | 2/19/2016 | WO | 00