This disclosure generally relates to computers, communications, and integrated circuits.
The CPU of a stored-program computer generates addresses and sends them to memory, fetches instructions and data from memory into the CPU for execution, and sends the execution results back to memory for storage. Memory capacity increases as technology advances, resulting in increasing memory access latency and increasing memory access channel latency. CPU execution speed, however, also increases as technology advances, so memory access latency becomes the bottleneck of computer performance advancement. Stored-program computers therefore employ caches to hide the memory access latency and ease this bottleneck. The CPU accesses instructions or data from the cache by the same method: the processor core in the CPU generates addresses and sends them to the cache, and if the addresses match the tags stored in the cache, the cache returns the corresponding information to the processor core for execution, thus averting the memory access latency. Cache capacity likewise increases as technology advances, resulting in increasing cache access latency and increasing cache access channel latency. Since processor execution speed keeps increasing as well, the cache access latency becomes an even worse bottleneck of computer performance advancement.
The aforementioned method, in which the processor core fetches information (including instructions and data) from memory for execution, may be viewed as the processor core pulling information from memory. Pulling information has to endure the channel latency twice: once when the processor sends the address to memory, and again when the memory sends the information back to the processor core. In addition, to support this information-pulling method, every stored-program computer or processor employs functional blocks for generating and keeping the addresses. A stored-program computer has instruction-fetch stages in its pipeline. A modern stored-program computer employs a plural number of pipeline stages to fetch instructions, which deepens the pipeline and increases the penalty when a branch misprediction takes place. In addition, generating and keeping a long instruction address consumes substantial energy. In particular, a computer that converts variable-length instructions into fixed-length micro-ops is costly, as it may need to reverse-convert the fixed-length micro-op address back into the variable-length instruction address to index the cache.
The disclosed methods and systems are directed to solve one or more problems set forth above and other problems.
This disclosure proposes a processor system comprising a serving cache and a corresponding processor core, wherein the processor core neither generates nor maintains an instruction address, nor does its pipeline contain an instruction-fetch segment. The processor core only provides the serving cache with a branch decision and, when an indirect branch instruction is executed, a base address stored in the register file. The serving cache extracts and stores the control flow information contained in the stored instructions, and supplies (serves, pushes) instructions to the processor core for execution according to that control flow information and the branch decision. When an indirect branch instruction is encountered, the serving cache provides the correct indirect branch target instruction to the processor core based on the base address received from the processor core. Further, the serving cache may provide the processor core with both the fall-through instruction and the branch target instruction; the branch decision generated by the processor core chooses one of the two instructions for execution, making it possible to mask the delay of transferring the branch decision from the processor core to the serving cache. Further, the serving cache may store the base address of an indirect branch instruction together with the corresponding indirect branch target address, so that it can reduce or eliminate the delay of pushing the indirect branch target instruction, thus partially or completely masking the delay of transferring the base address from the processor core to the serving cache. Further, the serving cache may forward instructions to the processor core in advance, based on the control flow information stored therein, thus partially or totally masking the delay of transmitting information from the serving cache to the processor core.
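By way of illustration only, the push model described above may be sketched as a simplified software model. The class and function names below are hypothetical and do not correspond to any claimed hardware structure; one instruction per address unit is assumed.

```python
# Simplified software model of the "push" scheme: the serving cache
# supplies both candidate next instructions, and the processor core
# contributes only a branch decision, never an address.

class ServingCache:
    def __init__(self, instructions, branch_targets):
        self.instructions = instructions      # instruction storage
        self.branch_targets = branch_targets  # extracted control flow: {pc: target_pc}

    def push(self, pc):
        """Serve the fall-through instruction and, for a branch, the target."""
        fall_through = self.instructions[pc + 1]
        target_pc = self.branch_targets.get(pc)
        target = self.instructions[target_pc] if target_pc is not None else None
        return fall_through, target


def core_select(fall_through, target, branch_taken):
    # The core merely chooses between the two served instructions.
    return target if (branch_taken and target is not None) else fall_through
```

Serving both candidates before the decision is available is what masks the decision-transfer delay: whichever way the branch resolves, the chosen instruction has already arrived at the core.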
The processor core of the processor system proposed by this disclosure needs neither a pipeline segment to fetch instructions nor logic to generate and record the instruction address.
This disclosure proposes a multi-level cache hierarchy in which the last (lowest) level cache (LLC) is a set-associative organization with a virtual-to-physical translation look-aside buffer (TLB) and a tag unit (TAG). A virtual address is translated into a physical address by the TLB, and the resulting physical memory address is matched against the contents of the TAG to obtain the cache address of the LLC. Since the LLC cache address is mapped from the real memory address, the LLC cache address is effectively a physical address. The resulting LLC cache address can be used to address the LLC's information memory RAM and can also be used to select the LLC active list. The LLC active list stores the mapping between LLC cache blocks and cache blocks in the next higher level cache; that is, the LLC active list is addressed by the LLC cache address, and its entry is the corresponding higher-level cache block address. In this disclosure, all caches other than the LLC are fully associative organizations, which are addressed directly by their own cache addresses and require neither the tag unit TAG nor the TLB. The cache address of each level is mapped to the higher-level cache address through an active list. Each active list is similar to the LLC active list: it is addressed by the cache address of its own level, and the higher-level cache address is stored in the entry. The highest-level cache has a corresponding track table (TT), which stores the control flow information extracted by the scanner from the instructions stored in the highest-level cache memory RAM. The track table is addressed by the highest-level cache address, and its entries store the branch target addresses of branch instructions.
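The lookup chain at the LLC can be illustrated with the following simplified model. The page size, block size, and flat table structures are assumptions for illustration only, not limitations of the disclosed hardware.

```python
# Illustrative LLC lookup: the TLB translates the virtual page, the TAG
# array maps the physical block address to an LLC cache address, and the
# LLC active list maps that cache address to a higher-level cache block.

PAGE_BITS = 12   # assumed 4 KB pages
BLOCK_BITS = 6   # assumed 64-byte cache blocks

def llc_lookup(vaddr, tlb, tag, active_list):
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    ppn = tlb[vpn]                                    # virtual -> physical
    phys_block = ((ppn << PAGE_BITS) | offset) >> BLOCK_BITS
    llc_addr = tag.index(phys_block)                  # TAG match -> LLC cache address
    higher_block = active_list.get(llc_addr)          # higher-level mapping, if any
    return llc_addr, higher_block
```

Because the LLC cache address is produced from the physical address only once, the higher levels can thereafter be addressed by cache addresses alone, with no TAG or TLB of their own.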
The tracker (TR) generates the highest-level cache address, which addresses the first read port of the highest-level cache memory to output the fall-through instruction to the processor core; the corresponding branch target address is also read out from the corresponding entry in the track table according to the same highest-level cache address, and is used to address the second read port of the highest-level cache memory to output the branch target instruction to the processor core as well. The processor core executes the branch instruction to generate a branch decision, selects one of the above two instructions, and drops the other. The branch decision also controls the tracker to select the corresponding one of the two paths to address the highest-level cache, so as to continuously push instructions to the processor core.
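A minimal sketch of one tracker step follows, assuming dict-based models of the track table and the cache memory (illustrative names only; both read ports are modeled as plain lookups).

```python
# One tracker step: the read pointer addresses the cache (port 1,
# fall-through) while the track table entry it selects addresses the
# cache again (port 2, branch target). The branch decision then steers
# the read pointer down one of the two paths.

def tracker_step(rpt, track_table, cache, branch_taken):
    fall_through = cache[rpt + 1]
    target_addr = track_table.get(rpt)            # None for non-branch entries
    target = cache[target_addr] if target_addr is not None else None
    if branch_taken and target_addr is not None:
        return target_addr, target                # follow the taken branch
    return rpt + 1, fall_through                  # follow the fall-through path
```

Both instructions are read before the decision arrives, so the decision only selects; it never triggers a fresh fetch.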
This disclosure proposes a cache replacement method that determines the cache block to be replaced according to the degree of correlation between cache blocks. The track table records the jump paths from branch sources to branch targets. This disclosure additionally uses a correlation table to record, for each cache block, the corresponding lower-level cache address, the jump paths of branch sources into the cache block, and the number of branch sources jumping into the cache block. Define the count of branch sources jumping into a cache block as the degree of correlation of that cache block. The cache block with the least degree of correlation, that is, the smallest count, is the first candidate to be replaced. Among cache blocks with the same degree of correlation, the oldest cache block is replaced first, to avoid replacing a cache block that has only just been filled. When a cache block is replaced, the jump paths (the branch source addresses) of this block stored in the correlation table are used to find the branch source entries in the track table, and each target address in those entries is replaced by the lower-level cache address stored in the correlation table. This keeps the integrity of the control flow information stored in the track table. The above describes replacement based on the degree of correlation within the same memory hierarchy level.
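The victim selection and the track table fix-up described above can be sketched as follows; the field names are hypothetical stand-ins for the correlation table fields.

```python
# Least-correlation victim choice: smallest branch-source count wins,
# and ties go to the oldest block (higher 'age' = older).

def pick_victim(blocks):
    """blocks: one record per cache block, with 'count' (number of
    branch sources jumping in) and 'age'."""
    return min(range(len(blocks)),
               key=lambda i: (blocks[i]['count'], -blocks[i]['age']))


def redirect_sources(ct_row, track_table):
    # On replacement, every branch source recorded in the correlation
    # table row is re-pointed at the block's lower-level cache address,
    # preserving the integrity of the control flow information.
    for src in ct_row['sources']:
        track_table[src] = ct_row['lower_addr']
```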
The minimum-correlation replacement method can also be applied between different memory hierarchy levels. The method is to record, as the degree of correlation of a lower-level cache block, the number of higher-level cache blocks whose content is identical to that lower-level cache block. The smaller the count, the lower the degree of correlation, and the lower-level block with the least correlation is to be replaced. This method may be named the Least Children method, wherein the children of a cache block are the higher-level cache blocks whose content is identical to it. The number of track table entries with the cache block as branch target is also recorded (the cache blocks and track tables may be at different memory hierarchy levels). When both counts are '0', the cache block can be replaced. If the children count is not '0', the cache block can be replaced after its children are replaced. If the count of track table entries targeting the cache block is not '0', the block can be replaced after the count drops to '0', or when the address of this cache block in every track table entry targeting it has been replaced by the lower-level cache address. The minimum degree of correlation between memory levels can also work together with the replace-the-oldest method described above.
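The two-count replaceability test of the Least Children method may be summarized as below; the function names are illustrative only.

```python
# Least Children replaceability: a lower-level block is replaceable
# only when no higher-level copy ("child") remains and no track table
# entry still names the block as a branch target.

def can_replace(children, targeting_entries):
    return children == 0 and targeting_entries == 0


def least_children_victim(children_counts):
    # Among lower-level blocks, the one with the fewest children is
    # the preferred replacement candidate.
    return min(range(len(children_counts)), key=children_counts.__getitem__)
```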
This disclosure provides a method of temporarily storing the tracker state and the register state of the processor core into a memory identified by a thread number. To switch threads, the contents of this memory and the tracker and register state of the processor core are exchanged. Since the instructions of each thread in the serving cache of this disclosure are independent, there is no need to clear the cache when switching threads, and a thread can never execute an instruction of another thread.
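The state exchange may be sketched as a simple save-and-restore keyed by thread number (the state record fields shown are hypothetical):

```python
# Thread switch: save the running thread's tracker/register state under
# its thread number, restore the incoming thread's state. No cache flush
# is needed because each thread's instructions are kept independent.

def switch_thread(core_state, saved, old_tid, new_tid):
    saved[old_tid] = core_state
    return saved.pop(new_tid)
```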
This disclosure proposes a method and a processor system that can directly execute instructions provided by caches at a plurality of memory hierarchy levels.
This disclosure proposes a function call and function return method and system based on track table.
This disclosure provides a computer memory hierarchical organization method and system. With the exception of the hard disk, the memory hierarchy, including the traditional main memory, is organized as cache and is managed by hardware, without memory allocation by the operating system. This way of reading instructions or data does not need matching by a tag unit, thus reducing the read delay.
This disclosure proposes a fully associative cache method which preserves the bidirectional mapping relationship of data at different memory hierarchy levels and, based on this bidirectional address mapping, avoids tag address matching. Before a load instruction is executed, the cache system serves the data to the processor core in advance, according to the stride information and interrelationships extracted and retained when the same load instruction was executed before.
This disclosure proposes a method and system for extracting and recording the relationship between data organized in a logical manner (i.e., data address information contained in the data). Based on the execution results of load instructions, the method and system learn, extract, and retain in a data track table the logical relationships of the data. The entries in the data track table correspond one-to-one to the data memory entries. The data track table entry corresponding to 'data' in the data memory preserves the 'data type' generated by analyzing the relationships between the data. The data track table entry corresponding to an 'address' in the data memory preserves the mapped 'address pointer'. The 'address pointer' can directly address the data memory to read data, without the need of matching by a tag unit. Before the logical relationship is extracted, the method and system serve data to the processor core according to the interrelationships between the data. After the logical relationship is extracted, the method and system read data and serve it to the core before the load instruction is executed, based on the logical relationship extracted the last time the same load instruction was executed and preserved in the data track table.
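The data track table idea can be sketched as a shadow array alongside the data memory. The field names ('data type', 'address pointer') follow the text above, but the class shape is a hypothetical software model, not the claimed structure.

```python
# Shadow data track table: each data memory entry has a companion entry
# recording either a learned 'data' type or a pre-mapped 'pointer', so
# a pointer entry can address the data memory directly (no tag match).

class DataTrackTable:
    def __init__(self, size):
        self.kind = ['unknown'] * size     # 'data' or 'pointer'
        self.pointer = [None] * size       # mapped cache address, if a pointer

    def learn(self, index, is_pointer, mapped_addr=None):
        # Record the logical relationship observed when a load executed.
        self.kind[index] = 'pointer' if is_pointer else 'data'
        self.pointer[index] = mapped_addr

    def serve(self, index, data_memory):
        # A pointer entry is followed directly; a data entry is returned as-is.
        if self.kind[index] == 'pointer':
            return data_memory[self.pointer[index]]
        return data_memory[index]
```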
The memory hierarchy method and system of this disclosure autonomously serve most of the instructions and data to the processor core; in most cases, the processor core is only responsible for providing branch decisions or comparison results, and the processor's pipeline stall signal.
This disclosure provides a memory hierarchy and method that can access a memory hierarchy at the other end of a communication channel with a uniform memory address.
This disclosure provides a processor system comprising a processor core and a cache, wherein the cache serves instructions and data to the processor core for execution and processing.
The system and method of this disclosure may provide a fundamental solution to the bidirectional delay of a processor core accessing a cache in a processor system. In a traditional processor system, the processor sends a memory address to the cache, which sends information (instructions or data) to the processor core according to that memory address. By utilizing the correlation between instructions, the system and method of this disclosure serve instructions from the cache to the processor core, avoiding the delay of the processor sending the memory address to the cache. In addition, the serving cache of this disclosure is not part of the processor core pipeline, so instructions can be served in advance to hide the cache-to-processor-core delay.
The system and method of this disclosure also provide a multi-level cache organization in which virtual-to-physical address translation and address mapping are carried out only at the lowest level cache (LLC). Rather than performing virtual-to-physical address translation at the highest-level cache and performing address mapping at every cache level as in a conventional cache, each level of the multi-level serving cache can be addressed by a cache address. The cache addresses are obtained based on memory physical address mapping, such that the cost and power consumption of the fully associative cache are similar to those of a direct-mapped cache.
The system and method of this disclosure also provide a cache replacement method based on the degree of correlation between data blocks. The method is suitable for a cache organized based on the relationships between instructions (control flow information).
Other advantages and applications of this disclosure will be apparent to those skilled in the art.
The high-performance cache system and method proposed by this disclosure will be described in further detail below with reference to the accompanying figures and specific examples. The advantages and features of this disclosure will become more apparent from the following description and the claims. It is to be understood that the figures are in a very simplified form and are used in non-precise proportions only for the purpose of facilitating and clarifying the embodiments of the invention.
It is to be understood that, in order to clearly illustrate the contents of this disclosure, this disclosure contemplates a number of embodiments to further illustrate the different implementations of the invention, wherein the plurality of embodiments is enumerated but not exhaustive. In addition, for the sake of simplicity of explanation, the contents already mentioned in the preceding embodiment are often omitted in the latter embodiment, and therefore, the contents not mentioned in the following embodiments may be referred to the previous embodiments.
While the invention admits various modifications and substitutions, and some specific implementation diagrams are set forth in the specification and illustrated in detail, it is to be understood that the inventor does not intend to limit the invention to the particular examples set forth. On the contrary, the inventor's intent is to protect all improvements, equivalent conversions, and modifications within the spirit and scope defined by the claims. The same component numbers may be used throughout the drawings to represent the same or similar parts.
In addition, some embodiments have been simplified in the present specification in order to provide a clearer picture of the technical solution of this disclosure. It is to be understood that altering the structure, delay, clock cycle differences and internal connection of these embodiments within the framework of the technical solution of this disclosure is intended to be within the scope of the appended claims.
The cache in the processor system can be improved with a data structure called a track table (TT hereafter). The track table stores not only the branch target instruction information of branch instructions, but also the fall-through instruction information.
Only the fields 12 and 13 are shown in the track table 10 of
A blank entry in track table 10 corresponds to a non-branch instruction; the remaining entries correspond to branch instructions, and each such entry holds the L1 cache address (BN1) of the branch target (instruction) of its corresponding branch instruction. For a non-branch instruction entry on a track, the next instruction to be executed can only be the instruction represented by the entry to its right on the track. For the last entry on a track, the next instruction to be executed can only be the first valid instruction in the L1 cache block pointed to by the content of that last entry. For a branch instruction entry on a track, the next instruction to be executed can be either the instruction represented by the entry to its right or the instruction pointed to by the BN in the entry, selected by the branch decision. Thus, the track table 10 contains the complete program control flow information for all the instructions stored in the L1 cache.
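The successor rules just described can be captured in a small illustrative function, with a track modeled as a list whose blank entries are `None` (a software sketch, not the hardware encoding).

```python
# Successor candidates for the entry at position idx of a track:
#  - last entry: the first instruction of the next block (end entry);
#  - blank (non-branch): only the entry to its right;
#  - branch: the entry to its right OR the BN1 target, chosen later by
#    the branch decision.

def next_candidates(track, end_entry, idx):
    if idx == len(track) - 1:
        return [end_entry]          # first instruction of the next cache block
    if track[idx] is None:          # blank entry: non-branch instruction
        return [idx + 1]
    return [idx + 1, track[idx]]    # fall-through or branch target BN1
```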
Refer to
Returning to
Refer to
The operation of the processor system of the embodiment of
Refer to
Scanner 43 scans the instruction blocks being stored from the L2 cache memory 42 into the L1 cache memory 22 and directly calculates the branch target addresses of the branch instructions, by adding the branch offset in each branch instruction to the memory address of the branch instruction itself. The calculated branch target address is selected by selector 44 and sent to the TLB/tag unit 41 for matching. The AL2 40 is accessed using the matched L2 cache address BN2. If the instruction corresponding to the L2 cache address has already been stored in the L1 cache memory 22, then the corresponding entry in 40 is valid; in this case the BN1X block address in that entry, the type of the branch instruction generated by the scanner 43, and the block offset BNY are combined into a track table entry. If the instruction corresponding to the L2 cache address has not been stored in the L1 cache memory 22, then the corresponding entry in 40 is invalid; in this case the L2 cache address BN2 obtained by the matching above (including the block offset BNY) and the type of the branch instruction generated by the scanner 43 are combined into a track table entry. The track table entries so generated are written, in instruction order, into the track in track table 20 that corresponds to the said instruction block of memory 22. Thus the extraction and storage of the program flow contained in the instruction block are completed.
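The scanner's target calculation can be sketched as follows. This is a software illustration of the add-offset step only; one instruction per address unit is assumed, and the subsequent tag matching and BN1/BN2 selection are omitted.

```python
# Scanner sketch: for each branch instruction in a block being filled,
# the branch target memory address is the instruction's own address
# plus the offset encoded in the instruction; non-branch instructions
# yield blank track entries.

def scan_block(block_base, instructions):
    """instructions: list of (is_branch, offset) pairs; returns one
    target memory address (or None) per instruction, in order."""
    entries = []
    for i, (is_branch, offset) in enumerate(instructions):
        if is_branch:
            entries.append(block_base + i + offset)  # target memory address
        else:
            entries.append(None)                     # blank track entry
    return entries
```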
The read pointer 28 generated by the tracker 47 addresses the track table 20 to read an entry and output it via bus 29. The controller 27 decodes the branch type and the address format of the output entry. If the branch type in the output entry is a direct branch and the cache address is in BN2 format, the controller 27 addresses the AL2 40 with the BN2 address. If the entry in 40 is valid, the BN1X in the entry is filled into the track table 20 to replace the BN2X in the entry so that it becomes BN1 format. If the entry in 40 is invalid, the controller uses that BN2 to address the L2 cache memory 42, reads the instruction block to fill an L1 cache block of the L1 cache memory 22 provided by the cache replacement logic, fills the L1 cache block number BN1X into the invalid entry in 40 and sets it valid, and fills that BN1X into the track table entry as above, replacing the BN2 address with a BN1 address. The BN1 address written into track table 20 above can be bypassed onto bus 29 to the tracker 47 for use. If the branch type output on bus 29 is a direct branch and the cache address is already in BN1 format, the controller 27 sends it directly to the tracker 47 for backup.
If the branch type output on bus 29 is an indirect branch, the controller 27 controls the tracker to wait for the processor core 23 to calculate the indirect branch target address and send it via buses 46 and 44 to the L2 cache TLB/tag unit 41 for matching. The matched L2 cache address BN2 is used to access the AL2 40. If the corresponding entry in 40 is invalid, the BN2 address is used to address the L2 cache memory 42 and read the instruction block into an L1 cache block of the L1 cache memory 22 as above, and the resulting BN1 is bypassed to the tracker 47 for backup. The correlation table 37 is a component of the replacement logic of the L1 cache 22; its structure and function will be described in
There are two pipelines ahead of the branch decision pipeline segment in the processor core 23. One of them receives the fall-through instruction from the IRB 39 and is named the FT (fall-through) branch; the other receives the branch target instruction from the L1 cache memory 22 and is named the TG (target) branch. The number of front-end pipeline segments included in the two branches is determined by the pipeline structure of the processor; in this embodiment, two front-end pipeline segments are included as an example. The branch decision pipeline segment in the processor core 23 executes the branch instruction; one of the two instructions is selected for execution according to the generated branch decision 31, and the other branch is discarded. In the present embodiment, the IRB 39 can store two instruction blocks as an example, and the IRB 39 is addressed by the IPT read pointer 38 of the tracker 48. The L1 instruction cache 22, the correlation table 37, and the track table 20 are addressed by the RPT 28 of the tracker 47.
When the processor core 23 has not yet made a decision on the branch, the default value of the branch decision 31 is '0', i.e., do not take the branch, and the processor core 23 selects the instruction of the FT branch for execution. When the processor core 23 generates a decision of 'do not take the branch', the value of the branch decision 31 is '0', and the processor core 23 selects the instruction of the FT branch for execution; when the processor core 23 generates a decision of 'take the branch', the value of the branch decision 31 is '1', and the processor core 23 selects the instruction of the TG branch for execution. The selectors 33, 25, and 35 can be controlled by the branch decision 31: when 31 is '0', the three selectors select the input on the right; when 31 is '1', they select the input on the left. In addition, the selectors 33 and 25 are also controlled by the controller 27 when the processor core 23 has not made a decision on the branch. The operation of the processor system of the embodiment of
M2 is a branch instruction; when it reaches the branch decision pipeline segment of the processor core 23, that segment executes the M2 instruction to generate a branch decision. If the branch decision 31 is '0', the processor core 23 selects the M3 and N0 instructions in the FT branch to continue executing, and the J3 and K0 instructions in the TG branch are discarded. The branch decision 31 controls the selectors 25 and 35 to select the output of the incrementor 34 to store into the registers 26 and 36, so that both the RPT 28 and the IPT 38 point to N1; the IPT 38 controls the IRB 39 to output N1 and the subsequent instructions to the FT branch of the processor core 23 for continued execution. At this time, the RPT 28 points to row N in the track table, reads the end entry of row N, and sends it to the L1 cache 22 to read the fall-through instruction block of instruction block N and store it in the IRB 39.
If the branch decision 31 is '1', the processor core selects the J3 and K0 instructions in the TG branch to continue execution, and the M3 and N0 instructions in the FT branch are discarded. At this time, the branch decision 31 controls the storing of the row K instructions output by the L1 cache 22 into the IRB 39, controls the selectors 25 and 35 to select the output of the incrementor 24 and store it into the registers 26 and 36, and makes both the RPT 28 and the IPT 38 point to K1. The IPT 38 controls the IRB 39 to output K1 and subsequent instructions to the FT branch of the processor core 23 for continued execution. The RPT 28 points to row K, and the end entry of row K is sent to the L1 cache 22 to read row L and store it in the IRB 39. In this way, the processor core 23 can execute instructions without interruption, and without pipeline stalls due to branching.
The tracks in the track table are orthogonal to each other, so they can coexist without affecting one another. The indirect branch address 46 generated by the processor core in
Refer to
The level 3 active list (AL3 hereafter) 50 in
When an L2 instruction block within an L3 cache block in the L3 cache 52 is stored into an L2 cache block in the L2 cache 42, the block number of that L2 cache block in 42 is stored in the entry 80 addressed by the L2 sub-address 63 in the row of AL3 50 corresponding to that L3 cache block, and the corresponding valid bit 81 is set to '1' (valid). The instructions in the L2 cache block are decoded by the L3 scanner 53, wherein the branch offset in each branch instruction is added to the address of the instruction to obtain the branch target address. The address of the next L2 cache block after this L2 cache block is also determined, by adding the size of an L2 cache block to the memory address of this L2 cache block. The branch target address or the fall-through L2 cache block address is selected by the selector 54 to be matched in the tag unit 51; if it does not match, the address is sent to the lower-level memory to read instructions, and those instructions are stored in the L3 cache memory 52. This ensures that, for the instructions in the L2 cache memory 42, their branch targets and fall-through cache blocks are at least in the L3 cache memory 52, or are in the process of being stored into 52.
When an L1 instruction block within an L2 cache block in the L2 cache 42 is stored into an L1 cache block in the L1 cache 22, the block number of that L1 cache block in 22 is stored in the entry 76 addressed by the L1 sub-address 64 in the row of AL2 40 corresponding to that L2 cache block, and the corresponding valid bit 77 is set to '1' (valid). The instructions in the L1 cache block are decoded by the L2 scanner 43, wherein the branch offset in each branch instruction is added to the address of the instruction to obtain the branch target address. The address of the next L1 cache block after this L1 cache block is also determined, by adding the size of an L1 cache block to the memory address of this L1 cache block. The branch target address or the fall-through cache block address is selected by the selector 54 to be matched in the tag unit 51. If it does not match, the address is sent to the lower-level memory to read instructions, and those instructions are stored in the L3 cache memory 52; if it matches, then the 65, 62, 64 portions of the obtained L3 cache address are used to read the entries 80 and 81 in the AL3 50. If 81 is '0' (invalid), then the 65, 62, 63, 64 portions of the obtained L3 cache address are used to address the L3 cache memory 52, reading an L2 cache block to store into an L2 cache block in the L2 cache memory 42, and the block number 67 of this L2 cache block and the valid bit '1' are written into the entries 80 and 81 addressed by the L3 cache address in AL3 50.
If the read-out valid bit 81 is '1' (valid), then the BN2X value (67 and 64) of the read-out entry 80 is used to address the AL2 (level 2 active list) 40 to read out entry 76 and valid bit 77. If 77 is '0' (invalid), then the BN2X value and BNY are combined into the BN2 address (67, 64, 13) and stored in the entry corresponding to the said instruction in the track being filled in the track table 20. If 77 is '1' (valid), then the BN1X value and BNY are combined into the BN1 address (68, 13) and stored in the entry corresponding to the said instruction in the track being filled in the track table 20. In addition, the branch type 11 decoded by the L2 scanner 43 is stored in the track table 20 entry together with the BN2 or BN1 address. The next-block address is matched and addressed in the above-described manner; if the next L2 instruction block is not yet in the L2 cache memory, the instruction block is stored from the L3 cache 52 into the L2 cache 42, and the resulting BN2 or BN1 address is stored in the rightmost end entry 16 of the above track. This ensures that, for the instructions in the L1 cache memory 22, their branch targets and fall-through L1 cache blocks are at least already in the L2 cache memory 42, or are in the process of being stored into 42.
The present embodiment discloses a hierarchical prefetch function. Each level ensures that its branch targets at least exist in, or are being written into, the next lower level of the memory hierarchy. As a result, the branch target instructions of the instruction the processor core is executing are in most cases in the L1 cache or L2 cache, masking the access delay of the lower memory levels.
The corresponding row in the CT 37 is established while the above-mentioned L1 instruction block is filled into the L1 cache memory 22 and the instructions in the cache block are scanned to establish the corresponding track to fill into the track table 20. The BN2X address (67 and 64) of the L1 cache block is filled into field 71 of the corresponding row in CT 37, so that when the L1 cache block is replaced, the BN2X address can replace the L1 cache block number BN1X in the entries targeting that L1 cache block, in order to keep the integrity of the control flow information in the track table. At the same time, the BN1X of the branch target in the track being written into the track table 20 is used to address the corresponding row in the CT 37: the count value 70 in that row is increased by '1' to record another branch instruction that uses that row as its target, the L1 cache block number of the track itself is written into its field 72, and the corresponding field 73 is set to '1' (valid) to record the path (address) of the branch source. For the next sequential L1 cache address stored in the track end entry, the row in the correlation table 37 is also updated in a similar manner.
The branch target address format in the entry of track table 20 can be BN2 format or BN1 format. When the track table entry is output from the bus 29, the controller (27 of
The cache replacement logic of this embodiment uses a combination of Least Correlation (LC) and Earliest Replacement (ER) (hereinafter LCER) to determine the cache block that can be replaced. The count 70 in the CT 37 is used to check the correlation: the smaller the count value, the fewer cache blocks target the L1 cache block, and the easier the L1 cache block is to replace. The pointer 74, shared by all rows in the CT 37, points to the row that can be replaced (the count 70 of a replaceable row must be lower than a preset value). When the L1 cache block pointed to by the pointer 74 is replaced, the corresponding track in the track table 20 pointed to by 74 is also replaced by the new track containing the branch types and branch targets extracted from the replacing (new) L1 cache block by the L2 scanner 43. In the CT 37 entry pointed to by the pointer 74, each field 72 with a valid field 73 points to a track in the track table 20; in those tracks, the branch target addresses bearing the BN1X of the cache block being replaced are replaced by the BN2X address stored in field 71 of that CT 37 entry. Thus each instruction targeting the replaced (old) L1 cache block now targets the same instruction stored in the L2 cache memory 42 as its branch target. This ensures the replacement of an L1 cache block does not impact the integrity of the control flow information. At the same time, that BN2X is used to address the AL2; the count number 75 in the entry of 40 is increased according to the number of times BN1X was replaced with BN2X in the track table 20 as described above, in order to record the increased correlation of the L2 cache block; and the valid bit 77 of the entry of 40 corresponding to the replaced L1 cache block (pointed to by the field 64 of the BN2X address) is set to ‘0’ (invalid).
After this, the pointer 74 moves in a single direction and stays on the next row that satisfies the least-correlation condition; when the pointer goes past the boundary of the rows in the CT 37, it wraps back to the other boundary (e.g., if it exceeds the largest row address it resumes the least-correlation check from the smallest row address). The one-way movement of the pointer 74 ensures that the L1 cache block that was replaced earliest (the oldest block) becomes the first candidate for replacement, which is the meaning of ER above. The detection of the count number 70 of each row and the one-way movement of the pointer 74 together implement the LCER L1 cache replacement strategy. This replacement method replaces a single L1 cache block at a time.
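The LCER sweep of the pointer 74 may be sketched as follows (the threshold value and the list representation of the per-row counts 70 are illustrative assumptions): the pointer advances in one direction, wraps at the boundary, and stops on the next row whose correlation count is below the preset value.

```python
# Hypothetical sketch of the LCER pointer sweep over CT rows.

def next_replaceable(counts, ptr, threshold):
    """counts: per-row correlation counts (field 70); ptr: current
    pointer position. Returns the next row index after ptr (wrapping)
    whose count is below the threshold, or None if no row qualifies."""
    n = len(counts)
    for step in range(1, n + 1):
        row = (ptr + step) % n         # one-way movement with wrap-around
        if counts[row] < threshold:
            return row
    return None
```

Because the pointer never reverses, a row passed over recently is revisited only after a full sweep, which gives the Earliest Replacement property described above.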
Replacement can also proceed in program order or in reverse program order. For example, when an L1 cache block is replaced, the cache block pointed to by the L1 cache block number (BN1X) in the end entry of its track is also replaced; this method is called in-order replacement. Alternatively, when an L1 cache block is replaced, its previous block in program sequence is also replaced; this is called reverse-order replacement. The previous block is designated by the BN1X in a field 72 of the corresponding CT row of the L1 cache block. It is even possible to start from one L1 cache block and replace in both orders at once. Replacement can continue in order or in reverse order until it encounters an L1 cache block whose corresponding count number 70 in the correlation table 37 exceeds the preset value. This replacement method replaces a plurality of L1 cache blocks at a time. The singular replacement method or the plural replacement method may be used as desired, and the two can also be used in combination: for example, use the singular replacement method in normal cases, and use the plural replacement method when the low-level cache lacks replaceable cache blocks.
The replacement of the L2 cache is also based on the LCER strategy. In addition to setting the corresponding field 77 in the AL2 40 to ‘0’ and increasing the count number 75 when an L1 cache block is replaced: when a cache block is stored from the L2 cache memory 42 into the L1 cache memory 22, the corresponding valid bit 77 in the corresponding AL2 40 entry is set to ‘1’, and the L1 cache block number (BN1X) is written into the corresponding field 76. Each time a BN2X obtained by matching a branch target address is stored into the track table 20, the count number 75 in the AL2 entry corresponding to that BN2X is increased by ‘1’; each time a BN2X in a track table entry is replaced by a BN1X, the count number 75 corresponding to that BN2X is decreased by ‘1’. Thus the count number 75 records the number of times an L2 cache block is used as a branch target; each valid bit 77 in the entry records whether a portion of the L2 cache block has been stored in the L1 cache; and each field 76 records the block address 68 of the corresponding L1 cache block. L2 cache replacement moves the L2 pointer 78, which is shared by all L2 cache blocks, in one direction to stay on the next replaceable L2 cache block. A replaceable L2 cache block is one whose count value 75 and all fields 77 in the corresponding AL2 40 entry are ‘0’; that is, an L2 cache block is replaceable when none of the instructions in the L1 cache 22 is a part of that L2 cache block. The one-direction movement of the pointer 78 ensures the ER property described above.
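The L2 replaceability condition stated above reduces to a simple predicate; the sketch below assumes the count 75 is an integer and the four valid bits 77 are a list of 0/1 flags (an illustrative layout, not the disclosed circuit).

```python
# Hypothetical sketch: an L2 cache block is replaceable only when its
# branch-target count (field 75) is 0 and every valid bit (field 77) is
# 0, i.e. no part of the block currently resides in the L1 cache.

def l2_replaceable(count75, valid77_bits):
    return count75 == 0 and not any(valid77_bits)
```

The L2 pointer 78 would apply this predicate in the same one-way sweep shown for the L1 pointer 74.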
The replacement of the L3 cache is also based on the LCER strategy. When a cache block is stored into the L2 cache memory 42 from the L3 cache memory 52, the corresponding valid bit 81 in the corresponding entry in the AL3 50 is set to ‘1’, and the L2 cache block number BN2X is written into the corresponding field 80. The count number 79 in the entry of the AL3 50 is not used in the present embodiment. The L3 cache is set-associative, in which each set (with the same index address) has a plurality of ways, and all ways in the same set share a common pointer 82. The next replaceable way is found by the pointer 82, where a replaceable way is one whose fields 81 are all ‘0’; that is, the L3 cache block correlates to none of the instructions in the L2 cache 42 and can therefore be replaced. Methods other than the one-direction moving pointer can also be used to ensure that a recently replaced L3 block is not replaced again soon.
In the present embodiment, the L3 cache is set-associative. If a set is encountered in which no way is replaceable (each way in the AL3 50 has at least one field 81 being ‘1’), the way containing the fewest fields 81 being ‘1’ is selected for plural replacement. If a way contains only one field 81 of value ‘1’, that is, only one of the four L2 instruction blocks that can be stored in the L3 cache block is in the L2 cache memory 42, then the BN2X in the field 80 corresponding to that field 81 is output to address the AL2 40, the BN1X number in the first valid field 76 (whose field 77 is ‘1’) is read out, and the number N of L1 cache blocks from this L1 cache block to the last valid L1 cache block in the L2 cache block is calculated. The BN1X and the number N of L1 cache blocks are sent to the L1 cache replacement logic; N L1 cache blocks are replaced starting from the L1 cache block pointed to by the BN1X, together with the cache blocks that use these cache blocks as targets, and then the L2 cache block can be replaced. At that point all the fields 81 of the above-mentioned way in the AL3 50 are ‘0’, and the corresponding L3 cache block can be replaced. If the L1 cache blocks contained in the L3 cache block are not contiguous, then according to the above method plural starting points and plural corresponding N values are set and sent to the L1 cache replacement logic to replace in order.
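The computation of the plural-replacement request (starting BN1X and count N) may be sketched as follows, under the assumption that the four valid bits 77 and the four BN1X fields 76 of an AL2 entry are given as lists (an illustrative encoding).

```python
# Hypothetical sketch: from an AL2 entry's valid bits (77) and BN1X
# fields (76), derive the starting BN1X and the count N spanning from
# the first valid L1 sub-block to the last valid one.

def plural_request(valid77, bn1x76):
    idx = [i for i, v in enumerate(valid77) if v]
    if not idx:
        return None                         # nothing resident in L1
    first, last = idx[0], idx[-1]
    return (bn1x76[first], last - first + 1)  # (starting BN1X, N)
```

When the resident L1 blocks are not contiguous, the same scan would instead emit one (BN1X, N) pair per contiguous run, as the paragraph above describes.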
In the embodiment of
Refer to
Each track in the L2 track table 88 corresponds to an L2 cache block in the L2 cache 42. Each L2 track contains four L1 tracks, each of which corresponds to a level 1 instruction block in the L2 cache block. The track entries of the L1 tracks in the L2 track table 88 are also in the formats of SBNY 15, type 11, BNX 12 and BNY 13 in
When a level 1 instruction block in an L2 cache block of the L2 cache memory 42 is stored into an L1 cache block in the L1 cache memory 22, the L2 track table 88 outputs the corresponding L1 track via the bus 89 to be stored into the track table 20. If the address in an entry on the track is in the BN3 address format, that address is used to access the AL3 50; if the AL3 entry bit 81 is invalid, then according to the above method the L2 cache block is stored from the L3 cache 52 into an L2 cache block in the L2 cache 42, and the L2 cache block number is combined with the sub-address 64 within the BN3 address into a BN2X address, which is stored into field 80 of the AL3. If the AL3 entry is valid, the BN2X in the entry is stored into the L2 track table 88 to replace the original BN3X address. The BN2X is also bypassed to the bus 89 to be stored into the track table 20. The present embodiment uses the count number 79 in the AL3 50, similar to the use of the count number 75 in the AL2 in the embodiment of
The BN2 address on the bus 89 is also used to address the AL2 40. If the valid bit 77 of the entry in 40 is invalid, the BN2 address is stored into the entry of the track table 20; if the valid bit 77 is valid, the BN1X address of the entry in 40 is combined with the BNY address of the BN2 address and stored into the entry of the track table 20. When a BN2 address is output from the track table 20 via the bus 29, it is used to address the AL2 40; if the valid bit 77 in the entry is invalid, that BN2 address is used to access the L2 cache memory 42 to read an L1 instruction block and store it into an L1 cache block in the L1 cache memory 22, that L1 cache block number BN1X is stored into field 76 of the AL2 40, and the BN1X is stored into the track table 20; the BN1X can also be bypassed to the bus 29 for use by the tracker 47. In this embodiment, the address of a track entry in the L2 track table 88 can be in the BN3 or BN2 format, and the address of a track entry in the track table 20 may be in the BN2 or BN1 format. Another strategy is to fill the track table 20 with BN1 addresses only: when the address on the bus 89 is in the BN2 format, it is used to address an AL2 40 entry. If bit 77 of the entry is invalid, that BN2 address is used to access the L2 cache memory 42 to read out a level 1 cache block and store it into an L1 cache block in the L1 cache memory 22; the L1 cache block number BN1X of that block is stored into field 76 of the AL2 40, its corresponding field 77 is set to valid, and the BN1X is also stored into the track table 20 and can be bypassed to the bus 29 for use by the tracker 47. If 77 in 40 is valid, the BN1X in field 76 of the entry is used directly to fill the track table 20 and is bypassed to the bus 29 for use.
Refer to
When the field 13 is ‘invalid’, or it is ‘valid’ but the base address on the bus 93 does not match the content in the register 95, the selector 98 selects the BN1 address on the bus 89 to output via the bus 99. When the type of the entry on the bus 29 is an indirect branch instruction, the address on the bus 99 is used by the tracker 47; when the entry type on the bus 29 is another type, the address on the bus 29 is selected for use by the tracker 47. The next time the same indirect branch instruction is executed, the register set number in field 13 of the track table entry on the bus 29 selects the corresponding register set 95 and 96, and the RF address in field 12 selects the data on the bus 94 that is written back to that RF entry to compare with the content in register 95. If they match, the BN1 address in the corresponding row of register 96 is output via the bus 97 and selected by the selector 98 for use by the tracker; if they do not match, then according to the above-said method the adder 93 calculates the indirect branch target address, which is matched into a BN1 address and put on the bus 89, and the selector 98 selects the address on the bus 89 to output. The mismatch also causes the base address on the bus 94 and the BN1 address on the bus 89 to be stored in an unused row in registers 95 and 96. The replacement logic is responsible for allocating a register set of 95, 96 for entries of indirect branch type on the bus 29 whose field 13 is ‘invalid’, using an LRU method or the like. Thus, the present embodiment can map the base address of an indirect branch instruction to the L1 cache address BN1, eliminating the steps of address calculation and address mapping.
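The register pairs 95/96 form, in effect, a small indirect-branch target cache; a software sketch of their lookup and update behavior follows (the class, its method names, and the flat per-set layout are illustrative assumptions, not the disclosed register organization).

```python
# Hypothetical sketch of the indirect-branch target cache formed by the
# register sets 95 (base-address values) and 96 (mapped BN1 addresses).

class IndirectTargetCache:
    def __init__(self, nsets):
        self.base = [None] * nsets   # registers 95: stored base values
        self.bn1 = [None] * nsets    # registers 96: mapped BN1 addresses

    def lookup(self, set_no, base_value):
        """Hit: return the previously mapped BN1 address, skipping the
        address calculation and mapping steps. Miss: return None."""
        if self.base[set_no] == base_value:
            return self.bn1[set_no]
        return None

    def update(self, set_no, base_value, bn1):
        """On a mismatch, store the new base value and its BN1 mapping."""
        self.base[set_no] = base_value
        self.bn1[set_no] = bn1
```

On a hit the tracker uses the cached BN1 directly; on a miss the target is recalculated and mapped, and the pair is refilled as in the paragraph above.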
Refer to
Refer to
The structure of the L2 CT 103 is similar to that of the CT 37, wherein each L2 cache block has a count number, the L3 cache address corresponding to this L2 cache block, the source addresses of the branch source instructions targeting this L2 cache block, and their corresponding valid signals (refer to the CT format of
When the entry address format of the output 29 from the track table 20 is the BN2 format, the BN2 address is used to address the level 2 active list (AL2) 40. If the corresponding entry is invalid, the BN2 address (hereinafter referred to as the source BN2 address) is used to read an instruction block from the L2 cache memory 42 to fill an L1 cache block in the L1 cache 22 selected by the replacement logic. Then, the source BN2 address is used to address the level 2 track table 88 to output the corresponding track to be stored into the track table 20. When the output 89 of 88 is in the BN3 address format (hereinafter referred to as the target BN3 address), the target BN3 address is sent to the level 3 active list (AL3) 50 to be mapped into a BN2 address (hereinafter referred to as the target BN2 address); at this time the count number in the level 3 active list (AL3) entry pointed to by that target BN3 is decreased by ‘1’, while the value in the target row in the L2 CT 103 pointed to by the target BN2 address is increased by ‘1’; the target BN3 address is stored in the same target row; and the source BN2 address is also stored in the same target row, with the corresponding valid bit set to ‘valid’.
When an L2 cache block is replaced, the level 2 pointer 78 points to the target row in the L2 CT 103 corresponding to the replaceable L2 cache block; the valid BN2 source addresses are read out and used to address the level 2 track table (TT2) 88 entries, the BN2 target addresses (pointing to the target row) in those entries are replaced by the BN3 target address stored in the target row in 103, and the valid bits of each BN2 source address in the target row in 103 are set to ‘invalid’. The count number in the target row in 103 is decreased by the number of valid BN2 source addresses. The aforementioned BN3 target address is used to address the level 3 active list (AL3) 50 entry, whose count number 79 is increased by the same value subtracted from the count number in 103.
The above-mentioned cache replacement method is based on an inclusive cache, that is, the content of a higher level cache must also be in the lower level cache. The least correlation cache replacement method can also be applied to non-inclusive caches. It is possible to add a lock signal bit to the correlation table entry corresponding to each high-level cache block. When the lock signal bit is ‘0’, the operation is the same as above. When the lock signal bit is ‘1’, the corresponding cache block can be replaced only when its correlation degree is ‘0’, that is, when there is no branch instruction targeting that cache block (here, the end entry of the previous instruction block is also treated as an unconditional branch instruction). In the correlation table 37, an L1 cache block whose lock signal bit is ‘1’ can be replaced only when its corresponding count number 70 is ‘0’ and all valid bits 73 are ‘0’. In the level 2 correlation table (CT2) 103, an L2 cache block whose lock signal bit is ‘1’ can be replaced only when its corresponding count number and all valid bits are ‘0’.
For example, when replacing the L3 cache block of one way in one set of the L3 cache, the BN3 address of the L3 pointer 83 is used to address the entry in the level 3 active list (AL3) 50, and all valid BN2 addresses within the entry are used to address the rows of the level 2 correlation table (CT2) 103 to set their lock signals to ‘1’. The L3 cache block can then be replaced. After the replacement, the cache works in non-inclusive mode. The L3 cache block corresponding to an L2 cache block whose lock signal is set to ‘1’ has already been replaced, so it is not possible to maintain the integrity of the control flow information by replacing the BN2 address in the entry of the level 2 track table (TT2) 88 with the corresponding BN3 address. It is necessary to wait until the correlation degree of the L2 cache block is ‘0’; then the L2 cache block can be replaced.
A cache is in exclusive organization if the low-level cache is replaceable when all high-level caches are treated as having a lock signal of ‘1’, that is, a high-level cache block can only be replaced when its correlation degree is ‘0’; or, when the valid bits of all high-level cache sub-blocks in an active list entry corresponding to one cache block (for example, the bits 81 in the AL3 50) are all ‘1’ and the count number in the entry (for example, the 79 in 50) is ‘0’, the cache block is replaceable. It is also possible to set the cache replacement method so that a cache block in each cache level can be replaced when its correlation degree is ‘0’.
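The per-block replaceability rule across the lock-bit cases may be condensed into one predicate; the sketch below assumes a boolean lock flag, an integer correlation count, a list of source valid bits, and the LCER threshold (all illustrative names).

```python
# Hypothetical sketch: replaceability of a cache block under the lock
# signal bit. With lock == 0 the ordinary LCER threshold test applies;
# with lock == 1 the block may only go when its correlation degree is 0,
# i.e. the count and every source valid bit are 0.

def replaceable(lock, count, valid_bits, threshold):
    if lock:
        return count == 0 and not any(valid_bits)
    return count < threshold
```

An exclusive organization, as described above, would behave as if every high-level block had lock permanently set to ‘1’.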
The 102 in
The method in the exemplary embodiment of
The structure in the exemplary embodiment of
A specific exemplary embodiment of the first application may use a flash memory as the memory 111 and a DRAM as the memory 112. The flash memory has larger capacity and lower cost, but longer access latency and a limited number of writes. The DRAM memory has smaller capacity and higher cost, but lower access latency and unlimited writes. Thus, the structure of the exemplary embodiment of
The second application of the exemplary embodiment of
Assuming the address of this specific exemplary embodiment is in the format of
When the operating system controls the processor in
After that, or when 61 of the starting point address matches the contents of the tag in the tag unit, the system controller uses the way number 65, the index 62 of the starting point address, and the L2 sub-address 63 to read an L2 instruction block from the memory 112 (main memory); the L2 instruction block is then stored in the L2 cache memory 42 in an L2 cache block selected by the L2 block number 67 given by the L2 cache replacement logic, and that L2 block number 67 is stored into the entry 80 pointed to by the above 65, 62, and 63 in the AL3 50, with the corresponding valid bit 81 in the entry set to ‘valid’. The scanner 43 scans that L2 instruction block, extracts the branch instruction information therein, and generates the track to store into the L2 track table 88. Then, the system controller further uses the combination of the L2 block number 67 and the L1 sub-address 64 in the starting address to read an L1 instruction block in 42 and stores that L1 instruction block into an L1 cache block in the L1 cache memory 22 pointed to by the L1 block number 68 produced by the L1 cache replacement logic; the corresponding track in the L2 track table 88 is also stored into the track table 20, and in the process any BN3 address on the track is replaced with BN2 as described above; that L1 block number 68 is also stored into the entry 76 in the AL2 40 pointed to by the above 67 and 64, and the valid bit 77 of the entry is set to ‘valid’. At last, the system controller combines the above L1 block number 68 with the L1 block offset BNY 13 into the BN1 address and puts it into the register 26 of the tracker 47, making the read pointer 28 point to the starting point instruction in the L1 cache memory 22 and to the corresponding entry in the track table 20. The subsequent push operation to the processor core is similar to the previous exemplary embodiments.
In general, the new thread starting point address injected by the operating system, or the hard disk address generated by the scanner 43 or the indirect branch address generator 102, is selected by the selector 54 and then sent to the tag unit in 51 to match. If the match is successful, the matched BN3 address is used to address the AL3 50. If the output entry of 50 is ‘valid’, the BN2 in the entry is used to address the AL2 40. If the entry output from 50 is ‘invalid’, the above BN3 address is used to directly address the memory 112 (main memory) to output an L2 instruction block to the L2 cache memory 42. If the matching of the hard disk address in the tag unit 51 is not successful, the memory 111 (hard disk) is addressed via the bus 113, and the corresponding instruction block (page) is read and stored into the main memory block of the memory 112 (main memory) selected by the cache replacement logic, replacing the previous instruction block in it. This replacement from the hard disk to the main memory is entirely controlled by hardware and in general does not need to invoke software. The replacement logic can use a variety of algorithms such as LRU, NRU (not recently used), FIFO, clock, etc.
If the address space of the above hard disk address is greater than or equal to the address space of the memory 111, there is no need for the TLB in 51 of the exemplary embodiment of
The memory 111 and the memory 112 of the exemplary embodiment of
Refer now to
The lowest level 111 of the memory hierarchy in the exemplary embodiment of
When an instruction block is transmitted from the memory 122 (L4 cache memory) via the bus to the L3 cache memory 112, the scanner 43 extracts the branch address information in the instruction block, generates the track entry type, and calculates the branch target address. The branch target address is selected by the selector 54 and sent to 51 to match with the tag unit. If not matched, the branch target address is used via the bus 113 to address the memory 111, and the corresponding instruction block is read and stored into the memory 122 in the L4 cache block selected by the replacement logic of the L4 cache (the AL4 120 and the L4 CT 121, etc.). If matched, the matched BN4X address 123 is used to address the AL4 120. If the 120 entry is valid, the BN3X address in the entry and the BNY of the branch target address are combined into a BN3 address and stored via the bus 125 into the entry of the L3 track table 118 corresponding to that branch instruction; if the 120 entry is invalid, the BN4X address and the BNY address are directly combined into a BN4 address and stored into the 118 entry.
Refer to
When an L2 instruction block is transferred from the L3 cache memory 112 to the L2 cache memory 42, the corresponding track is read via the bus 119 from the L3 track table 118, and any BN4 format address in a track entry is used to address the AL4 120. If the 120 entry is valid, its BN3X address is used to fill the track table entry of 118, and the BN3X is bypassed to the bus 119 to be stored into the corresponding entry in the L2 track table 88. If the 120 entry is invalid, the BN4 address on the bus 119 is used to address the memory 122, and the corresponding instruction block is read and filled into the memory 112 in the L3 cache block selected by the BN3X address given by the L3 cache replacement logic (the AL3 50 and the L3 CT 117, etc.). The given BN3X address is stored in the entry in the AL4 120 pointed to by the above BN4 and in the corresponding entry in the L3 track table 118, and is also bypassed to the bus 119 and stored in the corresponding entry of the L2 track table 88. If the output on the bus 119 is already a BN3X address, that BN3X address is used to address the AL3 50. If the entry of 50 is valid, the BN2X address in the entry is stored in the corresponding entry of the L2 track table 88; if the entry of 50 is invalid, the BN3X address on 119 is used to address the memory 112, and the corresponding L2 cache block is read and stored into the L2 cache memory 42 in the L2 cache block pointed to by the BN2X address given by the L2 replacement logic (the AL2 40 and the L2 CT 103). The BN2X is stored in the L2 track table 88, and is also stored in the entry addressed by the above-mentioned BN3X in the AL3 50.
When an L1 instruction block is transferred from the L2 cache memory 42 to the L1 cache memory 22, the corresponding track is read via the bus 89 from the L2 track table 88, and any BN3 format address in a track entry is used to address the AL3 50. If the entry of 50 is valid, its BN2X address is used to fill the track table entry of 88, and the BN2X is bypassed to the bus 89 to be stored into the corresponding entry in the L1 track table 20. If the entry of 50 is invalid, the BN3 address on the bus 89 is used to address the memory 112, and the corresponding instruction block is read and filled into the memory 42 in the L2 cache block selected by the BN2X address given by the L2 cache replacement logic (the AL2 40 and the L2 CT 103, etc.). The given BN2X address is stored in the entry in the AL3 50 pointed to by the above BN3 and in the corresponding entry in the L2 track table 88, and is also bypassed to the bus 89 and stored in the corresponding entry of the L1 track table 20. If the output on the bus 89 is already a BN2X address, that BN2X address is used to address the AL2 40. If the entry of 40 is valid, the BN1X address in the entry is stored in the corresponding entry of the L1 track table 20; if the entry of 40 is invalid, the BN2X address on 89 is used to address the memory 42, and the corresponding L1 cache block is read and stored into the L1 cache memory 22 in the L1 cache block pointed to by the BN1X address given by the L1 replacement logic (the L1 CT 37, etc.). The BN1X is stored in the entry addressed by the above-mentioned BN2X in the AL2 40, and is also stored into the L1 track table 20.
When an instruction block is pushed from the L1 cache memory 22 to the processor core 23 or the IRB 39, the corresponding track is read via the bus 29 from the L1 track table 20, and any BN2 format address in a track table entry is used to address the AL2 40. If the entry of 40 is valid, its BN1X address is used to fill the track table entry of 20, and the BN1X is bypassed to the bus 29. If the entry of 40 is invalid, the BN2 address on the bus 29 is used to address the memory 42, and the corresponding instruction block is read and stored into the memory 22 in the L1 cache block pointed to by the BN1X address given by the L1 cache replacement logic (the L1 CT 37, etc.). The BN1X address is stored in the entry of the AL2 40 pointed to by the BN2 address, and is also stored in the corresponding entry in the L1 track table 20. If the output on the bus 29 is already a BN1 address, the BN1 address is stored into the register in the tracker 47 and becomes the read pointer 28, which is used to address the track table 20 and the L1 cache memory 22 to push instructions to the processor core 23 or the IRB 39. This ensures that for the instructions in the L1 cache memory 22, their branch targets and the fall-through cache blocks are at least already in the L2 cache memory 42, or are in the process of being stored into 42. The remaining operations are the same as described in the previous examples and are not described here.
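The same rule repeats at every level of the hierarchy above: if the lower-format address already maps to a resident block in the active list, the shorter address is used; otherwise the block is demand-filled from the level below and the new mapping is recorded. A much-simplified software sketch of that per-level step follows (the helper names and the dictionary model of an active list are invented for illustration).

```python
# Hypothetical sketch of the per-level resolution step applied at each
# fill (BN4->BN3, BN3->BN2, BN2->BN1). active_list models an active
# list as a dict: lower-level address -> resident block number, or
# absent when the entry is invalid.

def resolve(addr, active_list, fill_from_below):
    """fill_from_below(addr) models demand-filling: the replacement
    logic picks a victim block at this level and returns its number."""
    block = active_list.get(addr)
    if block is None:                  # entry invalid: demand fill
        block = fill_from_below(addr)
        active_list[addr] = block      # record mapping for future hits
    return block
```

Applying this step level by level is what guarantees that the branch targets of L1-resident instructions are at least already in (or being filled into) the L2 cache.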
Although the exemplary embodiment of
Each memory level also needs a data track table (DTT hereafter), a data active list (DAL hereafter), a data correlation table (DCT hereafter) and pointers to support the store operations of the data memory. Refer to
The serving data memory hierarchy also uses the stride table 150 to record the address difference (stride) between two adjacent data accesses by the same data access instruction. Please refer to
Refer to
In
When an L1 instruction block is stored in the IRB 39, its corresponding DRB 163 is cleared. When the decoder (the instruction decoder in the processor core 23, or a dedicated instruction decoder attached to the IRB 39 in this case) decodes an instruction sent to the processor core 23 as a data load instruction, the system allocates one row in the stride table 150 for it. The status bit 139 of the row is set to ‘0’. According to the status bit of ‘0’, the system makes the data address generated by the processor core 23 executing the data load instruction be output via the bus 94, bypassed by 102 via the bus 46 and the selector 54, to be matched in 51. If the address is not matched, then as in the exemplary embodiment of
The system further uses the above 65 and 62 along with the L3 sub-address 126 in the data address to read the L3 data block from the memory 122, stores it via the bus 115 in the L3 data cache memory 160 in the L3 cache block selected by the L3 data block number 128 given by the L3 data cache replacement logic, stores that L3 block number 128 in the entry field in the AL4 120 pointed to by 65, 62, and 126, and sets that field to ‘valid’. At the same time, the 65 and 62 (L4 block number) are stored into the entry of the DCT 174 pointed to by the above 128. In addition, the scanner 43 calculates the address of the next L3 data block after that L3 data block (i.e., the data address plus the size of an L3 data block), and sends the address to the tag unit in 51 to match into a BN4 address. That BN4 address is used to access the AL4 120 to map it into the DBN3X address, which is combined with the DBNY 13 in the data address to get the DBN3 address. The resulting DBN3 or BN4 address is stored into field 132 of the entry in the DTT3 164 pointed to by the above 128. If the next L3 data block is still in the same cache block, then ‘1’ is added to the above 126 and combined with the original 65, 62 to get the DBN3 address of the next sequential L3 data block, without going through the tag unit in 51 for mapping. Alternatively, the next L3 data block may also be filled into the L3 data cache memory 160 with the corresponding entries in 120 and 174 filled as described above; generally, the L3 data block after the next L3 data block does not need to be filled into 160.
The system further uses the above 128 along with the L2 sub-address 63 in the data address to read the L2 data block from the DL3 160, stores it in the L2 data cache memory 161 in the L2 cache block selected by the L2 data block number 67 given by the L2 data cache replacement logic, stores that L2 block number 67 in the entry field in the AL3 167 pointed to by 128 and 63, and sets that field to ‘valid’. At the same time, that 128 (L3 block number) is stored into the entry of the DCT2 175 pointed to by the above 67. Alternatively, ‘1’ is added to the above 63 and combined with 128 to address the AL3 167; if the entry is ‘valid’, the next L2 cache block is already in the L2 cache; if the entry is ‘invalid’, then the combined address of 128 and 63 plus ‘1’ is used to read the L2 data block from the DL3 memory 160, store it into the DL2 memory 161 in another L2 data cache block pointed to by the L2 cache block number 67 given by the L2 cache replacement logic, store that other 67 into the entry pointed to by the combined address of 128 and 63 plus ‘1’, and set that entry to ‘valid’.
If the address of the next L2 data block exceeds the boundary of the L3 cache block, the entry pointed to by the 128 in the DTT3 164 is read out via the bus 190. If the content of the entry is in the BN4 format, the BN4 address is used to access the AL4 120 via the bus 197. If the entry of 120 is valid, the DBN3 address in the entry is stored into the entry of 164 pointed to by the 128, replacing the original BN4. If the entry of 120 is invalid, that BN4 address on the bus 197 is used to access the memory 122 to read the next L3 data block and store it in the memory 160, and the corresponding entries in 164, 167, 174, and 120 are filled in the manner described above. This ensures that when the content of an L3 data block is stored into the L2 data cache, the next L3 data block is stored in the L3 data cache. Alternatively, when the entry of the DTT3 164 pointed to by the above 128 is in the DBN3 format, the DBN3 is used to address the AL3 167 via the bus 190 as described above, so that the next L2 data block after the L2 data block currently being filled is also filled into 161. It is also possible to store the previous data block into the data cache as needed, in which case field 130 of the track table is used. It is also possible to not use the data track tables 164, 165, 166 at all; in that case the system does not have the function to automatically fill the next or previous L2 data block that exceeds the L3 data cache boundary. The prefetch of the other data memory levels is done in the same way.
The system further reads the L1 data block from the L2 data cache memory 161 using the combination of the above 67 with the L1 sub-address 64 in the data address, and stores the L1 data block into the L1 data cache memory 162, in the L1 data cache block pointed to by the L1 data block number 68 given by the L1 data cache replacement logic; it also stores that L1 data block number 68 into the entry field of the DAL2 168 pointed to by 67 and 64 and sets the field to ‘valid’. At the same time, the 67 (L2 block number) is stored in the entry of the DCT1 176 pointed to by the above 68. Further, the entry of DTT2 165 pointed to by the above 67 is read out. If the content of the entry is in the BN3X format, that BN3 address is used to access the DAL3 167 via the bus 185; if the entry of 167 is ‘valid’, the BN2X address in the entry of 167 is written back to the 165 via bus 189 to replace the BN3X address. If the entry of 167 is ‘invalid’, the address on 185 is used to address the DL3 160 to read the L2 data block and store it into the DL2 161, in the L2 cache block pointed to by another L2 cache block address 67 given by the cache replacement logic. That another 67 is also stored in the entry of DAL3 167 addressed by the 185, and is also stored in the DTT2 165 to replace the BN3X address. The address 67 is also used to establish corresponding entries in the DAL2 168 and the DCT2 175 for the above L2 cache block. This ensures that when the content of an L2 data block is stored in the L1 data cache, the next L2 data block is stored in the L2 data cache.
The system further combines the above 68 with the DBNY 13 in the data address to form the L1 data cache address DBN1, stores it into the field 138 of the row of the stride table 150 corresponding to the above data load instruction, and sets the status field 139 of that row to ‘1’. According to the status of ‘1’, the system uses the above DBN1 to access the L1 data cache memory 162, reads the data, and stores it into the entry in the DRB 163 corresponding to the above data load instruction, so that the data can be pushed to the processor core 23 for processing along with the instruction. When the data is pushed to the processor core 23, the system starts prefetching the next data into the DRB for the time when the same data load instruction is executed again. Because the status field 139 is ‘1’, the process of prefetching data for the push is exactly the same as described above, except that when generating the new 68 and 13 (DBN1), the system first subtracts the last DBN1 in the field 138 of that row of the stride table 150 from the new DBN1, and uses the difference as the stride to store into the entry selected by the branch decision, such as 140. After that, the new DBN1 is written into field 138 to replace the old address, and the status field 139 is set to ‘2’.
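The stride-table update described above (status ‘1’ to ‘2’, with the stride taken as the new DBN1 minus the last DBN1) can be sketched as follows; the dictionary keys stand in for fields 138, 139, and 140 and are assumptions:

```python
def update_stride_row(row, new_dbn1):
    """One stride-table update: on the second access, the stride is the
    difference between the new and last DBN1; the row is then armed
    (status '2') for guessed prefetching. Key names model fields 138/139/140."""
    if row['status'] == 1:
        row['stride'] = new_dbn1 - row['dbn1']   # difference stored as the stride
        row['status'] = 2                        # ready for address guessing
    row['dbn1'] = new_dbn1                       # field 138 always holds the last DBN1
    return row
```

After this update, the next execution of the same load instruction no longer waits for the core to compute an address; the guessed address is `dbn1 + stride`.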
When the second data is pushed to the processor core 23, and a branch instruction after the data load instruction has its branch decision as ‘take the branch’, the system starts prefetching the next data into the DRB for the push of the next execution of the same data load instruction. Because the status field 139 is ‘2’ at this time, the system no longer waits for the processor core 23 to calculate the data address. Instead, it directly outputs the DBN1 address in the field 138 of the row of the stride table 150 corresponding to the data load instruction together with the branch stride selected by the branch decision (e.g., 140), and adds them in the adder 173. The system also performs a boundary check on the output 181 of 173. If 181 does not exceed the boundary of the L1 data cache block, the selector 192 selects 181 to access the L1 data cache memory 162, and the data is read out and stored into the corresponding entry in the DRB for push; the address on 181 is stored as DBN1 in the corresponding row of the stride table. If 181 exceeds the boundary of the L1 data cache block but does not exceed the adjacent L1 cache block boundary, 181 is used to address the DTT1 166 to read the DBN1X address 132 of the next L1 data block (or the DBN1X address 130 of the previous data block) and output it via bus 191; that DBN1X address is selected by the selector 192 and combined with the DBNY address 13 on 181 to access the memory 162, and the data is read and stored into the corresponding entry in the DRB for push. The combined address DBN1 is stored in the field 138 of the corresponding row of the stride table 150. In both cases, the status field 139 in 150 remains unchanged at ‘2’. If the address 132 output by 166 is in the BN2X format, the system uses the BN2X address to access the DAL2 168 via the bus 191. If the entry of 168 is valid, the BN1X address in the entry of 168 is written back to the 166 via bus 184 to replace the BN2X address.
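The boundary check on the adder output 181 can be sketched as follows; the block size and the three-way classification are illustrative assumptions, with `'adjacent_block'` standing for the case served by the DBN1X addresses 132/130 in DTT1 166:

```python
L1_BLOCK_WORDS = 16   # assumed L1 data block size in words

def classify_sum(dbn1x, dbny, stride):
    """Boundary check modeled on the output 181 of adder 173: the guessed
    address may land in the same L1 block, the adjacent L1 block, or farther
    away (requiring the climb through the lower cache levels)."""
    total = dbny + stride
    block_delta, new_dbny = divmod(total, L1_BLOCK_WORDS)
    if block_delta == 0:
        return ('same_block', dbn1x, new_dbny)            # 181 used directly
    if block_delta in (1, -1):
        return ('adjacent_block', block_delta, new_dbny)  # DTT1 supplies the DBN1X
    return ('out_of_l1', block_delta, new_dbny)           # fall through to DCT1/DCT2 path
```

`divmod` with a negative total yields a negative block delta, so a negative stride crossing into the previous block is classified the same way as the forward case.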
If the entry of 168 is ‘invalid’, the address on 191 is used to address the L2 data cache memory 161 to read the L1 data block and store it into the L1 data cache memory 162, in the L1 cache block pointed to by the L1 cache block address 68 given by the L1 cache replacement logic. The 68 is also stored in the entry of DAL2 168 addressed by the 191, and is also stored in the DTT1 166 to replace the BN2X address.
If 181 exceeds the above boundary but does not exceed the L2 cache block boundary, the system uses the DBN1 address 138 to address the DCT1 176, maps the DBN1 address to a DBN2 address, and outputs it via the bus 182. The adder 172 adds the DBN2 address on 182 to the stride 140, and the output of the adder is used to address the DAL2 168. If the entry is valid, the DBN1X address in the entry is combined with the DBNY 13 on 183, and the combined address is used to access the L1 data cache memory 162 via bus 184, and the data is read and stored into the entry in the DRB for push; the DBN1 address on 184 is stored into the field 138 of the corresponding row of the stride table 150, and the field 139 remains unchanged at ‘2’. If the entry of the DAL2 168 is invalid, the 183 is used to address the L2 data cache memory 161 to read the L1 data block and store it into the L1 cache memory 162, in the L1 data cache block pointed to by the L1 data block number 68 given by the L1 data cache replacement logic. The system also combines the 68 with the DBNY on 183 to form the DBN1 address to access 162, and reads the data to store into the DRB entry for push; the DBN1 address is stored into the field 138 of the corresponding row of the stride table 150, and the field 139 remains unchanged at ‘2’.
If 181 exceeds the L2 cache block boundary but does not exceed the L3 cache block boundary, the system uses the DBN2 address on bus 182 to address the DCT2 175, maps the DBN2 address to a DBN3 address, and outputs it via the bus 186. The adder 171 adds the DBN3 address on 186 to the stride 140, and the output 188 of the adder is used to address the DAL3 167. If the entry is valid, the DBN2X address in the entry is combined with the DBNY 13 on 188, and the combined address is used to access the DAL2 168 via bus 189. If the entry of 168 is ‘valid’, the DBN1X address is directly combined with the DBNY 13 on bus 188 to form the DBN1 address, which is used to access the L1 data cache memory 162 via bus 184, and the data is read and stored into the entry in the DRB for push; the DBN1 address on 184 is stored into the field 138 of the corresponding row of the stride table 150, and the field 139 remains unchanged at ‘2’. If the entry of the 168 is invalid, the DBN2 address on bus 189 is used to address the L2 data cache memory 161 to read out the L1 data block, which is stored into the L1 data cache memory 162, in the L1 cache block pointed to by the L1 data cache block number 68 given by the L1 data cache replacement logic; the 68 is also stored into the entry of 168 addressed by the bus 189, and the entry is set to ‘valid’. The system also combines the 68 with the DBNY on 189 to form the DBN1 address to access 162, and reads the data to store into the DRB entry for push; the DBN1 address is stored into the field 138 of the corresponding row of the stride table 150, and the field 139 remains unchanged at ‘2’.
If 181 exceeds the L3 cache block boundary but does not exceed the L4 cache block boundary, the system uses the DBN3 address on bus 186 to address the DCT3 174, maps the DBN3 address to a DBN4 address, and outputs it via the bus 196. The adder 170 adds the DBN4 address on 196 to the stride 140, and the output 197 of the adder is used to address the DAL4 120. If the entry of 120 is ‘valid’, the DBN3X address in the entry is combined with the DBNY 13 on bus 197, and the combination is used to access the DAL3 167 via bus 125. If the entry of 167 is ‘valid’, the DBN2X address in the entry is directly combined with the DBNY 13 on bus 125 to form the DBN2 address, which is used to access the DAL2 168 via bus 189. If the entry of 167 is ‘invalid’, the DBN2 address on bus 189 is used to address the L2 data cache memory 161 to read out the L1 data block, which is stored into the L1 data cache memory 162, in the L1 cache block pointed to by the L1 data cache block number 68 given by the L1 data cache replacement logic; the 68 is also stored into the entry of 168 addressed by the bus 189, and the entry is set to ‘valid’. The operations in which the system uses the DBN2 address on bus 189 to access the DAL2 168, and those following, are the same as described in the previous paragraph. Finally, the system uses the DBN1 address to access 162, and reads the data to store into the DRB 163 entry for push; the DBN1 address is stored into the field 138 of the corresponding row of the stride table 150, and the field 139 remains unchanged at ‘2’.
If 181 exceeds the L4 cache block boundary, the system uses the BN4 address on bus 196 to address the tag unit in 51 to read the corresponding tag 61 and sends it to the adder 169. The adder 169 adds the tag 61 to the stride 140, and its sum 198 is selected by the selector 54 and then sent to the tag unit in 51 for matching. If the matching generates a new BN4 address, that new BN4 address is used to address the AL4 120 via bus 123. If the entry in 120 is ‘valid’, the DBN3X address in the entry is used to address the DAL3 167 via bus 125. The subsequent operation of addressing 167 via the bus 125 is the same as in the previous paragraph. If the entry of 120 is ‘invalid’, the new BN4 address on bus 123 is used to address the memory 122 to read out the L3 data block and store it into the L3 data cache memory 160, with the operations as described above. If there is no match in the tag unit, the address on bus 198 is put onto bus 113 to address the memory 111 to read out the L4 data block and store it into the L4 cache memory 122. The process has been described above in this exemplary embodiment and will not be repeated. Finally, the system uses the DBN1 address, obtained by the mapping through each level of active list, to access 162, and reads the data to store into the DRB entry for push; the DBN1 address is stored into the field 138 of the corresponding row of the stride table 150, and the field 139 remains unchanged at ‘2’. If, in the process, the corresponding data block does not exist at some memory hierarchy level, the system automatically reads the data block from the lower memory level into the cache block specified by the cache replacement logic at the present level. The cache block address is also stored into the lower-level active list, and the lower-level cache block number is stored in the correlation table, establishing a two-way mapping relationship.
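The chain of boundary checks in the preceding paragraphs, escalating from the L1 block to the tag unit, can be summarized in a sketch; the block sizes are assumed values and the model deliberately ignores alignment and mapping details:

```python
# assumed block sizes in words, each level enclosing the one above
BLOCK_WORDS = {'L1': 16, 'L2': 64, 'L3': 256, 'L4': 1024}

def level_for(offset_in_l1_block, stride):
    """Pick the lowest cache level whose block still contains the guessed
    address, mirroring the chain of checks on buses 181/183/188/197; beyond
    the L4 block the request falls back to the tag unit (51 in the text)."""
    target = offset_in_l1_block + stride
    for level, words in BLOCK_WORDS.items():
        if 0 <= target < words:
            return level
    return 'tag_match'
```

A small stride is satisfied at L1, a larger one forces the climb through the correlation tables and active lists of the lower levels, and only a stride beyond the L4 block requires a full tag match.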
The above describes the push process for data loading. Data storing can be handled in a similar way, or by conventional methods such as storing into write buffers and, when the data cache is idle, writing the data in the write buffer back to the data cache. When using the stride in the stride table 150 to guess and load the data (that is, when the field 139 in 150 is ‘2’), the processor core must send the correct data address via the bus 49 to be compared with the guessed DBN1 address. If they differ, the loaded data and its subsequent execution results must be discarded, the data must be loaded with the correct data address on the bus 49, and the corresponding field 139 is set to ‘0’ so that the stride is recalculated and stored into 150. If there is a write buffer, the guessed load address also needs to be compared with the addresses in the write buffer to make sure that the loaded data is up to date. The DBN address can be mapped to a data address to be compared with the data address on 49; alternatively, the address on 49 can be mapped to a DBN address to be compared with the DBN address generated by the system’s guess. In addition, if the valid bit of the stride selected by the branch decision (such as 141) is ‘invalid’ in the stride table 150, the stride under that branch decision also needs to be generated as described above.
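The verification of a guessed address against the correct address the core sends on bus 49 can be sketched as follows; the dictionary keys model fields 138 and 139 and are assumptions:

```python
def verify_guess(row, guessed_dbn1, correct_dbn1):
    """Compare the guessed DBN1 against the address the core computed.
    On a mismatch, the speculative load must be discarded and the row falls
    back to status '0' so the stride is relearned from the correct address."""
    if guessed_dbn1 == correct_dbn1:
        return True                 # guess confirmed; speculation stands
    row['status'] = 0               # recalculate the stride from scratch
    row['dbn1'] = correct_dbn1      # restart from the correct address
    return False
```

The discard of in-flight results on a mismatch is not modeled here; only the stride-table side effect is shown.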
The lowest level cache of the data cache hierarchy in the exemplary embodiment of
The following description will be made with reference to
If the sum 181 exceeds the boundary of the L2 cache block 203, the DBN1-format address in 138 needs to be mapped to the DBN2 format 182 via the DCT1 176, and the DBN2 format is mapped to the DBN3 format by the DCT2 175. The DBN3-format address is then added to 140, and the sum is used to address the DAL3 167 to get the entry corresponding to the L3 cache block 201, reading the DBN2 address 189 of the L2 cache block 204. The 189 is then used to address the DAL2 168 to get the address DBN1 of the L1 cache block 207. That address can then be used to address the L1 cache memory 162 via bus 184 to read data to store into DRB 163, and that address is stored into field 138 of 150. If the sum 181 exceeds the boundary of the L3 cache block 201, the DBN1 address in 138 is mapped to a DBN2-format address via 176, then to the DBN3 format via 175, and then to the BN4 format via 174; that BN4 address is used to access AL4 120 to obtain the DBN3-format address 125; that DBN3 address is used to access DAL3 167 to obtain the DBN2 address 189; that DBN2 address is used to access DAL2 168 to obtain the address DBN1 of the L1 cache block 207. That address can then be used to access the L1 cache memory 162 via bus 184 to read out the data to store into DRB 163, and that address is stored into the field 138 of 150.
The cache blocks at each level in the data cache hierarchy in the embodiment of
Please refer to
The operation is similar to that of the embodiment of
Please refer to
The learning engine 226 is responsible for generating entries for the data track table (DTT). Entries 230-232 in DTT 166 correspond to data 220-222 in 162. Each entry in 166 has a ‘valid’ bit; the data type entry 230 corresponds to the data entry 220, and the pointer entries 231 and 232 respectively contain the address pointers in 221 and 222 in the DBN format. Data type entries and pointer entries each have their own identifiers to distinguish them. The DBN format can address the data memory 162 directly.
The data read pointer 181 controls the reading of a row of track from the data track table 166. If the DBNY value in the pointer is close to the end of a row, then, according to the BN address at the end track point of that row, the next row in address order is also read out, and both are sent to the shifter 225. In 225, the one or two rows of track are shifted left by the amount of the DBNY value in the data read pointer 181. The learning engine 226 receives the shifted plurality of entries and identifies the data type entry 230 according to the identifiers in these entries. 226 also determines the operations on the pointer entries 231 and 232 according to the data type in the data type entry 230. The comparison result 228 generated by the processor core 23 controls the selector 227 to select one out of a plurality of pointers output from 226 to put onto the data read pointer 181 to address the data memory (DL1) 162 to serve data to the processor core 23.
For example, the data value in the entry 220 in the data memory 162 is ‘6’, the entry 221 contains a 32-bit address, and the entry 222 contains the 32-bit address ‘R’. Correspondingly, the data type in entry 230 of the data track table 166 is the binary tree, and the control signal is the comparison result 228 generated by the processor core 23 executing the instruction at the address ‘YYY’; the 231 contains the DBN-format address pointer ‘DBNL’ obtained by address mapping of the address pointer in 221; the 232 contains the DBN-format address pointer ‘DBNR’ obtained by address mapping of the ‘R’ address pointer in 222. The learning engine 226 checks the plurality of entries from the shifter 225 and selects the data type entry 230 based on the identifiers. According to the binary tree data type in 230, the 226 outputs the entries 231 and 232 from the shifter 225 to the two inputs of the selector 227. Assume that the instruction with the instruction address ‘YYY’ compares the value to be searched, ‘8’, with the value ‘6’ of 220 loaded from (DL1) 162 into 23, and the result of the comparison is ‘1’, meaning that the searched value is greater than the value in the current node 220. 226 monitors the address 28 that controls the L1 memory 22; when it reaches ‘YYY’, 226 lets the comparison result 228 from the processor core control the selector 227. 228 at this time controls the 227 to select the right branch pointer ‘DBNR’ in the 232 to be output to the data read pointer 181. If the valid bit in 232 is ‘valid’, the data pointed to by the right branch pointer in 232 becomes the new current data. The selector 192 selects the 181 to address the DL1 162 to output the new current data to store into DRB 163. The 181 also addresses DTT 166, making 166 output the corresponding data track containing the new current data to the shifter 225.
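Stripped of the hardware, the traversal amounts to an ordinary binary search tree walk, with the comparison result steering between the left (DBNL) and right (DBNR) pointers just as 228 steers selector 227; this toy model uses a plain dictionary in place of the data memory 162:

```python
def search(nodes, root, key):
    """Walk a binary search tree stored as {addr: (value, left, right)},
    mirroring entries 220-222: a node's value plus its two address pointers."""
    addr = root
    while addr is not None:
        value, left, right = nodes[addr]
        if key == value:
            return addr
        # comparison result '1' means the searched value is greater,
        # so take the right branch pointer (DBNR); otherwise the left (DBNL)
        addr = right if key > value else left
    return None
```

In the hardware version, the next node's value and pointers are already staged in the DRB when the comparison executes, which is exactly the latency the track table mechanism is built to hide.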
The block offset part DBNY of the address on 181 controls the shifter 225 to shift the data track to the left, so that the data type, the DBNL address, and the DBNR address (as in the formats of 230, 231, 232) align with the inputs of the learning engine 226.
Each entry of DRB 163 corresponds to a block offset address (DBNY), and the 162 (DL1) stores the entire data block into 163 (if the data specified by data type 230, such as 220-222, exceeds one data block, the data beginning from the ‘DBNR’ address is moved to the next data block in address order). The processor core 23 uses the offset part of the data address 94 generated by executing the load instruction to address the DRB 163, reading the current data and its left branch address pointer and right branch address pointer (in the format of 220, 221, 222). The processor core 23 executes the instruction, compares the search value ‘8’ with the current data, and generates the comparison result 228.
The learning engine 226 monitors the address 28, the comparison result 228 generated by the processor core 23, the data address 94, and the corresponding data 223 output by the (DL1) 162, to generate data track entries to store into the DTT 166. When the corresponding entry in 166 is ‘invalid’ (not yet established), the data cache system sends the data address 94 generated by the processor core 23 to the tag unit 51 (not shown in the figure), etc., to be matched and mapped to the DBN address 184. 184 addresses the data memory 162, reading the data to send to the processor core 23 via 223. The learning engine 226 records the address on 94, and the data on 223 output from the entry in the data memory 162 addressed by that address. 226 also compares the newly generated data address 94 with the previously recorded data on 223; if they are the same, the learning engine 226 matches and maps the newly generated data address 94 to the DBN, stores the DBN into entries of DTT 166, and sets these entries to ‘valid’. The DTT entries are those corresponding to the data entries read out on bus 223. That is, the ‘DBNL’ obtained by matching and mapping the address pointer in 221 is stored into the 231, and the ‘DBNR’ obtained by matching and mapping the address pointer ‘R’ in 222 is stored into the 232. Alternatively, 226 may record and compare the mapped BN-format data and addresses.
226 determines that a data memory 162 entry satisfying the following conditions is a ‘data’ (non-pointer) entry: the data address of the entry itself is only one or a few data lengths away from the entry containing the address pointer, and the data on 223 is never the same as the subsequent addresses on 94 over a plurality of instruction loops. The range of the instruction loop may be determined by the backward branch instruction address and its branch target instruction address in IRB 39. The entry of DTT 166 corresponding to the ‘data’ entry in the data memory 162 is the data type entry. The learning engine stores the pattern learned by monitoring (here, that when the address 28 is ‘YYY’, the BN address in 231 is selected if 228 is ‘0’, and the BN address in 232 is selected if 228 is ‘1’) into the data track table entry (230 here) corresponding to the ‘data’ (220 here), and sets that entry to ‘valid’. The valid bits in the data type entry may be a plurality of bits: if the value is greater than a preset value the entry is ‘valid’, and otherwise it is ‘invalid’.
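The pointer-versus-data heuristic can be sketched as a single predicate over what 226 records from buses 223 and 94; the function name and argument shapes are illustrative:

```python
def looks_like_pointer(observed_values, later_addresses):
    """The learning engine's heuristic in sketch form: an entry whose loaded
    value later reappears as a data address is treated as a pointer; one whose
    value never matches a subsequent address over many loop iterations is
    plain data. Arguments model recordings from buses 223 and 94."""
    return any(v in later_addresses for v in observed_values)
```

A node's payload (‘6’ in the example) fails this test, while its left and right fields succeed, which is how 226 decides which DTT entries become data type entries and which become pointer entries.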
After the data track table entry is established, the processor core 23 executes the instruction to generate the comparison result 228, which controls the selector 227 to select the address pointer, moving the data read pointer 181 along the binary tree. When a new data node is reached, according to its data type (e.g., 230), the learning engine 226 controls the reading of the data and its address pointers of the same group (e.g., 220-222) from the data cache 162 and stores them in the DRB 163, ready to be read by the data address 94 generated by the processor core 23. This avoids the delay of matching the data address 94 in the tag unit and of the subsequent access to the data memory 162. The access latency of DRB 163 is a single clock cycle, typically less than the access latency of 162.
Further, the data read buffer may be organized as in the embodiment of
The learning engine 226 performs learning. The result of the learning is stored in the data track table 166 in the form of data types and address pointers. The data type read from the data track table is used to control the processing of the other entries that 226 itself reads from the data track, such as moving an entry at an input of 226 to a particular output of 226, or controlling the polarity of the comparison result 228, so that the selector 227 selects the correct address pointer under the control of 228 to place onto the data read pointer 181 and address the data memory 162 to output data (e.g., 220). The data type also controls 226 to generate and output a single or a plurality of subsequent addresses (adding an increment to the correct pointer address, where the increment is an integer multiple of the data word length), which address the 162 to output the other data of the same group (such as 221, 222). The data type is therefore the control setting for the 226: for example, the IRB address or tag at which the comparison result 228 is generated, the polarity of the 228, and the number of subsequent addresses that need to be generated. The learning engine 226 also compares the DBN address on the bus 181 with the DBN 184 matched and mapped from the data address 94 generated by the processor core 23. If they are different, it subtracts ‘1’ from the valid value in the corresponding data type entry in the DTT 166, and puts the DBN 184 obtained by mapping onto the bus 181 to address the data memory 162 to read the correct data, and also to address the DTT 166 to read the corresponding track table entry. The learning engine 226 relearns the 166 entries whose valid value is reduced to ‘0’.
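The multi-bit valid value acting as a confidence counter can be sketched as follows; the counter width, threshold, and method names are assumed:

```python
class DataTypeEntry:
    """Sketch of the multi-bit valid value: above the threshold the entry is
    'valid'; each mispredict subtracts 1, and when the value reaches 0 the
    learning engine must relearn the entry. Width and threshold are assumptions."""
    THRESHOLD = 1

    def __init__(self, confidence=3):
        self.confidence = confidence

    @property
    def valid(self):
        return self.confidence > self.THRESHOLD

    def mispredict(self):
        """Called when the guessed DBN on 181 differs from the mapped DBN 184.
        Returns True when the entry has decayed to 0 and must be relearned."""
        self.confidence = max(0, self.confidence - 1)
        return self.confidence == 0
```

Saturating at 0 rather than wrapping keeps a long run of mispredictions from re-validating the entry by accident.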
The exemplary embodiment of
The instruction type (field 11) of the indirect branch instruction can also be subdivided to provide guidance to the cache system. There is a class of indirect branch instructions that jump to the same instruction address each time they are executed, or whose generated instruction address is incremented each time by a ‘stride’ over the instruction address generated by the last execution. This type of indirect branch instruction is recorded as ‘duplicated’ in the track table entry 11, and the stride table 150 in
Refer to
TRB 238 stores the tracks corresponding to the instruction blocks stored in IRB 39. The processor core 23 has two front-end pipelines, FT (Fall Through) and TG (Target). The tracker 0 (TR0) 48 provides the BNY increment 38 to control the IRB 39 to supply the FT pipeline of the processor core 23 with the sequential instruction stream, while the tracker 1 (TR1) 47 reads the TG address along the track in the TRB in advance. A TG address in the BN1 format addresses the L1 instruction memory 22, and a TG address in the BN2 format addresses the L2 instruction memory 42, each reading the TG instruction. Based on whether, in program sequence, the BN1 or BN2 format should be used, the system controls the selector 239 to select one TG instruction to send to the TG pipeline of the core. The Taken signal 31 selects the output of the FT or TG front-end pipeline to send to the back-end pipeline to complete the execution. When the branch is successful, the TG instruction block corresponding to the branch instruction, from L2 or L1, is selected by the selector 239 to be stored in the IRB 39. The track corresponding to the TG instruction block, from the L2 track table (TT2) 88 or the track table (TT) 20, is also selected by the selector 237 to be stored into the TRB 238 for the TR1 47 to read. If the TG instruction block is read from the L2 instruction memory 42 by the BN2X address on the track, it is also stored into the L1 instruction memory 22, in the L1 memory block pointed to by the BN1X given by the replacement logic. The BN1X is also stored in the entry of the AL2 40 pointed to by the BN2X. A BN3-format address on the track output from the L2 track table 88 is sent to the AL3 50 via bus 89 to be mapped to a BN2 address (or, when the AL3 entry is invalid, it addresses L3 52 and reads the instruction block to store in an L2 memory block in 42, whose block address is BN2X).
The BN2 address replaces the original BN3 address on the track.
By the same principle, a BN2-format address on the track output from TT2 88 or TT 20, or on the track in the TRB 238, can be mapped to the BN1 format by AL2 40 (or can address L2 42 to store the instruction block into L1 22 to obtain a BN1 address). In the present embodiment, the TT2 88 stores TG addresses of the BN3 or BN2 format, the TT 20 stores only addresses of the BN2 or BN1 format, and the TRB 238 allows TG addresses of the BN3, BN2, or BN1 formats. This restriction on the BN address formats in TT2 and TT triggers the moving of instructions from the lower-level memory to the higher-level memory, avoiding the traditional cache mechanism in which cache filling is triggered by a cache miss, and thus avoiding the misses inevitable in that mechanism. It also ensures that the branch target instruction is at the same cache level as, or the adjacent lower cache level to, the direct branch instruction. Since the TR1 47 reads the TG address on the track in advance, it can partially or completely hide the access delay of L2 42 or L1 22. If an instruction block has branch instructions right next to one another, its corresponding track can be deliberately assigned TG addresses in interleaving BN1 and BN2 formats so as to hide the access delays of the 42 and 22 as much as possible. If the address read from the TRB is in the BN3 format, and the corresponding branch is successful (taken), the processor core 23 needs to wait for the BN2 address mapped from that BN3 address (the mapping process begins when the track is output from the TT2 88, so that it can partially or completely hide the latency of AL3 or L3) to fill into the track in TRB 238, and only after that can it execute the branch target instructions. If the corresponding branch is unsuccessful (not taken), the processor core 23 does not wait and directly executes the fall-through instructions, and the mapped BN2-format address is filled into the track after it is obtained.
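The format restriction on the track tables can be sketched as follows; the table names, the allowed-format sets, and the `map_down` stand-in for the active lists (AL3/AL2) are illustrative assumptions:

```python
# Each table accepts only certain BN levels; moving a track into a more
# restrictive table forces out-of-range addresses to be mapped down, which
# is what triggers the instruction fill in the text.
ALLOWED = {'TT2': {3, 2}, 'TRB': {3, 2, 1}, 'TT': {2, 1}}

def promote_track(track, table, map_down):
    """Return a copy of `track` (a list of (bn_level, address) pairs) that is
    legal for `table`, mapping each disallowed BN level one step down at a
    time via `map_down` (modeling AL3 for BN3->BN2, AL2 for BN2->BN1)."""
    out = []
    for level, addr in track:
        while level not in ALLOWED[table]:
            level, addr = map_down(level, addr)
        out.append((level, addr))
    return out
```

With this rule, a track may sit in the TRB with mixed BN3/BN2/BN1 entries, but by the time it is written into TT every BN3 entry has been resolved to BN2 or below, which is the condition the embodiment enforces.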
After all of the BN3-format addresses on a track in the TRB 238 have been replaced with the BN2 format, the track is filled into the row of TT 20 indicated by the BN1X provided by the above replacement logic. In the present embodiment, the system may control the L2 instruction memory 42 or the L1 instruction memory 22 to provide TG instructions to the processor core according to the track output from the TT2 88 or the TT 20, while the IRB 39 provides fall-through instructions for the processor core. In the present embodiment, the process of proceeding to the next instruction block is treated as a branch: the instruction type in the end track point (track entry) of the track is set as an unconditional branch, so that the processing is the same as the above branch processing. The methods and systems in this embodiment may also be applicable to other multi-level instruction track cache memory systems, as shown in
Back to the
The following description uses the network channel as an example. The IPv6 address is 128 bits. Assuming that the memory address is 64 bits, the IPv6 address and the memory address are combined into a 192-bit address to address the remote memory at the other end of the network. In order to support the 192-bit address, only the components 43, 51, and 113 in
The specific embodiment of the above-described application form of the structure of
When the memory 111 and the other modules in
Similarly, the tag unit in 51 can store multiple network memory addresses, for example with each entry being 192 bits, but there are several ways to optimize. One is to use two tables: each entry in table 2 stores the memory address tag and a row number of table 1, while table 1 stores the network addresses. The network address part of the network memory address first matches against the content of table 1 to obtain a table 1 row number. That row number is combined with the memory address and sent to table 2 for matching. The matching result of table 2 is the cache address. If there is no match in table 2, the network memory address is used via bus 113 to fetch the instruction or data from the memory 111 to fill into the memory 112. The other method is to use only table 2, which stores the memory address tag and the row number of the above thread register (or the thread number). In this case the row number of the thread register (or the thread number) is combined with the memory address and sent to table 2 for matching. If there is no match in table 2, the thread register row number (or thread number) is used to address the thread register to read out the network address, which is combined with the memory address to obtain the network memory address; this is sent to the memory 111 via bus 113 to fetch the data or instruction to fill into the memory 112. Thus the actual additional cost is small.
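The two-table optimization can be sketched as follows; the class structure and linear search are illustrative simplifications of the hardware matching, and all names are assumptions:

```python
class TwoLevelTags:
    """Sketch of the two-table scheme: table 1 holds each 128-bit network
    address once; table 2 tags combine a memory-address tag with a table-1
    row number, so the full 192-bit address is never stored per cache line."""

    def __init__(self):
        self.table1 = []     # network addresses, one row each
        self.table2 = {}     # (table1 row, memory tag) -> cache address

    def _net_row(self, net_addr):
        """Match the network address against table 1, allocating a row on miss."""
        if net_addr not in self.table1:
            self.table1.append(net_addr)
        return self.table1.index(net_addr)

    def lookup(self, net_addr, mem_tag):
        """Two-step match: network part first, then (row, memory tag) in table 2.
        None models a table-2 miss, which triggers the fetch over the network."""
        row = self._net_row(net_addr)
        return self.table2.get((row, mem_tag))

    def fill(self, net_addr, mem_tag, cache_addr):
        row = self._net_row(net_addr)
        self.table2[(row, mem_tag)] = cache_addr
```

Because many cache lines from the same remote node share one table-1 row, the per-line storage cost approaches that of an ordinary memory-address tag, which is the point of the optimization.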
The scanner 43 in the embodiment of
While the embodiments of the present disclosure describe only certain structural features and/or methodologies of the present disclosure, it should be understood that the claims of the disclosure are not limited to the described features and processes, and that the various components listed in the above exemplary embodiments are for ease of description only; other components may be included, or some components may be combined or omitted. The described components may be distributed across a plurality of systems, physically or virtually, and can be implemented by hardware (such as integrated circuits), software, or a combination of hardware and software.
No matter how the technology in this field develops and whatever advances may be gained in the future, all replacements, adjustments, and improvements fall within the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
201510201436.1 | Apr 2015 | CN | national |
201510233007.2 | May 2015 | CN | national |
201510267964.7 | May 2015 | CN | national |
201610188651.7 | Mar 2016 | CN | national |
The application is the U.S. National Stage of International Patent Application No. PCT/CN2016/080039, filed on Apr. 22, 2016, which claims priority of Chinese Application No. 201510201436.1 filed on Apr. 23, 2015, and Chinese Application No. 201510233007.2 filed on May 6, 2015, and Chinese Application No. 201510267964.7 filed on May 20, 2015, and Chinese Application No. 201610188651.7 filed on Mar. 21, 2016, the entire contents of all of which are hereby incorporated by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2016/080039 | 4/22/2016 | WO | 00 |