This application is related to the following U.S. Non-Provisional Applications filed concurrently herewith, each of which is a national stage application under 35 U.S.C. 371 of the correspondingly indicated International Application filed Dec. 14, 2014, each of which is hereby incorporated by reference in its entirety.
In one aspect the present invention provides a cache memory for storing 2^J-byte cache lines where J is an integer greater than three, the cache memory comprising: an array of 2^N sets each of which holds tags that each are X bits, wherein N and X are both integers greater than five; wherein the array has 2^W ways; an input that receives a Q-bit memory address, MA[(Q−1):0], having: a tag portion MA[(Q−1):(Q−X)]; and an index portion MA[(Q−X−1):J]; wherein Q is an integer at least (N+J+X−1); and set selection logic that selects one set of the array using the index portion and a least significant bit of the tag portion; comparison logic that compares all but the least significant bit of the tag portion with all but the least significant bit of each tag in the selected one set and indicates a hit if there is a match; and allocation logic that, when the comparison logic indicates there is not a match: allocates into any one of the 2^W ways of the selected one set when operating in a first mode; and allocates into one of a subset of the 2^W ways of the selected one set when operating in a second mode, wherein the subset of the 2^W ways is limited based on one or more bits of the tag portion.
In another aspect, the present invention provides a method for operating a cache memory for storing 2^J-byte cache lines where J is an integer greater than three, the cache memory having an array of 2^N sets each of which holds tags that each are X bits, wherein N and X are both integers greater than five and wherein the array has 2^W ways, the method comprising: receiving a Q-bit memory address, MA[(Q−1):0], having: a tag portion MA[(Q−1):(Q−X)]; and an index portion MA[(Q−X−1):J]; wherein Q is an integer at least (N+J+X−1); and selecting one set of the array using the index portion and a least significant bit of the tag portion; comparing all but the least significant bit of the tag portion with all but the least significant bit of each tag in the selected one set and indicating a hit if there is a match; and when said comparing indicates there is not a match: allocating into any one of the 2^W ways of the selected one set when operating in a first mode; and allocating into one of a subset of the 2^W ways of the selected one set when operating in a second mode, wherein the subset of the 2^W ways is limited based on one or more bits of the tag portion.
In yet another aspect, the present invention provides a processor, comprising: a cache memory that stores 2^J-byte cache lines where J is an integer greater than three comprising: an array of 2^N sets each of which holds tags that each are X bits, wherein N and X are both integers greater than five; wherein the array has 2^W ways; an input that receives a Q-bit memory address, MA[(Q−1):0], having: a tag portion MA[(Q−1):(Q−X)]; and an index portion MA[(Q−X−1):J]; wherein Q is an integer at least (N+J+X−1); and set selection logic that selects one set of the array using the index portion and a least significant bit of the tag portion; comparison logic that compares all but the least significant bit of the tag portion with all but the least significant bit of each tag in the selected one set and indicates a hit if there is a match; and allocation logic that, when the comparison logic indicates there is not a match: allocates into any one of the 2^W ways of the selected one set when operating in a first mode; and allocates into one of a subset of the 2^W ways of the selected one set when operating in a second mode, wherein the subset of the 2^W ways is limited based on one or more bits of the tag portion.
Modern processors are called upon to execute programs that process data sets having widely varying characteristics and that access the data in widely different manners. The data set characteristics and access patterns impact the effectiveness of cache memories of the processor. The effectiveness is primarily measured in terms of hit ratio.
In addition to its size, the associativity of a cache memory can greatly affect its effectiveness. The associativity of a cache memory refers to the possible locations, or entries, of the cache into which a cache line may be placed based on its memory address. The greater the number of possible locations a cache line may be placed into, or allocated into, the greater the associativity of the cache. Some programs benefit from cache memories with greater associativity and some programs benefit from cache memories with lesser associativity.
Embodiments are described in which a cache memory can be dynamically configured during operation of the processor to vary its associativity to be greater than its normal mode associativity and/or to be less than its normal associativity.
Referring now to
The memory address 104 is decomposed into three portions, each having a plurality of bits: a tag portion 112, an index portion 114 and an offset portion 116. The offset 116 specifies a byte offset into the selected cache line. The use of the tag 112 and index 114 are described in more detail below. For ease of illustration, an example memory address 104 is shown in
The cache memory 102 is designed as a plurality of sets by a plurality of ways. For ease of illustration, an example cache memory 102 is shown in
As described in more detail below, on a lookup in fat mode the index 114 is used to select two different sets of the cache memory 102 and the full tag 112 of the memory address 104 is compared against the full tag 106 of each way of the two selected sets to detect a hit; whereas, in normal mode and skinny mode the index 114 and the least significant bit (LSB) of the tag 112 are used to select one set of the cache memory 102 and all but the LSB of the tag 112 of the memory address 104 is compared against all but the LSB of the tag 106 of each way of the selected one set to detect a hit. Operating in fat mode thus doubles the effective associativity and halves the effective number of sets of the cache memory 102. Conversely, when operating in skinny mode, the cache memory 102 limits the ways into which a cache line may be allocated to a subset of the total ways (e.g., from 16 to 8, to 4, to 2, or to 1) based on one or more of the lower bits of the tag 112, which reduces the effective associativity by a factor of two raised to the number of bits of the tag 112 used to limit the subset of ways. In order to transition out of fat mode, a writeback and invalidate operation must be performed on certain cache lines, as described herein. However, the benefit of operating in fat mode for some code streams may be worth the penalty associated with the writeback and invalidate operation. Transitions to or from skinny mode do not require the writeback and invalidate operation.
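The set selection described above can be illustrated with a minimal Python sketch. The field widths are assumptions drawn from the bit ranges cited elsewhere in this description (tag MA[35:16], index MA[15:6], and a 6-bit offset); the function names are hypothetical and do not appear in the embodiments.

```python
# Illustrative sketch only. Assumed example layout of a 36-bit address:
# tag = MA[35:16], index = MA[15:6], offset = MA[5:0].
OFFSET_BITS = 6
INDEX_BITS = 10
TAG_BITS = 20


def split_address(ma):
    """Return the (tag, index, offset) fields of memory address ma."""
    offset = ma & ((1 << OFFSET_BITS) - 1)
    index = (ma >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (ma >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, offset


def selected_sets(ma, fat_mode):
    """Return the set numbers examined on a lookup.

    Normal and skinny modes: one set, numbered {tag LSB, index}.
    Fat mode: two sets, numbered {0, index} and {1, index}.
    """
    tag, index, _ = split_address(ma)
    if fat_mode:
        return [index, (1 << INDEX_BITS) | index]
    return [((tag & 1) << INDEX_BITS) | index]
```

Under these assumptions, an address whose tag LSB is one and whose index is 5 selects set 1029 in normal mode, and sets 5 and 1029 in fat mode.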
Referring now to
Referring now to
At block 302, the cache memory 102 receives a load request from a processing core while the mode 108 input indicates normal mode. The load request includes a memory address 104. Flow proceeds to block 304.
At block 304, the cache memory 102 selects a single set, referred to in
At block 306, the cache memory 102, for each entry in all 16 ways of the selected set J, compares all bits of the memory address 104 tag 112 except the LSB with all bits of the entry tag 106 except the LSB. The compare also checks to see if the entry is valid. Flow proceeds to decision block 308.
At decision block 308, the cache memory 102 determines whether the compare performed at block 306 resulted in a valid match. If so, flow proceeds to block 312; otherwise, flow proceeds to block 314.
At block 312, the cache memory 102 indicates a hit. Flow ends at block 312.
At block 314, the cache memory 102 allocates an entry in the selected set J. Preferably, the cache memory 102 allocates an entry from a way in set J that was least-recently-used (LRU) or pseudo-LRU, although other replacement algorithms may be employed, such as random or round-robin. Flow ends at block 314.
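The normal mode flow of blocks 302 through 314 might be modeled as follows. This is a sketch under simplifying assumptions: a set is a Python list of 16 entries in which `None` stands for an invalid entry, and true LRU is tracked with a plain list of way numbers. The function name is hypothetical.

```python
NUM_WAYS = 16  # ways per set in the example cache


def lookup_normal(cache_set, tag, lru_order):
    """Model one normal mode lookup in the already selected set.

    cache_set: list of NUM_WAYS entries; None means invalid, otherwise
               the stored tag.
    lru_order: list of way numbers, least recently used first.
    Returns ("hit", way) or ("allocate", way).
    """
    # Block 306: compare all tag bits except the LSB; invalid entries never match.
    for way, entry in enumerate(cache_set):
        if entry is not None and (entry >> 1) == (tag >> 1):
            lru_order.remove(way)
            lru_order.append(way)  # the hit way becomes most recently used
            return "hit", way
    # Block 314: allocate into the least recently used way of the set.
    victim = lru_order.pop(0)
    cache_set[victim] = tag
    lru_order.append(victim)
    return "allocate", victim
```

The LSB can be dropped from the comparison because it has already been consumed in selecting the set, so all entries of a given set share the same tag LSB.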
Referring now to
Referring now to
At block 502, the cache memory 102 receives a load request from a processing core while the mode 108 input indicates fat mode. The load request includes a memory address 104. Flow proceeds to block 504.
At block 504, the cache memory 102 selects two sets, referred to in
At block 506, the cache memory 102, for each entry in all 32 ways of the selected sets J and K, compares the memory address 104 tag 112 with the entry tag 106. The compare also checks to see if the entry is valid. Flow proceeds to decision block 508.
At decision block 508, the cache memory 102 determines whether the compare performed at block 506 resulted in a valid match. If so, flow proceeds to block 512; otherwise, flow proceeds to block 514.
At block 512, the cache memory 102 indicates a hit. Flow ends at block 512.
At block 514, the cache memory 102 selects one of sets J and K to be a replacement set. In one embodiment, the cache memory 102 selects the replacement set based on a hash of selected bits of the memory address 104 to a single bit such that if the hash yields a binary zero set J is selected and if the hash yields a binary one set K is selected, which generally serves to select the replacement set in a pseudo-random fashion. In another embodiment, the cache memory 102 selects the replacement set using an extra one or more bits of the replacement information stored for each set in addition to the information stored to select, for example, the LRU way of the set. For example, one extra bit may indicate whether set J or K was LRU. Flow proceeds to block 516.
At block 516, the cache memory 102 allocates an entry in the replacement set. Preferably, the cache memory 102 allocates an entry in the replacement set according to a least-recently-used (LRU) or a pseudo-LRU replacement scheme, although other replacement algorithms may be employed, such as random or round-robin. Flow ends at block 516.
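The hash-based choice of replacement set at block 514 can be sketched as below. The description does not specify which address bits are hashed or which hash is used; this example assumes, purely for illustration, XOR parity over the tag bits MA[35:16]. The function name and the `hash_mask` parameter are hypothetical.

```python
def replacement_set(ma, set_j, set_k, hash_mask=0xFFFFF0000):
    """Choose the replacement set for a fat mode miss.

    The one-bit hash is assumed to be the XOR (parity) of the address
    bits selected by hash_mask; the default mask covers MA[35:16].
    A hash of zero selects set J and a hash of one selects set K.
    """
    hash_bit = bin(ma & hash_mask).count("1") & 1
    return set_k if hash_bit else set_j
```

Because the hash depends only on the address, repeated misses to the same line always choose the same set, while different lines spread across sets J and K in a roughly even, pseudo-random fashion.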
Referring now to
At block 602, the cache memory 102 is instructed to transition out of fat mode, i.e., the mode 108 transitions from fat mode to either normal mode or skinny mode. Flow proceeds to block 604.
At block 604, the cache memory 102 searches through each set of the cache memory 102 (i.e., for each set number), and for each entry in the set, compares the LSB of the tag 106 with the MSB of the set number. If there is a mismatch, the cache memory 102 invalidates the entry. However, before invalidating the entry, if the status indicates the cache line is dirty, or modified, the cache memory 102 writes back the cache line data to memory. This operation serves to maintain coherency of the cache memory 102. Flow ends at block 604.
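The writeback and invalidate sweep of block 604 might look like the following sketch. It assumes an 11-bit set number (2048 sets, consistent with a 10-bit index plus the tag LSB) and represents each entry as a dictionary; the `write_back` callable and all names are hypothetical.

```python
SET_NUMBER_BITS = 11  # assumed: 2048 sets in the example cache


def exit_fat_mode(sets, write_back):
    """Invalidate entries that non-fat modes could not find.

    sets: list indexed by set number; each set is a list of entries,
          where an entry is None (invalid) or a dict with keys
          "tag", "dirty" and "data".
    write_back: callable invoked with each dirty entry before it is
          invalidated.
    """
    for set_number, ways in enumerate(sets):
        set_msb = (set_number >> (SET_NUMBER_BITS - 1)) & 1
        for way, entry in enumerate(ways):
            if entry is None:
                continue
            # In non-fat modes the tag LSB selects the set, so it must
            # equal the MSB of the set number the line resides in.
            if (entry["tag"] & 1) != set_msb:
                if entry["dirty"]:
                    write_back(entry)
                ways[way] = None
```

Entries whose tag LSB already matches the MSB of their set number stay valid, because a normal or skinny mode lookup would select exactly that set for them.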
A potential disadvantage of running the fat mode is that it potentially consumes greater power than non-fat modes since two sets worth of tags must be compared. However, the tradeoff of power consumption for additional cache effectiveness may be desirable for some users in some systems. Additionally, in a multi-core processor, if fewer than all the cores are running, the additional tag array accesses (e.g., in the embodiment of
Referring now to
Referring now to
At block 802, the cache memory 102 receives a load request from a processing core while the mode 108 input indicates skinny-DM mode. The load request includes a memory address 104. Flow proceeds to block 804.
At block 804, the cache memory 102 selects a single set, referred to in
At block 806, the cache memory 102, for each entry in all 16 ways of the selected set J, compares all bits of the memory address 104 tag 112 except the LSB with all bits of the entry tag 106 except the LSB. The compare also checks to see if the entry is valid. Flow proceeds to decision block 808.
At decision block 808, the cache memory 102 determines whether the compare performed at block 806 resulted in a valid match. If so, flow proceeds to block 812; otherwise, flow proceeds to block 814.
At block 812, the cache memory 102 indicates a hit. Flow ends at block 812.
At block 814, the cache memory 102 allocates the entry in the way specified by MA[20:17] of the selected set J. In this manner, the cache memory 102 operates as a direct-mapped cache when configured in skinny-DM mode. Flow ends at block 814.
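The way selection of block 814 reduces to extracting a four-bit field from the address, as this short sketch shows (the function name is hypothetical):

```python
def skinny_dm_way(ma):
    """Return the only way a line may be allocated into in skinny-DM mode.

    The way number is the four-bit field MA[20:17] of the memory address.
    """
    return (ma >> 17) & 0xF
```

Since every address maps to exactly one way of exactly one set, no replacement policy is consulted on a miss in this mode.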
As mentioned above, advantageously, transitions to or from skinny mode do not require the writeback and invalidate operation. However, it should be noted that there may be a slight penalty in terms of the replacement bit values (e.g., LRU or pseudo-LRU bits) for a short time after the transition. For example, when transitioning from skinny mode to normal mode, the replacement bits may not have the expected normal mode LRU values.
Referring now to
Referring now to
At block 1002, the cache memory 102 receives a load request from a processing core while the mode 108 input indicates skinny-8WAY mode. The load request includes a memory address 104. Flow proceeds to block 1004.
At block 1004, the cache memory 102 selects a single set, referred to in
At block 1006, the cache memory 102, for each entry in all 16 ways of the selected set J, compares all bits of the memory address 104 tag 112 except the LSB with all bits of the entry tag 106 except the LSB. The compare also checks to see if the entry is valid. Flow proceeds to decision block 1008.
At decision block 1008, the cache memory 102 determines whether the compare performed at block 1006 resulted in a valid match. If so, flow proceeds to block 1012; otherwise, flow proceeds to decision block 1013.
At block 1012, the cache memory 102 indicates a hit. Flow ends at block 1012.
At decision block 1013, the cache memory 102 examines bit MA[17]. If bit MA[17] is a binary one, flow proceeds to block 1016; otherwise, if MA[17] is a binary zero, flow proceeds to block 1014. As described above with respect to
At block 1014, the cache memory 102 allocates an entry in any of the even-numbered ways in the selected set. Preferably, the cache memory 102 allocates an entry in the selected even-numbered way according to a least-recently-used (LRU) or a pseudo-LRU replacement scheme, although other replacement algorithms may be employed, such as random or round-robin. Flow ends at block 1014.
At block 1016, the cache memory 102 allocates an entry in any of the odd-numbered ways in the selected set. Preferably, the cache memory 102 allocates an entry in the selected odd-numbered way according to a least-recently-used (LRU) or a pseudo-LRU replacement scheme, although other replacement algorithms may be employed, such as random or round-robin. Flow ends at block 1016.
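The allocation decision of blocks 1013 through 1016 can be sketched as follows, assuming a 16-way set and a simple list-based LRU ordering. Both function names are hypothetical.

```python
def skinny_8way_candidates(ma, num_ways=16):
    """Return the ways eligible for allocation in skinny-8WAY mode.

    MA[17] == 0 selects the even-numbered ways (block 1014);
    MA[17] == 1 selects the odd-numbered ways (block 1016).
    """
    parity = (ma >> 17) & 1
    return [way for way in range(num_ways) if (way & 1) == parity]


def choose_victim(candidates, lru_order):
    """Return the least recently used way among the candidates.

    lru_order lists way numbers, least recently used first.
    """
    for way in lru_order:
        if way in candidates:
            return way
    return None
```

Lookups still compare against all 16 ways; only the choice of victim on a miss is restricted to the eight candidate ways.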
It should be understood that although two skinny mode embodiments have been described, i.e., skinny direct-mapped mode and skinny 8-way mode, these are described to illustrate skinny mode, which is not limited to these embodiments. With respect to the illustrative embodiment of
Skinny mode may be beneficial for certain pathological programs that make very poor use of an LRU or pseudo-LRU replacement policy. For example, assume the program is marching through memory and has a pathological aliasing effect such that frequently when a load is requested it misses in the cache memory 102 and kicks out the very next line the program is going to need. However, when the effective associativity of the cache memory 102 is reduced by a transition to skinny mode, the problem is avoided.
For example, the program may be accessing a very large data structure in memory in which the lower half aliases into the upper half in the sets of the cache memory 102. However, the lower half and the upper half have different usage patterns that make LRU replacement ineffective. By reducing the effective associativity of the cache memory 102 via skinny-8WAY mode, half the data structure is effectively insulated from the other half within the cache memory 102. This type of pathological case may be determined using offline analysis of the program, which may be used to reconfigure the cache memory 102, such as described below with respect to
For another example, assume the program is accessing two data sets that alias into the same set of the cache memory 102 because their addresses are identical except for differences in higher order bits of the tag 112. In this case, it may be beneficial to insulate the replacement policy of one of the data sets from the other. This may be achieved by using bits of the tag 112 that correspond to the higher order bits of the tag 112 that differ among the two data sets to generate the bits used to limit the subset of ways to be selected for replacement. This may be achieved, for example, using the methods described below with respect to
Referring now to
The cache memory 102 also includes two ports 1104, denoted port A 1104A and port B 1104B. Each port 1104 is coupled to each bank 1106. Each port 1104 receives the mode 108 as an input.
The cache memory 102 also includes two tag pipelines 1102, denoted tag pipeline A 1102A and tag pipeline B 1102B. Tag pipeline A 1102A accesses the banks 1106 through port A 1104A, and tag pipeline B 1102B accesses the banks 1106 through port B 1104B. Each tag pipeline 1102 receives the mode 108 as an input. The selection, or enablement, of the banks 1106 for set selection in the various modes is described in more detail below with respect to
Port A 1104A and port B 1104B can both be active at the same time as long as they are not both selecting the same bank 1106. This effectively provides a dual-ported cache memory 102 from four single-ported banks 1106. Preferably, arbitration logic of the cache memory 102 attempts to select arbitrating requests from the two tag pipelines 1102 that access non-conflicting banks 1106, particularly when the cache memory 102 is in fat mode.
Referring now to
The bank enable logic 1200A includes a first inverter 1204-0 that receives MA[7] 104-A, a second inverter 1208-0 that receives MA[6] 104-A, a first OR gate 1202-0 that receives the output of the first inverter 1204-0 and a fat mode indicator 1209, and a first AND gate 1206-0 that receives the output of the first OR gate 1202-0 and the output of the second inverter 1208-0 to generate EN0A 1212-0A, which is the bank 0 1106-0 enable for port A 1104A.
The bank enable logic 1200A also includes a third inverter 1204-1 that receives MA[7] 104-A, a second OR gate 1202-1 that receives the output of the third inverter 1204-1 and the fat mode indicator 1209, and a second AND gate 1206-1 that receives the output of the second OR gate 1202-1 and MA[6] 104-A to generate EN1A 1212-1A, which is the bank 1 1106-1 enable for port A 1104A.
The bank enable logic 1200A also includes a fourth inverter 1208-2 that receives MA[6] 104-A, a third OR gate 1202-2 that receives MA[7] 104-A and the fat mode indicator 1209, and a third AND gate 1206-2 that receives the output of the third OR gate 1202-2 and the output of the fourth inverter 1208-2 to generate EN2A 1212-2A, which is the bank 2 1106-2 enable for port A 1104A.
The bank enable logic 1200A also includes a fourth OR gate 1202-3 that receives MA[7] 104-A and the fat mode indicator 1209, and a fourth AND gate 1206-3 that receives the output of the fourth OR gate 1202-3 and MA[6] 104-A to generate EN3A 1212-3A, which is the bank 3 1106-3 enable for port A 1104A.
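The four enable equations just described can be written out directly. The sketch below follows the gate description literally, using address bits MA[7] and MA[6] and the fat mode indicator; the function name is hypothetical.

```python
def bank_enables(ma, fat_mode):
    """Return [EN0A, EN1A, EN2A, EN3A] for port A as booleans.

    Each enable is (MA[7] or its complement, ORed with the fat mode
    indicator) ANDed with (MA[6] or its complement).
    """
    ma7 = bool((ma >> 7) & 1)
    ma6 = bool((ma >> 6) & 1)
    fat = bool(fat_mode)
    en0 = ((not ma7) or fat) and (not ma6)
    en1 = ((not ma7) or fat) and ma6
    en2 = (ma7 or fat) and (not ma6)
    en3 = (ma7 or fat) and ma6
    return [en0, en1, en2, en3]
```

With the fat mode indicator false, exactly one bank is enabled; with it true, the OR gates make MA[7] a don't-care, so two banks are enabled per access.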
Referring to
Referring now to
The hit generation logic 1300 includes a comparator 1304 that receives the tag 106 and MA[35:16] 104. The comparator 1304 also receives the fat mode indicator 1209 of
The hit generation logic 1300 also includes a first OR gate 1312-J that receives the set J way x hit signal 1308-Jx for each way of set J, where x is the way number, namely for 16 different ways, denoted 0 through 15 in
The hit generation logic 1300 also includes a second OR gate 1312-K that receives the set K way x hit signal 1308-Kx for each of the 16 ways of set K. Set K is the second set selected when in fat mode, e.g., the set selected by 1:MA[15:6], according to block 504 of
The hit generation logic 1300 also includes an OR gate 1316 that receives the set J hit signal 1314-J and the set K hit signal 1314-K to generate a fat mode hit signal 1318. The hit generation logic 1300 also includes a mux 1322 that receives the set J hit signal 1314-J and the fat mode hit signal 1318 and selects the former if the fat mode signal 1209 is false and the latter otherwise for provision on its output hit signal 1324 that indicates whether a hit in the cache memory 102 has occurred, such as at blocks 312, 512, 812 and 1012 of
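A behavioral sketch of the hit generation logic follows. It assumes the comparator matches the full tag in fat mode and all bits except the LSB otherwise, consistent with the lookup flows described earlier; sets are modeled as lists of (tag, valid) pairs, and the function names are hypothetical.

```python
def way_hit(entry_tag, entry_valid, ma_tag, fat_mode):
    """Model one per-way comparator: a valid entry whose tag matches."""
    if not entry_valid:
        return False
    if fat_mode:
        return entry_tag == ma_tag  # full tag compare
    return (entry_tag >> 1) == (ma_tag >> 1)  # ignore the tag LSB


def cache_hit(set_j, set_k, ma_tag, fat_mode):
    """Model the OR gates and the final mux.

    set_j, set_k: lists of (tag, valid) pairs, one per way.
    """
    set_j_hit = any(way_hit(t, v, ma_tag, fat_mode) for t, v in set_j)
    set_k_hit = any(way_hit(t, v, ma_tag, fat_mode) for t, v in set_k)
    fat_mode_hit = set_j_hit or set_k_hit
    # Mux: the set J hit alone when not in fat mode, else the combined hit.
    return fat_mode_hit if fat_mode else set_j_hit
```

Outside fat mode the set K result is ignored, so a matching tag that happens to reside in set K does not produce a hit.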
Referring now to
At block 1402, the system detects that a new process, or program, is running. In one embodiment, system software running on the processor 100 detects the new process, e.g., a device driver monitors the operating system process table. The system software may provide information to the processor that may be used by the processor to detect that the program has entered each of different phases, such as described below with respect to
At block 1404, the cache memory 102 is transitioned, e.g., via the mode indicator 108, to a new mode previously determined to be a best-performing mode for the program or phase based on offline analysis of the process that was detected at block 1402. In one embodiment, microcode of the processor changes the mode 108 of the cache memory 102. If the cache memory 102 is transitioning out of fat mode, all memory operations are stopped, the operation described with respect to
Referring now to
Referring now to
At block 1422, the phase detector 1414 of
At block 1424, the mode update unit 1416 looks up the identifier of the new phase received from the phase detector 1414 in the mode information 1418 (e.g., received from the device driver at block 1404 of
At block 1426, the processor executes the running program and generates memory accesses to the cache memory 102, in response to which the cache memory 102 operates according to the updated mode 108 as performed at block 1424. Flow ends at block 1426.
Referring now to
At block 1502, the processor detects that the cache memory 102 is performing ineffectively in its current mode. For example, performance counters may indicate that the cache memory 102 is experiencing a miss rate that exceeds a threshold. Flow proceeds to block 1504.
At block 1504, the cache memory 102 is transitioned to a new mode different than its current mode. In one embodiment, microcode of the processor changes the mode 108 of the cache memory 102. If the cache memory 102 is transitioning out of fat mode, all memory operations are stopped, the operation described with respect to
Referring now to
Similar to the embodiment of
Referring now to
Examples of the tag 1612 bits selected and the function performed on the selected N bits 1738 are as follows. For one example, the subset is specified by a predetermined bit of the memory address 104 to be either the 8 odd-numbered ways of the selected set, or the 8 even-numbered ways of the selected set. In one example, the predetermined bit is the least significant bit of the tag 1612. In other examples, the predetermined bit is generated using other methods. For example, the predetermined bit may be generated as a Boolean exclusive-OR (XOR) of multiple bits of the tag 1612. This may be particularly advantageous where cache lines are pathologically aliasing into the same set, such as discussed above. Other functions than XOR may also be used to condense multiple bits of the tag 1612 into a single bit, such as Boolean OR, Boolean AND, Boolean NOT, or various permutations thereof. For a second example, two or more bits of the tag 1612 are rotated a number of bits specified by the allocation mode 1608 with the result limiting the ways into which a cache line may be allocated to a subset of the total ways, e.g., from 16 to 4, 16 to 2, or 16 to 1 in the cases in which the N bits 1738 are 2, 3, or 4, respectively. Additionally, in the case where the N bits 1738 are 2, 3 or 4, each of the N bits 1738 may be separately generated by a Boolean function of the same or different bits of the tag 1612. Although specific embodiments are described, it should be understood that other embodiments are contemplated for the number and particular bits of the tag 1612 selected by the mux 1736, and other embodiments are contemplated for the particular functions 1732 performed on the selected N bits 1738 to select the subset of ways 1734.
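One way to picture the XOR-based example is the sketch below, in which each generated bit is the XOR of the tag bits chosen by a mask. The mapping from the generated bits to a way subset (ways whose low-order bits equal the generated value) is an assumption made for illustration; it reproduces the odd/even split for a single bit, but the description leaves the mapping for more bits open. All names are hypothetical.

```python
def xor_reduce(value):
    """Return the XOR of all bits of value (its parity)."""
    return bin(value).count("1") & 1


def way_subset(tag, bit_masks, num_ways=16):
    """Return the ways eligible for allocation.

    bit_masks: one mask per generated bit; generated bit i is the XOR
               of the tag bits selected by bit_masks[i].
    Assumed mapping: a way is eligible when its low-order
    len(bit_masks) bits equal the generated value, so one bit yields
    8 ways, two bits 4 ways, three bits 2 ways and four bits 1 way.
    """
    n = len(bit_masks)
    value = 0
    for i, mask in enumerate(bit_masks):
        value |= xor_reduce(tag & mask) << i
    return [way for way in range(num_ways) if (way & ((1 << n) - 1)) == value]
```

Choosing masks that cover the higher-order tag bits in which two aliasing data sets differ would steer each data set toward its own subset of ways.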
Referring now to
At block 1802, the cache memory 1602 receives a load request from a processing core while the allocation mode 1608 indicates a current allocation mode. The load request includes the memory address 104 of
At block 1804, the cache memory 1602 selects a single set, referred to in
At block 1806, the cache memory 1602, for each entry in all 16 ways of the selected set J, compares the memory address 104 tag 1612 with the entry tag 1606. The compare also checks to see if the entry is valid. Flow proceeds to decision block 1808.
At decision block 1808, the cache memory 1602 determines whether the compare performed at block 1806 resulted in a valid match. If so, flow proceeds to block 1812; otherwise, flow proceeds to block 1814.
At block 1812, the cache memory 1602 indicates a hit. Flow ends at block 1812.
At block 1814, the logic 1702 of
At block 1816, the cache memory 1602 allocates into any one way in the selected set J that is in the subset of ways determined at block 1814. Preferably, the cache memory 1602 allocates into a way in the subset that was least-recently-used (LRU) or pseudo-LRU, although other replacement algorithms may be employed, such as random or round-robin. Flow ends at block 1816.
Referring now to
At block 1902, the processor monitors the effectiveness of the cache memory 102 (e.g., the hit rate of the cache memory 102 over a most recent predetermined period) while operating in a current allocation mode 1608. Flow proceeds to decision block 1904.
At decision block 1904, the processor determines whether the effectiveness of the cache memory 102 is below a threshold. If so, flow proceeds to block 1906; otherwise, flow ends. Preferably, the threshold is programmable, e.g., by system software.
At block 1906, the processor updates the allocation mode 1608 of the cache memory 102 to a new allocation mode different than its current allocation mode. In one embodiment, microcode of the processor updates the allocation mode 1608 of the cache memory 102. Preferably, the processor (e.g., microcode) keeps track of the updates to the allocation mode 1608 that are made in this fashion in order to avoid thrashing among the allocation modes, such as in the case of a program and/or data set that lends itself to a high miss rate regardless of the mode. In one embodiment, all of the allocation modes are attempted as necessary. In other embodiments, a subset of the allocation modes is attempted. Advantageously, there is no writeback-invalidate penalty associated with transitions between the different allocation modes 1608. Flow returns from block 1906 to block 1902.
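The monitoring loop of blocks 1902 through 1906 could be approximated by the sketch below. The policy shown for avoiding thrashing (never revisit a mode already tried, and stay put once every mode has been tried) is an assumption made for illustration, not a policy stated in the description. The function and parameter names are hypothetical.

```python
def next_allocation_mode(hit_rate, threshold, current_mode, all_modes, tried):
    """Return the allocation mode to use for the next monitoring period.

    hit_rate:  measured effectiveness over the most recent period.
    threshold: programmable effectiveness threshold.
    tried:     set of modes already found ineffective (updated in place).
    """
    if hit_rate >= threshold:
        return current_mode  # block 1904: effective enough, no change
    tried.add(current_mode)
    for mode in all_modes:
        if mode not in tried:
            return mode  # block 1906: move to an untried mode
    return current_mode  # every mode tried; stop switching to avoid thrashing
```

A caller would invoke this once per monitoring period, passing the same `tried` set each time so that the history of attempted modes persists.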
The configuration of a cache memory mode in the various manners described herein, such as cache memory fat mode, skinny mode, or allocation by function of tag replacement bits, may be either by static configuration, by dynamic configuration or both. Generally speaking, the static configuration is pre-silicon. That is, the designers employ intuition, preferably aided by software simulation of the processor design, to determine good configurations, that is, configurations that potentially improve the performance of the processor in general, and of the cache memory in particular. Improving performance of the processor means increasing the speed at which the processor executes the program (e.g., reducing the clocks per instruction rate or increasing the instructions per clock rate) and/or reducing the power consumption. The programs may be operating systems, executable programs (e.g., applications, utilities, benchmarks), dynamic link libraries, and the like. The software simulation may be employed to perform offline analysis of the execution of programs for which it is desirable to improve performance of the processor, as described below with respect to
In contrast, the analysis to determine dynamic configuration is performed post-silicon, generally speaking. That is, after the processor is manufactured, the designers perform offline analysis of a different kind to determine how the processor performs when executing the programs with configurations different than the static, or default, configuration manufactured into silicon. The post-silicon testing may involve a more rigorous, perhaps more brute force, technique in which automated performance regression against a configuration matrix is performed, and then the regression performance data is analyzed, as described below with respect to
Regardless of whether the testing is pre-silicon or post-silicon, with the dynamic configuration testing, good configurations are determined on a per-program basis, or even on a per-program phase basis. Then, when the system, e.g., a device driver, detects a known program is running on the processor (i.e., a program for which the analysis has been performed and a good configuration is known), the system provides the good program-specific configuration to the processor, and the processor updates the cache memory mode with the program-specific configuration in a dynamic fashion while the processor is running. Preferably, the program-specific configuration includes different configurations for different phases of the program, and the processor detects the phase changes and dynamically updates the configuration in response with the phase-specific configuration, as described with respect to
A program phase, with respect to a given set of characteristics, is a subset of a computer program characterized by a consistent behavior among those characteristics. For example, if the relevant characteristics are branch prediction rate and cache hit rate, a phase of a program is a subset of the runtime behavior of the program in which the branch prediction rate and cache hit rate are consistent. For instance, offline analysis may determine that a particular data compression program has two phases: a dictionary construction phase and a dictionary lookup phase. The dictionary construction phase has a relatively low branch prediction rate and a relatively high cache hit rate, consistent with building a set of substrings common to a larger set of strings; whereas, the dictionary lookup phase has a relatively high branch prediction rate and a relatively low cache hit rate, consistent with looking up substrings in a dictionary larger than the size of the cache.
In one embodiment, offline analysis is performed using the notion of an “oracle cache,” which, as its name implies, knows the future. Given the limited amount of space in the cache memory, the oracle cache knows the most useful data that should be in the cache at any point in time. It may be conceptualized as a cycle-by-cycle or instruction-by-instruction snapshot of the contents of the cache that would produce the highest hit ratio.
First, one generates the sequence of oracle cache snapshots for a program execution and keeps track of the memory access that produced the allocation of each cache line in the snapshots. Then, on a subsequent execution instance of the program, the processor continually updates the cache mode using the information from the snapshots.
When it is impractical to update the cache mode on the granularity of a clock cycle or instruction, one examines the tendencies over much longer time durations, e.g., an entire program or program phase, e.g., by taking averages from the sequence of the program or phase.
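The oracle cache notion may be illustrated with a minimal sketch. The sketch is not taken from the embodiments; it assumes a fully associative cache holding `capacity` lines and uses the well-known furthest-future-use eviction rule as a stand-in for knowing the future. The function name `oracle_snapshots` and the trace format (a list of cache line addresses) are hypothetical.

```python
def oracle_snapshots(trace, capacity):
    """Return (snapshots, hits) for a trace of cache-line addresses.

    Assumption: fully associative cache. On a miss with a full cache,
    the resident line whose next use lies furthest in the future is evicted.
    """
    # For each access, find the index of the next access to the same line.
    next_use = [None] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    resident = {}    # line address -> index of its next use
    snapshots = []   # contents of the cache after each access
    hits = 0
    for i, line in enumerate(trace):
        if line in resident:
            hits += 1
        elif len(resident) >= capacity:
            victim = max(resident, key=resident.get)
            del resident[victim]
        resident[line] = next_use[i]
        snapshots.append(frozenset(resident))
    return snapshots, hits
```

Each element of `snapshots` corresponds to one access-by-access snapshot of the cache contents; tendencies over a program or phase could then be derived by averaging over the sequence.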
Broadly speaking, the idea of the oracle cache is that, because it knows all of the memory accesses in advance, it can pre-execute all of the memory accesses. Then as the program executes, the oracle cache predicts the best set of cache lines to be in the cache at any given point in time. For instance, in the graph of
Referring now to
At block 3402, the designer, preferably in an automated fashion, runs a program and records memory accesses to the cache memory, e.g., 102, 1602, made by the program. Preferably, the allocations, hits and evictions of cache lines are recorded. The memory address and time (e.g., relative clock cycle) of the memory accesses are recorded. Flow proceeds to block 3404.
At block 3404, the designer, preferably in an automated fashion, analyzes the information recorded at block 3402 at regular time intervals and recognizes clear trends to separate the program into phases, e.g., as described below with respect to
At block 3406, the designer, preferably in an automated fashion, creates configurations for the different program phases based on the analysis performed at block 3404. For example, the configurations may be a cache memory mode. In one embodiment, the analysis to determine the configurations may include analysis similar to that described below with respect to
Referring now to
Below the graph is shown, at each of eight different regular time intervals, the total working set size. The time intervals may be correlated to basic block transfers as described below with respect to
Additionally, observations may be made about how long cache lines tend to be useful, such as average cache line lifetime. The average cache line lifetime is calculated as the sum of the lifetime (from allocation to eviction) of all the cache lines over the phase divided by the number of cache lines. This information can be used to influence the operating mode of the cache memory.
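The average cache line lifetime computation described above may be sketched as follows. The event format, a time-ordered list of `(kind, line_address, cycle)` tuples, and the function name are assumptions made for illustration; lines still resident at the end of the phase are simply ignored in this sketch.

```python
def average_line_lifetime(events):
    """events: time-ordered (kind, line_address, cycle) tuples,
    where kind is "alloc" or "evict" (hypothetical record format)."""
    alloc_time = {}
    lifetimes = []
    for kind, line, cycle in events:
        if kind == "alloc":
            alloc_time[line] = cycle
        elif kind == "evict" and line in alloc_time:
            # Lifetime runs from allocation to eviction.
            lifetimes.append(cycle - alloc_time.pop(line))
    # Sum of lifetimes divided by the number of cache lines observed.
    return sum(lifetimes) / len(lifetimes) if lifetimes else 0.0
```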
If the oracle cache constrains the number of cached lines to correspond to the intended number of sets and ways that are included in the cache memory, the accuracy of the cache mode and average lifetime observations may increase. Other indicators may also be gathered, such as cache line hits.
Referring now to
At block 3602, a program for which it is desirable to improve performance by the processor when executing the program is analyzed and broken down to generate state diagrams. The nodes of the state diagram are basic blocks of the program. Basic blocks are sequences of instructions between program control instructions (e.g., branches, jumps, calls, returns, etc.). Each edge in the state diagram is associated with a target basic block to which the edge leads and with state change information, which may become a phase identifier, as described in more detail below. A phase identifier may include the instruction pointer (IP), or program counter (PC), of a control transfer instruction, a target address of the control transfer instruction, and/or the call stack of a control transfer instruction. The call stack may include the return address and parameters of the call. The program phases are portions of the program that comprise one or more basic blocks. Flow proceeds to block 3604.
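The breakdown into basic blocks may be illustrated with a simplified sketch. The instruction format, a list of `(address, mnemonic)` pairs, and the mnemonic set `CONTROL` are hypothetical. As a simplification, the sketch ends a block only after a control instruction; a fuller analysis would also begin a new block at every branch target.

```python
CONTROL = {"jmp", "jcc", "call", "ret"}  # hypothetical control mnemonics

def split_basic_blocks(instructions):
    """instructions: list of (address, mnemonic) pairs in program order.
    Returns a list of basic blocks, each a list of such pairs."""
    blocks, current = [], []
    for addr, op in instructions:
        current.append((addr, op))
        if op in CONTROL:
            # A control instruction terminates the current basic block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks
```

The address of the last instruction in each block is then a candidate for the IP/PC component of a phase identifier.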
At block 3604, the program is instrumented to analyze characteristics related to configurable aspects of the processor such as cache memory configuration modes. Examples of the characteristics include cache hit rate, branch prediction accuracy, working set size, average cache line lifetime, and cache pollution (e.g., the number of cache lines prefetched but never used). Flow proceeds to block 3606.
At block 3606, the program is executed with a given configuration, e.g., of cache memory and/or prefetcher, and phases of the program are identified by observing steady state behavior in the analyzed characteristics of block 3604. For example, assume cache hit rate is the analyzed characteristic of interest, and assume the cache hit rate changes from 97% to 40%. The cache hit rate change tends to indicate that the cache memory configuration was good for the program prior to the change and not good for the program after the change. Thus, the sequence of basic blocks prior to the cache hit rate change may be identified as one phase and the sequence of basic blocks after the cache hit rate change may be identified as a second phase. For another example, if working set size is the analyzed characteristic of interest, significantly large shifts in working set size may signal a desirable location in the program to identify a phase change. Flow proceeds to block 3608.
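The identification of phases from a shift in an analyzed characteristic may be sketched as follows. The per-interval sampling of the hit rate, the function name, and the `threshold` value are assumptions for illustration only; the embodiments do not specify a particular threshold, and a gradual drift spread over many intervals would not be caught by this simple comparison of adjacent intervals.

```python
def find_phase_boundaries(hit_rates, threshold=0.2):
    """hit_rates: cache hit rate per sampling interval, each in [0, 1].
    Returns the indices of intervals at which a new phase appears to begin."""
    boundaries = []
    for i in range(1, len(hit_rates)):
        # A large jump between adjacent intervals suggests a phase change.
        if abs(hit_rates[i] - hit_rates[i - 1]) >= threshold:
            boundaries.append(i)
    return boundaries
```

For the example above, a sequence of intervals near 97% followed by intervals near 40% yields a single boundary at the first low interval.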
At block 3608, once the phases are identified, good configurations, or configuration values, are determined for each phase. For example, various offline analysis techniques may be used, such as the method described above with respect to
At block 3612, phase identifiers are correlated to the phase changes. The state change information, or potential phase identifiers, of the basic block transition described above at which a change in the analyzed characteristic occurred are recorded along with the good configuration values determined at block 3608 for the program so the information may be provided to the processor when it is detected, e.g., by a device driver, that the analyzed program is about to run. Flow proceeds to block 3614.
At block 3614, after receiving the information associated with the analyzed program, the processor loads the phase detectors 1414 with the phase identifiers 1412 of
Referring now to
At block 3702, for each program, or program phase, in a list of programs for which it is desirable to improve the performance of the processor, the method iterates through blocks 3704 through 3716 until a good configuration is determined (e.g., the best current configuration—see below—has not changed for a relatively long time) or resources have expired (e.g., time and/or computing resources). Flow proceeds to block 3704.
At block 3704, the current best configuration is set to a default configuration, e.g., a default mode of the cache memory, which in one embodiment is simply the configuration with which the processor is manufactured. Flow proceeds to block 3706.
At block 3706, for each configuration parameter, blocks 3708 through 3712 are performed. An example of a configuration parameter is a single configuration bit, e.g., that turns a feature on or off. Another example of a configuration parameter is a configuration field, e.g., mode 108. Flow proceeds to block 3708.
At block 3708, for each value of a reasonable set of values of the configuration parameter of block 3706, perform blocks 3712 through 3716. A reasonable set of values of the configuration parameter depends upon the size of the configuration parameter, the deemed importance of the parameter, and the amount of resources required to iterate through its values. For example, in the case of a single configuration bit, both values are within a reasonable set. For example, the method may try all possible values for any parameter having sixteen or fewer values. However, for relatively large fields, e.g., a 32-bit field, it may be infeasible to try all 2{circumflex over ( )}32 possible values. In this case, the designer may provide a reasonable set of values to the method. If the designer does not supply values and the number of possibilities is large, the method may iterate through blocks 3712 through 3716 with a reasonable number of random values of the parameter. Flow proceeds to block 3712.
At block 3712, the program, or program phase, is run with the current best configuration but modified by the next value of the parameter per block 3708, and the performance is measured. Flow proceeds to decision block 3714.
At decision block 3714, the method compares the performance measured at block 3712 with the current best performance and if the former is better, flow proceeds to block 3716; otherwise, flow returns to block 3712 to try the next value of the current parameter until all the reasonable values are tried, in which case flow returns to block 3708 to iterate on the next configuration parameter until all the configuration parameters are tried, in which case the method ends, yielding the current best configuration for the program, or program phase.
At block 3716, the method updates the current best configuration with the configuration tried at block 3712. Flow returns to block 3712 to try the next value of the current parameter until all the reasonable values are tried, in which case flow returns to block 3708 to iterate on the next configuration parameter until all the configuration parameters are tried, in which case the method ends, yielding the current best configuration for the program, or program phase.
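The iteration of blocks 3704 through 3716 may be sketched as a one-parameter-at-a-time search. The names `search_configuration`, `candidate_values` and `measure` are hypothetical, as is the representation of a configuration as a dictionary. The sketch makes a single pass over the parameters and, as a further assumption, measures the default configuration first to obtain the initial best performance.

```python
def search_configuration(default_config, candidate_values, measure):
    """default_config: dict mapping parameter name -> default value.
    candidate_values: dict mapping parameter name -> iterable of values to try.
    measure: callable(config) -> performance score, higher is better."""
    best = dict(default_config)            # block 3704: start from the default
    best_perf = measure(best)
    for param, values in candidate_values.items():   # block 3706
        for value in values:                          # block 3708
            trial = dict(best)
            trial[param] = value
            perf = measure(trial)                     # block 3712
            if perf > best_perf:                      # decision block 3714
                best, best_perf = trial, perf         # block 3716
    return best, best_perf
```

In practice `measure` would run the program, or program phase, on the processor or a simulator; the outer loop of block 3702 could repeat this pass until the best configuration stops changing or resources expire.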
It should be noted that a good configuration found using methods similar to those of
Referring now to
The processor 3900 also includes a memory subsystem 3928 that provides memory operands to the execution units 3926 and receives memory operands from the execution units 3926. The memory subsystem 3928 preferably includes one or more load units, one or more store units, load queues, store queues, a fill queue for requesting cache lines from memory, a snoop queue related to snooping of a memory bus with which the processor 3900 is in communication, a tablewalk engine, and other related functional units.
The processor 3900 also includes a cache memory 102 in communication with the memory subsystem 3928. Preferably, the cache memory 102 is similar to the cache memories described with respect to
The memory subsystem 3928 makes memory accesses of the cache memory 102 as described in the embodiments of
Although embodiments have been described with a particular configuration of number of ports and banks of the cache memory, it should be understood that other embodiments are contemplated in which different numbers of ports are included in the cache memory, and in which different numbers of banks are included, as well as non-banked configurations. In the present disclosure, including the claims, the notation 2{circumflex over ( )}N means 2 to the exponent N.
While various embodiments of the present invention have been described herein, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant computer arts that various changes in form and detail can be made therein without departing from the scope of the invention. For example, software can enable, for example, the function, fabrication, modeling, simulation, description and/or testing of the apparatus and methods described herein. This can be accomplished through the use of general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL, and so on, or other available programs. Such software can be disposed in any known computer usable medium such as magnetic tape, semiconductor, magnetic disk, or optical disc (e.g., CD-ROM, DVD-ROM, etc.), a network, wire line, wireless or other communications medium. Embodiments of the apparatus and method described herein may be included in a semiconductor intellectual property core, such as a processor core (e.g., embodied, or specified, in a HDL) and transformed to hardware in the production of integrated circuits. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Thus, the present invention should not be limited by any of the exemplary embodiments described herein, but should be defined only in accordance with the following claims and their equivalents. Specifically, the present invention may be implemented within a processor device that may be used in a general-purpose computer. Finally, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention as defined by the appended claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2014/003176 | 12/14/2014 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2016/097795 | 6/23/2016 | WO | A |
Number | Date | Country | |
---|---|---|---|
20160357664 A1 | Dec 2016 | US |