1. Field of the Invention
This invention relates to the field of data processing systems. More particularly, this invention relates to cache memory techniques in data processing systems.
2. Description of the Prior Art
It is known to use cache memory to increase the efficiency with which data is retrieved from a main memory of a data processing system. More frequently accessed data and/or instructions are stored in the cache, which, due to its size and physical characteristics, is more rapidly accessible than main memory. Cache tags are used to locate information corresponding to a given memory address in the cache. Known data processing systems have one or more levels of cache, which can be arranged hierarchically such that caches at successive levels of the hierarchy are sequentially accessed.
However, caches can account for a significant proportion of the power consumption of a data processing apparatus. For example, a level one (L1) cache may account for about fifty percent of a processor's power and the cache tag look-up of such an L1 cache could account for around forty percent of the power consumption of the cache itself. For set-associative caches, which comprise a plurality of cache arrays, as the number of cache arrays increases, the cache tag look-up power consumption increases. In fact, the cache tag look-up for an L1 cache can account for around twenty percent of a processor's total power consumption.
There are a number of known schemes to ameliorate the effects of the large power consumption of caches in data processing systems. One such known scheme is to use, in addition to a standard cache, a loop cache to store loops of program instructions. The loop cache is typically located in an alternative access pathway to the L1 cache. Loop caches can be used to reduce the power consumption of instruction caches but not data caches.
Loop caches can reduce L1 instruction cache power consumption by around forty percent so that overall the processor power consumption is reduced by around twenty percent.
Other known systems comprise filter caches, which can be used to reduce cache power consumption for both data caches and instruction caches. Filter caches are typically implemented as small level zero (L0) caches between the processor and the L1 cache. Because the filter cache sits at L0 of the cache hierarchy, it can adversely impact the processor's performance due to its high miss rate.
However, filter caches can still reduce overall processor power consumption.
In order to make data processing systems more efficient it is desirable to further reduce the power consumption of cache memory systems.
According to a first aspect the present invention provides an apparatus for processing data comprising:
a cache memory having a data storage array comprising a plurality of cache lines and a cache tag array providing an index of memory locations associated with data elements currently stored in said cache memory;
a cache controller coupled to said cache memory and responsive to a cache access to perform a cache lookup with reference to said cache tag array to establish whether a data element corresponding to a given memory address is currently stored in said cache memory and, if so, to identify a mapping between said given memory address and a corresponding cache storage location;
a location-specifying memory operable to store at least a portion of said mapping determined during said cache lookup;
wherein upon a subsequent cache access to said given memory address said cache controller is arranged to access said location-specifying memory and to use said stored mapping to access said data element corresponding to said given memory address in said data storage array of said cache memory instead of performing said cache lookup.
The present invention according to this first aspect recognises that provision of a location-specifying memory to store at least a portion of a mapping determined during a cache look-up can reduce the cache power consumption by improving the efficiency with which data is accessed. The stored mapping data can be used to perform subsequent accesses, which avoids the requirement to perform a power-hungry cache look-up involving a plurality of cache tag Random Access Memories (RAMs). Furthermore, storing the mapping data or a portion thereof means that the corresponding instruction or data itself need not be stored in the loop cache or filter cache but instead can be readily accessed in, for example, an L1 cache using the stored mapping data. In this way, the gate-count and power consumption of the cache memory system can be reduced. The location-specifying memory can thus be reduced in complexity and will be simpler to manufacture than a loop or filter cache that is required to store the full cached data or instruction.
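By way of illustration only, the following C sketch shows the general principle: the mapping (set and way) found by a tag look-up is memoised in a small location-specifying store and reused on a repeat access to the same address, so that the data RAM can be read directly. All names, sizes and interfaces in the sketch are assumptions for exposition and do not form part of the described apparatus.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 128
#define NUM_WAYS 4

typedef struct {
    bool     valid;
    uint32_t address;   /* memory address this entry refers to        */
    uint8_t  set;       /* index into the data RAM                    */
    uint8_t  way;       /* which cache way held the data              */
} loc_entry_t;

static loc_entry_t loc_mem;               /* single-entry example     */
extern uint32_t data_ram[NUM_WAYS][NUM_SETS];
extern bool     tag_lookup(uint32_t addr, uint8_t *set, uint8_t *way);

uint32_t cache_read(uint32_t addr)
{
    uint8_t set, way;

    if (loc_mem.valid && loc_mem.address == addr) {
        /* Hit in the location-specifying memory: read the data RAM
         * directly, with no parallel tag RAM look-up.                */
        return data_ram[loc_mem.way][loc_mem.set];
    }

    if (tag_lookup(addr, &set, &way)) {
        /* Normal (power-hungry) tag look-up; memoise the mapping.    */
        loc_mem = (loc_entry_t){ true, addr, set, way };
        return data_ram[way][set];
    }

    /* Miss handling (line fill from the next level) omitted.         */
    return 0;
}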
It will be appreciated that the location-specifying memory could be an integral part of the cache memory. However, in one embodiment, the data processing apparatus comprises a further cache memory and the further cache memory comprises the location-specifying memory.
Although the further cache memory could store at least a portion of the cache line data corresponding to the stored mapping data, in one embodiment the further cache memory stores the mapping data without storing corresponding cache line data and the data processing apparatus is configured to use the mapping data from the further cache memory to retrieve the information from the cache memory. This allows the storage capacity of the further cache to be more efficiently used by reducing the number of power-hungry cache tag look-ups yet obviating the need to replicate full cache lines.
Although the cache memory system can be configured such that the cache memory and the further cache memory are provided on alternative access paths on the same hierarchical level, in one embodiment the cache memory and the further cache memory form a cache hierarchy having a plurality of hierarchical levels and the further cache memory belongs to a lower hierarchical level than the cache memory. In one such embodiment the further cache memory is a filter cache and the cache memory is a level-one cache.
In one embodiment the further cache memory is a buffer memory. This is straightforward to implement.
In some embodiments of the invention the cache memory is an instruction cache and in other embodiments the cache memory is a data cache. In yet other embodiments the cache memory caches both instructions and data.
It will be appreciated that the further cache memory could comprise any type of cache memory. However, in one embodiment, the data processing apparatus comprises loop detection circuitry and the further cache memory is a loop cache.
The cache memory could be any type of cache memory such as a direct-mapped cache, but in one embodiment the cache memory is a set-associative cache memory having a plurality of cache ways.
In one such embodiment having a set-associative cache, the mapping comprises at least one of an index specifying a set of cache lines and cache way information corresponding to the given memory address. This information can be stored compactly yet enables the corresponding information to be readily and efficiently retrieved from the cache without the need to perform a cache-tag look up.
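As a purely illustrative sketch of how compactly such a mapping can be held, assuming a four-way cache with 128 sets (matching the example described later), the set index and the way identifier fit within a single nine-bit field; the field widths below are assumptions chosen only for exposition.

#include <stdint.h>

static inline uint16_t pack_mapping(uint8_t set, uint8_t way)
{
    /* 7 bits of set index plus 2 bits of way information.            */
    return (uint16_t)((set & 0x7F) | ((way & 0x3) << 7));
}

static inline void unpack_mapping(uint16_t m, uint8_t *set, uint8_t *way)
{
    *set = (uint8_t)(m & 0x7F);
    *way = (uint8_t)((m >> 7) & 0x3);
}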
In one embodiment the data processing apparatus comprises invalidation circuitry coupled to the cache memory and the location-specifying memory, wherein the invalidation circuitry is arranged to selectively invalidate mapping data stored in the location-specifying memory when a corresponding line of the cache memory is invalidated.
In an alternative embodiment the invalidation circuitry is configured to flush all of the mapping information from the location-specifying memory when at least one line of the cache memory is invalidated.
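The two invalidation policies above can be sketched, for illustration only and with assumed structure names and capacities, as follows.

#include <stdbool.h>
#include <stdint.h>

#define LSM_ENTRIES 16                    /* assumed capacity          */

typedef struct { bool valid; uint8_t set, way; } lsm_entry_t;
static lsm_entry_t lsm[LSM_ENTRIES];

/* Selective policy: drop only the mapping whose L1 line was invalidated. */
void lsm_invalidate_line(uint8_t set, uint8_t way)
{
    for (unsigned i = 0; i < LSM_ENTRIES; i++)
        if (lsm[i].valid && lsm[i].set == set && lsm[i].way == way)
            lsm[i].valid = false;
}

/* Flush-all policy: simpler circuitry, discards every stored mapping.    */
void lsm_flush_all(void)
{
    for (unsigned i = 0; i < LSM_ENTRIES; i++)
        lsm[i].valid = false;
}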
It will be appreciated that the mapping information could be stored in the location-specifying memory following any cache tag look-up. However, in one embodiment the data processing apparatus comprises a main memory and the mapping information is stored in the location-specifying memory in response to the information being retrieved from the main memory and stored in the cache. This further reduces the required number of cache tag look-ups relative to only storing the mapping data in response to a cache hit.
According to a second aspect, the present invention provides an apparatus for processing data comprising:
a pipelined processing circuit for executing program instructions including conditional branch instructions;
a cache memory;
loop detection circuitry responsive to memory addresses of instructions to detect program loops;
a buffer memory coupled to the cache memory and the loop detection circuitry, the buffer memory being arranged to store instruction data for at least a portion of one of the detected program loops;
branch prediction circuitry configured to generate branch prediction information providing a prediction of whether a given one of the conditional branch instructions will result in a change in program execution flow;
control circuitry coupled to the buffer memory and the branch prediction circuitry, the control circuitry arranged to control the buffer memory to store program instructions in dependence upon the branch prediction information.
The present invention according to this second aspect recognises that loading instructions corresponding to detected program loops into the buffer memory itself consumes power. Efficiency can therefore be improved by storing detected program loops only selectively, weighing the performance gain obtained by repeatedly accessing the loop instructions from the buffer memory (rather than from cache memory or main memory) against the power consumed in storing those instructions in the buffer memory. This is achieved by feeding branch prediction information from the branch prediction circuitry to the control circuitry that controls the buffer memory, such that only those instructions that are most likely to form part of a repeatedly iterated loop are in fact stored in the buffer memory.
In one embodiment the buffer memory is coupled to the loop detection circuitry, the buffer memory being arranged to store at least a portion of the program loop. This provides for more efficient repeated access to program instructions of a repeatedly iterated loop of program instructions.
In one embodiment the apparatus comprises branch target prediction circuitry for predicting a branch-target instruction address corresponding to the given conditional branch instruction. In one such embodiment the branch prediction information comprises the branch-target instruction address. This provides for timely identification of candidate loop instructions for storing in the buffer memory.
In one embodiment, the loop detection circuitry performs the detection of program loops by statically profiling a sequence of program instructions. In an alternative embodiment, the loop detection circuitry performs the detection of program loops by dynamically identifying small backwards branches during execution of program instructions. This scheme reliably identifies most program loops yet is straightforward to implement.
In one embodiment the branch prediction circuitry is configured to provide a likelihood value giving a likelihood that a predicted branch will be taken and wherein the control circuitry is responsive to the likelihood value to control storage of program instructions corresponding to the predicted branch.
In one embodiment the buffer memory is configured to store mapping information providing a mapping between a memory address and a storage location of one or more program instructions corresponding to the memory address in the cache memory.
In one such embodiment the buffer memory is configured to store the mapping information without storing the corresponding program instructions and wherein the data processing apparatus is configured to use the mapping information to retrieve the program instructions from the cache memory.
The above, and other objects, features and advantages of this invention will be apparent from the following detailed description of illustrative embodiments which is to be read in connection with the accompanying drawings.
The control circuitry 110 is responsive to a request from a data processor (not shown) to perform a cache access operation (read or write) by looking up the tag RAM 120 to determine whether data corresponding to a given memory address is stored in a corresponding location in the data RAM 130.
In the cache 100 of
The tag RAM 120 comprises four individual data arrays 122, 124, 126 and 128 having a one-to-one correspondence with the four data arrays of the data RAM 130 i.e. arrays 132, 134, 136 and 138. Since the cache 100 has four data arrays it is referred to as a “four-way” set associative cache. The 8 kilobyte cache 100 comprises a total of 512 16-byte cache lines and each data array of the data RAM 130 comprises 128 cache lines. The tag RAM 120 provides a mapping between an incoming memory address, in this case a 32-bit address, and a data storage location within the data RAM 130.
A processor (not shown) selects a particular set of cache lines using a "data RAM index" comprising a subset of the address bits of the 32-bit memory address. Within the selected set there are four cache lines, one in each data RAM array 132, 134, 136, 138, to which a given memory address could map. The control circuitry 110 uses a mapping algorithm to select one of the four cache lines within the set on a cache line fill.
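For illustration, assuming the bit assignments below (one possible choice for the example 8-kilobyte, four-way cache with 16-byte lines), a 32-bit address decomposes into a 4-bit line offset, a 7-bit data RAM index and a 21-bit tag.

#include <stdint.h>

#define LINE_BITS  4   /* 16-byte cache lines                          */
#define INDEX_BITS 7   /* 128 sets (512 lines / 4 ways)                */

static inline uint32_t line_offset(uint32_t addr)
{
    return addr & ((1u << LINE_BITS) - 1);                 /* bits [3:0]   */
}

static inline uint32_t data_ram_index(uint32_t addr)
{
    return (addr >> LINE_BITS) & ((1u << INDEX_BITS) - 1); /* bits [10:4]  */
}

static inline uint32_t tag_value(uint32_t addr)
{
    return addr >> (LINE_BITS + INDEX_BITS);               /* bits [31:11] */
}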
As shown in
In
During the look up of the tag RAM 120 of
As mentioned above, there are a number of known schemes for reducing cache power consumption. For example, a loop cache can be provided to store loops of frequently executed instructions or a filter cache can be provided as an L0 cache (between a processor and an L1 cache) to store data or instructions.
The present technique provides a way of reducing the power consumption associated with the parallel look up of the tag RAM 120 (see
The processor 210 performs data processing operations using the pipelined processing circuit 220, which comprises a fetch stage 222, a decode stage 224 and an execute stage 226. The processor 210 fetches instructions for execution from the main memory 270. Access to data/instructions stored in the main memory 270 is made more efficient by provision of the off-chip L1 instruction cache 262 and L1 data cache 264, which store copies of recently accessed information.
The memory access is hierarchical, i.e., the processor 210 first checks whether or not the information is stored in one of the L1 caches 262, 264 before attempting to retrieve that data from the main memory 270. An additional L0 cache may be provided on-chip (not shown). The loop cache 250 is at the same hierarchical level as the L1 instruction cache 262 and both of these caches are connected to the processor 210 via the multiplexer 280.
The loop cache 250 comprises loop detection circuitry 254 in the cache, which is responsive to branch instructions or memory addresses to detect program loops. The loop cache 250 stores sequences of instructions corresponding to detected program loops to speed up access to those instructions. The L1 data cache 264 is accessed via a different communication path from the loop cache 250 so the presence of the loop cache 250 should not adversely affect the data access time. The loop cache 250 consumes less power per access, is smaller and has fewer cache ways than the L1 instruction cache 262. In this embodiment, the loop detection circuitry 254 dynamically identifies program loops by detecting “small backwards branches (SBB)”. Note that the SBB loop detection scheme is unlikely to capture loops that contain internal branches. In alternative arrangements, the loop detection circuitry 254 identifies loops by statically profiling program code.
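A minimal sketch of SBB detection is given below, assuming an illustrative distance threshold that is not specified in the embodiments: a taken branch whose target lies a short distance behind the branch is treated as closing a loop.

#include <stdbool.h>
#include <stdint.h>

#define SBB_MAX_DISTANCE 64u   /* assumed maximum loop size in bytes   */

static inline bool is_small_backwards_branch(uint32_t branch_pc,
                                             uint32_t target_pc,
                                             bool taken)
{
    return taken &&
           target_pc < branch_pc &&
           (branch_pc - target_pc) <= SBB_MAX_DISTANCE;
}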
The loop cache 250 of
The branch prediction unit 230 determines whether or not a conditional branch in the instruction flow of a program being executed by the processor 210 is likely to be taken. The branch prediction unit 230 allows the processor 210 to fetch and execute program instructions without waiting for a branch to be resolved. This improves the throughput of instructions. Most pipelined processors, like the processor 210 of
The branch prediction unit 230 comprises branch target prediction circuitry 232, which is configured to determine the target of a conditional branch instruction or an unconditional jump before it is actually computed by the processor parsing the instruction itself. Effectively, the branch target prediction circuitry 232 predicts the outcome of the conditional branch (or unconditional branch). The branch prediction unit 230 is coupled to the loop cache 250 and in particular to the control circuitry 240. However, in alternative arrangements, a different communication path may be provided for supply of the branch prediction information to the loop cache.
The control circuitry 240 supplies branch prediction information from the branch prediction unit 230 to the loop cache 250, which uses this information to determine whether or not to load program instructions corresponding to a particular loop. Furthermore, in the example arrangement of
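Purely by way of illustration, and with an assumed interface to the branch prediction unit, the decision made by the control circuitry 240 might be sketched as follows: a detected loop is loaded into the loop cache only when the closing branch is predicted taken, i.e. when a further iteration is expected.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     predicted_taken;   /* from the branch prediction unit     */
    uint32_t predicted_target;  /* from the branch target prediction   */
} branch_prediction_t;

extern void loop_cache_fill(uint32_t start_pc, uint32_t end_pc);

void maybe_store_loop(uint32_t branch_pc, branch_prediction_t bp)
{
    /* Only spend the fill power if another iteration is expected and
     * the predicted target really is the start of the loop.           */
    if (bp.predicted_taken && bp.predicted_target < branch_pc)
        loop_cache_fill(bp.predicted_target, branch_pc);
}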
The filter caches 422, 426 are small by comparison to the respective L1 caches 432, 434 and their architecture means that look-ups in the filter caches 422, 426 consume considerably less power than performing an access to the corresponding L1 cache. However, these characteristics of the filter caches have the consequence that the filter caches 422, 426 have high miss rates. It follows that the reduced power consumption achievable by use of the filter caches is partially offset by the reduction in processor performance resulting from the high miss rates. Filter caches can reduce cache power consumption by around 58% whilst reducing processor performance by around 21%, so that the overall processor power consumption can still be reduced by around 29% relative to a system without the filter caches.
In the arrangement according to the present technique, the data storage capacity of the filter caches 422, 426 is efficiently used by storing mapping data that maps a memory address to a data storage location in the corresponding L1 cache 432, 434. This mapping data is used on subsequent accesses to that memory address (for as long as it remains stored in the filter and L1 cache). The mapping data stored by the filter caches 422, 426 of
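The storage saving can be illustrated with the following assumed entry layouts (sizes chosen to match the 16-byte-line, four-way example and not taken from the embodiments): a conventional filter cache entry replicates the whole cache line, whereas a mapping-only entry needs little more than the tag and the L1 way, the index being implied by the address bits.

#include <stdint.h>

typedef struct {            /* conventional filter cache entry         */
    uint32_t tag;
    uint8_t  line[16];      /* full copy of the L1 cache line          */
} conventional_entry_t;

typedef struct {            /* mapping-only entry (present technique)  */
    uint32_t tag;
    uint8_t  l1_way;        /* which L1 way already holds the line     */
} mapping_entry_t;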
If there is also a miss in the L1 cache at stage 530, then the process proceeds to stage 532 where the processor fetches the data from the main memory 270 and stores it in the L1 cache 262 and thereafter the process proceeds to stage 540. If, on the other hand, it is determined at stage 530 that the instruction is currently stored in the L1 cache then the process also proceeds directly to stage 540. At stage 540 the loop detection circuitry 254 determines whether the instruction corresponding to the transaction at stage 510 is associated with a Small Backwards Branch. If the requested instruction is not identified as corresponding to a Small Backwards Branch then the process proceeds directly to stage 560 where the requested instruction is retrieved from the L1 cache and returned to the data processor. However, if at stage 540 it is determined that the current instruction does in fact correspond to a Small Backwards Branch then the process proceeds from stage 540 to stage 550 (prior to progressing to stage 560).
At stage 550 the start and end addresses of the loop and the L1 cache mapping information for the instruction (cache way) are stored in the loop cache 250. For simplicity the flow diagram shows all of the L1 cache mapping information being copied into the loop cache at this stage. However, it is expected that a number of transactions 510 will be required to copy all the L1 cache mapping information for a loop into the loop cache 250. Accordingly, on a subsequent iteration of the loop, the instructions can be retrieved from the L1 cache based on the mapping data stored in the loop cache. The process of the flow chart of
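A hedged sketch of this progressive fill, using assumed sizes and a hypothetical 4-byte instruction width, is given below: each fetch within the detected loop records the L1 way found by the normal tag look-up, so that later iterations can address the L1 data RAM directly.

#include <stdbool.h>
#include <stdint.h>

#define LOOP_CACHE_ENTRIES 64   /* assumed maximum loop length          */

typedef struct {
    bool     valid;
    uint32_t start_pc, end_pc;   /* loop bounds from the SBB            */
    uint8_t  l1_way[LOOP_CACHE_ENTRIES];
    bool     way_known[LOOP_CACHE_ENTRIES];
} loop_cache_t;

static inline void record_mapping(loop_cache_t *lc, uint32_t pc, uint8_t way)
{
    if (!lc->valid || pc < lc->start_pc || pc > lc->end_pc)
        return;                               /* not inside the loop    */

    uint32_t slot = (pc - lc->start_pc) / 4;  /* assumed 4-byte insns   */
    if (slot < LOOP_CACHE_ENTRIES) {
        lc->l1_way[slot]    = way;            /* memoise the L1 way     */
        lc->way_known[slot] = true;
    }
}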
The process then proceeds to stage 624 where the mapping data for the newly detected loop are stored in the loop cache 250, the start and end addresses corresponding to the loop are set and the loop cache is marked as valid. Recall that according to the present technique the instructions per se need not be stored in the loop cache, but instead the start and end addresses corresponding to the instruction loop and the mapping data providing the mapping between the relevant instructions and locations in the L1 instruction cache 262 are stored in the loop cache 250.
Returning to stage 620, if it is decided that the instruction to be fetched does not in fact belong to a loop (i.e. does not correspond to a Small Backwards Branch) then the process proceeds to stage 630 where it is determined whether or not information stored in the L1 instruction cache 262 or indeed the loop cache 250 has been invalidated. If so then the invalidation of the L1 instruction cache or the loop cache is performed at stage 650 and the process then returns to stage 610 where a new address/instruction is fetched. However, if it is instead determined at stage 630 that there has been no invalidation of either the L1 instruction cache or the loop cache then the process proceeds to stage 640. At stage 640 it is determined whether the address of the instruction to be fetched is outside the start and end addresses of the loop of instructions currently stored by the loop cache 250.
If the instruction currently being fetched is outside the start and end addresses of the loop stored in the loop cache then the process proceeds to stage 650 where the loop cache is invalidated. This is because all iterations of the loop are judged to be complete when an instruction outside the loop is encountered. If, on the other hand, the address of the instruction to be fetched is contained between the start and end address of the loop cache (i.e. it is an instruction belonging to the loop) then the process returns to stage 610 where the next instruction is fetched.
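The range check and invalidation of stages 640 and 650 can be sketched, for illustration only and with assumed names, as follows: once a fetch falls outside the stored start and end addresses the loop is deemed finished and the stored mapping is discarded.

#include <stdbool.h>
#include <stdint.h>

typedef struct { bool valid; uint32_t start_pc, end_pc; } loop_bounds_t;

static inline void check_loop_exit(loop_bounds_t *lc, uint32_t fetch_pc)
{
    if (lc->valid && (fetch_pc < lc->start_pc || fetch_pc > lc->end_pc))
        lc->valid = false;      /* invalidate the loop cache            */
}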
The flow chart of
At stage 910 a memory-access transaction is received for processing by the pipelined processing circuitry of the processor 410 (see
If there is a hit in the appropriate filter cache 422, 426 at stage 920, the process proceeds to stage 930 where the data is returned from the filter cache to the processor 410. If, on the other hand, there is a filter cache miss at stage 920 then the appropriate L1 cache (either the instruction cache 432 or the data cache 434) is accessed. If there is a hit in the L1 cache at stage 940, then the process proceeds to stage 942, whereupon the cache way determined during the L1 cache look-up is copied into the corresponding filter cache 422 or 426. The process then proceeds to stage 930 where the data is returned from the L1 cache.
However, if there is a miss in the L1 cache at stage 940, the process proceeds to stage 950 whereupon the main memory 440 is accessed to retrieve the requested data or instruction. Once the information has been retrieved from the main memory at stage 950, it is copied into the L1 instruction cache 432 or the L1 data cache 434 at stage 960. The process then proceeds to stage 942 where the mapping information that was used at stage 960 to determine where in the L1 cache to store the data that was retrieved from main memory is written into the appropriate filter cache 422 or 426. In particular, the mapping information is written into the location-specifying memory 424 or 428 of the corresponding filter cache. Once the mapping information has been stored in the appropriate filter cache at stage 942, the process proceeds to stage 930 where the required data/instruction is returned to the processor 410.
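The overall flow of stages 910 to 960 can be summarised in the following illustrative sketch; the functions named here are assumptions standing in for the hardware described above, not part of the embodiments. A filter cache hit uses the stored mapping; a filter miss triggers an L1 tag look-up whose result is memoised; an L1 miss triggers a main memory fill whose placement is likewise memoised.

#include <stdbool.h>
#include <stdint.h>

extern bool     filter_lookup(uint32_t addr, uint8_t *set, uint8_t *way);
extern bool     l1_tag_lookup(uint32_t addr, uint8_t *set, uint8_t *way);
extern void     filter_store_mapping(uint32_t addr, uint8_t set, uint8_t way);
extern uint32_t l1_data_read(uint8_t set, uint8_t way, uint32_t addr);
extern void     l1_fill_from_memory(uint32_t addr, uint8_t *set, uint8_t *way);

uint32_t fetch(uint32_t addr)
{
    uint8_t set, way;

    if (filter_lookup(addr, &set, &way))            /* stage 920 hit    */
        return l1_data_read(set, way, addr);        /* stage 930        */

    if (l1_tag_lookup(addr, &set, &way)) {          /* stage 940 hit    */
        filter_store_mapping(addr, set, way);       /* stage 942        */
        return l1_data_read(set, way, addr);        /* stage 930        */
    }

    l1_fill_from_memory(addr, &set, &way);          /* stages 950/960   */
    filter_store_mapping(addr, set, way);           /* stage 942        */
    return l1_data_read(set, way, addr);            /* stage 930        */
}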
Note that in the process illustrated by
In the alternative filter cache invalidation scheme of
Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.