The technology of the disclosure relates to computing systems employing a cache system having at least two levels and a prefetcher for issuing speculative memory fetches.
Computer processors or processor cores (“processors”) can execute many instructions per second. If instructions and/or the data processed by the instructions (both referred to here as “data”) are not supplied to the processor when they are needed, the processor may be forced to sit idle while waiting for the needed data. For this reason, a memory system associated with the processor includes a small first-level (L1) cache memory to store copies of data currently being used or expected to be used by the processor. The L1 cache is located close to the processor so that data can be quickly read and written. When the processor requests data from a memory location for which there is no copy in the L1 cache, a memory request is made to a second-level (L2) cache memory. The L2 cache is typically larger and takes longer to access than the L1 cache. If the L2 cache has the data that satisfies the memory request, the data can be quickly provided to the L1 cache and the processor. On the other hand, if the L2 cache does not have a copy of the requested data, the processor is forced to wait while the memory system accesses a higher-level cache or external memory.
A prefetcher may be employed in a memory system cache to anticipate the data that the processor will need based on data that has been previously fetched from memory. In other words, based on the addresses of instructions and data that the processor previously fetched, the prefetcher can request, from a higher-level memory or external memory, the contents (e.g., a cache line) of the next memory addresses expected to be accessed. Ideally, a prefetcher can minimize or eliminate idle time in the processor by ensuring that the data needed by the processor is available in the L2 cache when it is requested by the L1 cache. However, correctly identifying the instructions and data that will be accessed next may be difficult or impossible. As a result, the prefetcher may guess incorrectly, fetching data that is never needed while the processor still sits idle waiting for the data it actually requires, which wastes bandwidth. For this reason, users sometimes disable prefetchers that guess wrong too frequently to provide any performance improvement.
Aspects disclosed herein include a multi-trained scalable prefetcher. Related methods and computer-readable media are also disclosed. Data stored in a cache may have been demand requested by a processor or speculatively prefetched based on previously requested data. An exemplary multi-trained scalable prefetcher generates speculative prefetch requests to memory addresses based on address offsets generated by various address offset generators. The multi-trained scalable prefetcher includes a best offset generator that keeps track of how successful each address offset in a set of address offsets would have been at prefetching data requested by a processor, based on addresses of data previously fetched by the processor. The best offset generator provides the best address offset to a prefetch generator. The multi-trained scalable prefetcher also includes at least one additional address offset generator that provides at least a second address offset to the prefetch generator. In one example, the multi-trained scalable prefetcher includes a second-best offset generator that determines a second-best address offset based on the set of address offsets.
In another aspect, the number of prefetch requests generated by the multi-trained scalable prefetcher may be scaled in response to an indication of an activity level on a data interface. For example, scaling the number of prefetch requests may include the prefetch generator not providing any prefetch requests to a prefetch request buffer if the level of activity on the data interface is too high. Alternatively, scaling may include increasing or decreasing the number of prefetch requests provided to the prefetch request buffer based on priority levels of the address offset generators and the indicated activity level. For example, prefetch requests based on address offsets from lower-priority address offset generators may be paused when the activity level on the data interface exceeds a threshold, while prefetch requests based on address offsets from higher-priority address offset generators continue to be provided to the prefetch request buffer. As an example, the indication of the activity level of the data interface may be a CBUSY signal.
In this regard, in one exemplary aspect, a memory system is provided. The memory system includes a prefetch generator configured to receive a first memory address of requested data, generate a first plurality of address offsets, and generate a first plurality of prefetch requests directed to one or more prefetch addresses, each of the one or more prefetch addresses based on the first memory address and a corresponding one of the first plurality of address offsets.
In another exemplary aspect, a method of generating prefetch requests to a prefetch buffer in a memory system is disclosed. The method includes receiving a first memory address of requested data, generating a first plurality of address offsets, and generating a first plurality of prefetch requests directed to one or more prefetch addresses, each of the one or more prefetch addresses based on the first memory address and a corresponding one of the first plurality of address offsets.
In another exemplary aspect, a non-transitory computer-readable medium is disclosed. The non-transitory computer-readable medium has stored thereon computer-executable instructions which, when executed by a processor, cause a memory system to receive a first memory address of requested data; generate a first plurality of address offsets; and generate a first plurality of prefetch requests directed to one or more prefetch addresses, each of the one or more prefetch addresses based on the first memory address and a corresponding one of the first plurality of address offsets.
Those skilled in the art will appreciate the scope of the present disclosure and realize additional aspects thereof after reading the following detailed description of the preferred embodiments in association with the accompanying drawing figures.
The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
A prefetcher generates prefetch requests for data that a processor will likely need in the near future. A prefetcher determines, based on a memory address of a fetch issued by a processor, one or more memory addresses likely to be fetched next. Accurate prefetching can ensure that the next data needed by the processor is available before it is requested, which reduces processor idle time and improves performance. Instructions executed by a processor are often stored in a predictable (e.g., sequential) order, and the instructions may also access data according to a predictable pattern of memory addresses. A prefetch request is directed to a prefetch address based on the most recently fetched memory address. One or more prefetch addresses may be determined by adding one or more address offsets to the fetched memory address. The algorithm employed in the prefetcher may determine an address offset, which is an offset from the address of the most recently fetched data to the address of data that should be prefetched according to a pattern, for example. The prefetch address can thus be determined speculatively by adding the address offset to the fetched memory address.
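For illustration only, the offset arithmetic described above might be sketched as follows in Python (the line size, the function name, and the choice to express offsets in units of cache lines are assumptions, not part of the disclosure):

```python
CACHE_LINE_BYTES = 64  # assumed cache line size for illustration

def prefetch_addresses(fetched_addr: int, offsets: list[int]) -> list[int]:
    """Compute speculative prefetch addresses by adding each candidate
    offset (in cache lines) to the cache line of the fetched address."""
    line = fetched_addr // CACHE_LINE_BYTES  # cache-line index of the fetch
    return [(line + off) * CACHE_LINE_BYTES for off in offsets]

# A fetch at 0x1000 with offsets {1, 2, 4} targets lines X+1, X+2, and X+4.
print([hex(a) for a in prefetch_addresses(0x1000, [1, 2, 4])])
```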
The best address offset 102 is determined as follows. The BO generator 100 receives request addresses 108, which are the memory addresses of data that has been retrieved by prefetch requests, and determines the memory address of the fetch that triggered each prefetch request. These memory addresses of recently requested data are stored in a recent request table 110. The memory addresses, referred to herein as recent requests 112, are provided to a learning circuit 114. The learning circuit 114 determines which address offset, of a predetermined set of address offsets, would have been most likely to cause a prefetch request to the new fetch memory address 104, based on the recent requests 112 in the recent request table 110.
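One way to picture the training performed by the learning circuit 114 is the following minimal sketch (the scoring scheme, the table size, and the class interface are assumptions; address units are taken to be consistent throughout):

```python
from collections import deque

class BestOffsetLearner:
    """Minimal sketch: credit each candidate offset whenever subtracting
    it from a newly fetched address lands on an entry in the recent
    request table, i.e., whenever that offset would have prefetched the
    fetch; the highest-scoring offset is the best offset."""

    def __init__(self, candidate_offsets, table_size=64):
        self.scores = {off: 0 for off in candidate_offsets}
        self.recent = deque(maxlen=table_size)  # recent request table

    def record_request(self, addr: int) -> None:
        self.recent.append(addr)

    def train(self, fetch_addr: int) -> None:
        # Credit every offset that would have predicted this fetch.
        for off in self.scores:
            if (fetch_addr - off) in self.recent:
                self.scores[off] += 1

    def best_offset(self) -> int:
        return max(self.scores, key=self.scores.get)
```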
The multi-trained scalable prefetcher 200 may generate the plurality of potential prefetch requests 202 to prefetch data from a higher-level cache or a memory into a lower-level cache closer to a processor. For example, the multi-trained scalable prefetcher 200 may be employed in a second level (L2) cache to increase the probability that data requested by the processor will be speculatively prefetched and available in the L2 cache for low latency access before it is requested by the processor. Having the requested data already available in this manner can reduce or eliminate a number of clock cycles in which the processor would otherwise be stalled while waiting for the data to be retrieved from the memory system.
For example, a processor may first submit a demand request for data in response to a read or write instruction executed in the processor. The request and an accessed memory address may first be received by a first-level (L1) cache. If the data corresponding to the accessed memory address is not found in the L1 cache, the L1 cache generates an L1 miss indication. In response to the miss indication, the request may be forwarded to an L2 cache to see if the requested data is stored there. If the multi-trained scalable prefetcher 200 has already prefetched the requested data and stored it in the L2 cache, an L2 hit indication is generated, and the data is provided to the processor. Alternatively, the requested data may not be stored in the L2 cache, so an L2 miss indication is generated. In response to the L2 miss, a request for data at the accessed memory address is sent to a higher level (e.g., L3) cache or a memory in the memory system.
The L2 cache may also keep track (e.g., using a “prefetch bit”) of whether data stored in the L2 cache was prefetched but has not yet been accessed by the processor. When the processor (or the L1 cache associated with the processor) issues a request for data and the L2 cache either generates an L2 miss or indicates a hit on a line whose prefetch bit is set, the multi-trained scalable prefetcher 200 is configured to generate a plurality of potential prefetch requests 202 based on the accessed memory address 204. The potential prefetch requests 202 are directed to addresses that may be storing data that the processor will request soon. For example, the processor may fetch an uninterrupted series of data or instructions at consecutive memory addresses. In such circumstances, soon after data at a cache line X is requested, the processor may request data at a cache line X+1.
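The trigger condition just described can be expressed compactly; the sketch below is illustrative only (the function name and boolean interface are assumed):

```python
def should_generate_prefetches(l2_hit: bool, prefetch_bit_set: bool) -> bool:
    """Generate potential prefetch requests on an L2 miss, or on an L2
    hit to a line that was prefetched but not yet demanded."""
    return (not l2_hit) or prefetch_bit_set
```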
In other situations, the processor may not consistently access data sequentially or at a consistent rate. In addition, the processor may branch to a non-consecutive memory address to access new instructions or data in non-consecutive memory locations. Therefore, it is difficult to accurately predict, with regularity, which memory addresses the processor will access next. For this reason, rather than sending out only one prefetch request to a first address in response to a request to an accessed memory address 204, a prefetch generator 210 may generate potential prefetch requests 202 based on address offsets from the plurality of offset generators 206, where each potential prefetch request 202 may be directed to a different memory address.
Each of the potential prefetch requests 202 is generated by adding an address offset to the accessed memory address 204 of the current request if the request results in a miss or in a hit with the prefetch bit set (e.g., in the L2 cache). In this regard, a plurality of prefetch requests are generated based on multiple address offsets, and the address offsets are generated by the offset generators 206, each of which employs a distinct respective algorithm.
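Putting these pieces together, the fan-out across offset generators might look like the following sketch (the generator interface, a name and a current_offset() method, is purely hypothetical; each request remembers its source so that priorities can be applied later):

```python
def generate_potential_requests(accessed_addr: int, offset_generators: list) -> list[dict]:
    """Emit one potential prefetch request per offset generator, each
    generator supplying an offset computed by its own distinct algorithm."""
    return [
        {"target": accessed_addr + gen.current_offset(), "source": gen.name}
        for gen in offset_generators
    ]
```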
A first example of the offset generators 206 is a trainable offset (TO) generator 212, which may correspond in some aspects to the BO generator 100 described above.
The memory addresses 216 of recently prefetched data are stored in a recent request table 218 in the TO generator 212 for use by the BO generator 206(A). The memory addresses 216 are generated in an address reconstructor 220 and provided to the recent request table 218. When the data fetched by a potential prefetch request 202 is received and stored (e.g., in the L2 cache), a fetched memory address 222 of the fetched data is provided to the address reconstructor 220. The address reconstructor 220 removes the address offset 214 from the fetched memory address 222 of the potential prefetch request 202 to determine the memory address 216 on which the potential prefetch request 202 was based. The memory addresses 216 may be a list of recently received accessed memory addresses 204. Reconstructing the memory address 216 depends on which address offset was used to generate the prefetch address of the potential prefetch request 202. In other words, the address reconstructor 220 needs information about which address offset 214 was used to generate the potential prefetch request 202 so that it can remove that address offset 214. As explained further below, not all of the generated address offsets 214 will result in prefetch requests issued to the memory system. Since multiple potential prefetch requests 202 may be based on the same accessed memory address 204, the same memory address 216 may be reconstructed multiple times.
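The reconstruction step is simply the inverse of the offset addition. One possible sketch, assuming the issued-request bookkeeping described later (the prefetch address information 232), follows:

```python
class AddressReconstructor:
    """Sketch of the reconstruction described above: record which offset
    produced each issued prefetch so it can be subtracted back out when
    the fetched data returns."""

    def __init__(self, recent_request_table):
        self.table = recent_request_table  # shared with the learner
        self.issued = {}                   # prefetch target -> offset used

    def note_issued(self, target_addr: int, offset: int) -> None:
        # Populated from the prefetch address information (232).
        self.issued[target_addr] = offset

    def on_fill(self, fetched_addr: int) -> None:
        offset = self.issued.pop(fetched_addr, None)
        if offset is not None:
            self.table.append(fetched_addr - offset)  # reconstructed base
```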
Another example of an address offset generator that may be one of the offset generators 206 employed in the multi-trained scalable prefetcher 200 is a second-best offset (2BO) generator 206(B). The 2BO generator 206(B) determines a second-best address offset 214(B), based on the recent request table 218, which is the next most likely, after the best address offset 214(A), to produce the accessed memory address 204 based on memory addresses 216 of recently prefetched data.
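Given a score table like the one in the learning sketch above, selecting the second-best offset is a small extension (illustrative only):

```python
def top_two_offsets(scores: dict[int, int]) -> tuple[int, int]:
    """Rank candidate offsets by descending score; the runner-up would
    serve as the second-best offset in this sketch."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0], ranked[1]
```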
In another example, the multi-trained scalable prefetcher 200 may include an offset generator 206(C), which generates an address offset corresponding to a next line in memory. The next line address offset 214(C) is added to the accessed memory address 204 of the newly received request, and the resulting address is used to generate a potential prefetch request 202 for the next cache line in the memory. The next line address offset may vary depending on a size of a cache line in a processor or memory system.
Still another example is a fourth offset generator 206(D), which generates an additional address offset 214(D) according to its own distinct algorithm.
The prefetch generator 210 receives the accessed memory address 204 and a plurality of address offsets 214 and generates potential prefetch requests 202 directed to one or more prefetch addresses, where each of the prefetch addresses is determined based on the accessed memory address 204 and one of the plurality of address offsets 214 generated by the offset generators 206(A)-206(D).
Continuing with this example, it can be expected that, even though the plurality of address offsets are generated by different respective algorithms, some of the potential prefetch requests 202 will be directed to the same prefetch address (i.e., two or more of the offset generators 206(A)-206(D) may independently generate prefetch requests that target the same actual memory address). In some situations, redundant prefetch requests (e.g., based on the same or similar address offsets from more than one of the offset generators 206) may be resolved or reduced to a single prefetch request. Alternatively, multiple redundant prefetch requests may be generated.
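Where redundant requests are reduced to one, the collapse might be performed as in this sketch (keeping, per target address, the request from the highest-priority generator; the priority mapping is an assumption tied to the scaling discussion below):

```python
def dedupe_requests(requests: list[dict], priority: dict[str, int]) -> list[dict]:
    """Collapse requests targeting the same address, keeping the one
    from the highest-priority source (lower rank = higher priority)."""
    best = {}
    for req in requests:
        kept = best.get(req["target"])
        if kept is None or priority[req["source"]] < priority[kept["source"]]:
            best[req["target"]] = req
    return list(best.values())
```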
Generating multiple prefetch requests to various memory locations, which may be non-overlapping, partially overlapping, or fully overlapping, increases the probability that data requested by the processor will be present in the L2 cache when a request is issued for such data. In this manner, the multi-trained scalable prefetcher 200 provides a performance improvement to the processor. It should be understood that the use of the multi-trained scalable prefetcher 200 is not limited to the example discussed above involving L2 caches and may extend to a prefetcher for a cache or memory at any level of a memory system. Though some or most of the data received due to the plurality of prefetches may never be accessed by the processor, the unneeded data can be easily purged and overwritten.
However, another consequence of generating multiple prefetch requests to several different addresses is an increase in congestion on the data interface over which the prefetch request buffer transmits the potential prefetch requests 202 to the memory system. Thus, at times it may be beneficial to stop sending potential prefetch requests 202 to the prefetch request buffer entirely or to scale the number of prefetch requests up or down (e.g., increase or decrease the number of potential prefetch requests 202 provided to the prefetch request buffer) according to a level of activity on the data interface (e.g., how busy the data interface is). The number of potential prefetch requests 202 provided to the prefetch request buffer is scaled by a prefetch scaling circuit 226. The prefetch scaling circuit 226 receives the activity level indicator 208, which indicates a level of activity on the data interface to the memory system, and determines how many of the potential prefetch requests 202 will actually be forwarded. The scaled prefetch requests 228 are the potential prefetch requests 202 that are forwarded by the prefetch scaling circuit 226 based on the activity level indicator 208.
The prefetch scaling circuit 226 may receive potential prefetch requests 202 directed to prefetch addresses that are based on the address offsets from all of the offset generators 206 but may send only some or none of them to the prefetch request buffer depending on the activity level indicator 208. The prefetch scaling circuit 226 determines which of the potential prefetch requests 202 should be sent when the activity level on the data interface is higher. In this regard, the prefetch scaling circuit 226 considers priorities 230 associated with each of the offset generators 206 and determines, based on those respective priorities, which of the potential prefetch requests 202 will be included in the scaled prefetch requests 228 provided to the prefetch request buffer.
In other words, in times of low activity on the data interface, the scaled prefetch requests 228 may include most or all of the potential prefetch requests 202, whereas, in times of high activity, the scaled prefetch requests 228 are reduced to a smaller number of potential prefetch requests 202 based on address offsets from higher-priority offset generators 206. In some cases, none of the potential prefetch requests 202 are sent to the prefetch request buffer. Thus, depending on the data interface activity level, the scaled prefetch requests 228 may include all, some, or none of the potential prefetch requests 202. Priorities 230 for the offset generators 206 may be received from an external source, preset in advance, or dynamically programmed in the prefetch scaling circuit 226. In the case of redundant requests to a same prefetch address, the request from a higher-priority address offset generator may be provided to the prefetch request buffer.
The data interface may also be referred to as a data bus or mesh. In one example, an activity level indicator is referred to as CBUSY. In one example, the activity level indicator 208 may be a two-bit binary signal having a total of four (4) possible states, each indicating a level of activity of the data interface at a given time. At each of the activity levels 0-3, the prefetch scaling circuit 226 determines which prefetch requests to forward to the prefetch request buffer, depending on the respective priority levels of the offset generators 206. For example, with the activity level indicator 208 at level 0, prefetch requests may be issued based on the address offsets from all the offset generators 206. In contrast, at level 3, only prefetch requests based on the highest priority offset generator 206 (e.g., BO generator 206(A)) are issued. In some examples, no prefetch requests are issued when the activity level indicator 208 is at the highest level (e.g., level 3).
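A minimal sketch of the level-to-priority mapping just described follows; the exact cutoffs, the priority ranks, and the option to drop everything at level 3 are assumptions layered on the behavior described above:

```python
def scale_requests(requests: list[dict], priority: dict[str, int],
                   cbusy_level: int, drop_all_at_max: bool = True) -> list[dict]:
    """Filter potential prefetch requests by generator priority as the
    two-bit activity level rises: level 0 admits all four generators,
    each higher level admits fewer, and level 3 may admit none."""
    if cbusy_level >= 3 and drop_all_at_max:
        return []
    cutoff = 4 - cbusy_level  # assumed mapping: admit ranks below the cutoff
    return [r for r in requests if priority[r["source"]] < cutoff]
```

With drop_all_at_max set to False, level 3 still admits the single highest-priority generator (rank 0), matching the alternative described above in which only the highest-priority offset generator continues to issue requests.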
The prefetch scaling circuit 226 provides prefetch address information 232 to the address reconstructor 220, indicating which of the potential prefetch requests 202 were sent to the prefetch request buffer so the address reconstructor 220 can determine which address offset to subtract from the fetched memory address 222.
There may be any number of address offset generators employed in the multi-trained scalable prefetcher 200. Thus, the number of activity levels indicated by the activity level indicator may not correspond to the number of address offsets generated. For purposes of controlling prefetch requests and/or for other reasons, a finer resolution of the activity level indication may be needed. In one exemplary aspect, the level indicated by the activity level indicator may be tracked over time, and such historical information, in conjunction with other environmental information in the processor or memory system, may be used to determine more activity levels (e.g., with higher resolution). Prefetch requests may be controlled according to the priorities of the address offset generators and the additional activity levels in any suitable manner.
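One plausible way to derive finer-grained levels from the indicator's history, as suggested above, is simple exponential smoothing (purely illustrative; the smoothing factor and downstream quantization are assumptions):

```python
class ActivityTracker:
    """Sketch: smooth the two-bit activity level over time so it can be
    quantized into more bands than the raw indicator provides."""

    def __init__(self, alpha: float = 0.125):
        self.alpha = alpha     # smoothing factor (assumed)
        self.smoothed = 0.0

    def update(self, cbusy_level: int) -> float:
        self.smoothed += self.alpha * (cbusy_level - self.smoothed)
        return self.smoothed   # e.g., quantize into eight bands downstream
```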
The system bus 510 may be busy with communications between other devices coupled to the system bus 510.
The computing system 502 may also be configured to access the display controller(s) 525 over the system bus 510 to control information sent to one or more displays 528. The display controller(s) 525 sends information to display(s) 528 to be displayed via one or more video processors 530, which process the information to be displayed into a format suitable for the display(s) 528. The display(s) 528 can include any type of display, including, but not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a plasma display, a light emitting diode (LED) display, etc.
The memory system 516 of the processor-based system 500 may include a set of computer-readable instructions 532 stored in a non-transitory computer-readable medium 535 (e.g., a memory).
While the non-transitory computer-readable medium 535 is shown in an exemplary embodiment to be a single medium, the term “computer-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database and/or associated caches and servers) that store the one or more sets of instructions. The term “computer-readable medium” shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the processing device and that cause the processing device to perform any one or more of the methodologies of the embodiments disclosed herein. The term “computer-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.
Those of skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the aspects disclosed herein may be implemented as electronic hardware, instructions stored in memory or in another computer-readable medium and executed by a processor or other processing device, or combinations of both. For example, the initiator and target devices described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip. A processor is a circuit that can include a microcontroller, a microprocessor, or other circuits that can execute software or firmware instructions. A controller is a circuit that can include a microcontroller, a microprocessor, and/or dedicated hardware circuits (e.g., a field programmable gate array (FPGA)) that do not necessarily execute software or firmware instructions. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends upon the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
The aspects disclosed herein may be embodied in hardware and in instructions that are stored in hardware and may reside, for example, in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of non-transitory computer-readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from and write information to the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
It is also noted that the operational steps described in any of the exemplary aspects herein are provided as examples and for discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary aspects may be combined. It is to be understood that the operational steps illustrated in the flowchart diagrams may be subject to numerous different modifications, as will be readily apparent to one of skill in the art. Those of skill in the art will also understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The present application claims priority to U.S. Provisional Patent Application Ser. No. 63/438,358, filed on Jan. 11, 2023, and entitled “MULTI-TRAINED SCALABLE PREFETCHER, AND RELATED METHODS,” which is hereby incorporated herein by reference in its entirety.