1. Field
The present disclosure relates generally to cache architecture in a computing system and, more specifically, to a method and apparatus for pushing data into a processor cache.
2. Description
The execution time of programs that have large code and/or data footprints is significantly affected by the overhead of retrieving data from the memory system. The memory overhead may substantially increase the total execution time. Modern processors typically implement prefetching in hardware in order to fetch data into the processor caches anticipatorily. Prefetching hardware associated with a processor tracks spatial and temporal patterns of memory accesses and issues anticipatory requests to system memory on behalf of the processor. This helps reduce the latency of a memory access when the program executing on the processor actually requires the data. For this disclosure, the word "data" refers to both instructions and traditional data. Due to the prefetch, the data can be found in a cache with a latency that is usually much smaller than the system memory access latency. Typically, such prefetching hardware is distributed with each processor. If some processors in a computing system (e.g., a digital signal processor (DSP)) lack prefetching hardware, those processors cannot perform hardware-based prefetches, resulting in a performance imbalance among the processors.
The features and advantages of the present disclosure will become apparent from the following detailed description.
An embodiment of the present invention comprises a method and apparatus for using a centralized pushing mechanism to push data into a processor cache. For example, a memory controller may be adapted to act as the centralized pushing mechanism to push data into a processor cache in either a single-processor or a multiple-processor computing system. The centralized pushing mechanism may comprise request prediction logic to predict a processor's requests for code/data based on that processor's memory access patterns. The centralized pushing mechanism may also comprise a prefetch data buffer to temporarily store the code/data that is predicted to be desired by a processor. Additionally, the centralized pushing mechanism may further comprise push logic to issue a push request and to actively push the code/data stored in the prefetch data buffer onto a system interconnecting bus. The target processor may accept the push request issued by the centralized pushing mechanism and claim the code/data from the system interconnecting bus. The target processor may either place the code/data into a cache of its own or discard the code/data, according to the state of the corresponding cache line(s) in its own cache and/or in the caches of other processors in the system. Moreover, the push request may cause changes to the states of the cache line(s) in all caches in the system to ensure cache coherency.
Reference in the specification to "one embodiment" or "an embodiment" of the present invention means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrase "in one embodiment" in various places throughout the specification are not necessarily all referring to the same embodiment.
A cache 120 may be associated with the processor 110. In one embodiment, the cache 120 may be integrated in the same integrated circuit as the processor. In another embodiment, the cache 120 may be physically separate from the processor. The cache 120 is arranged such that the processor may access code/data in the cache faster than in a memory 170 in the system 100. The cache 120 may comprise multiple levels (e.g., three levels); the processor's access latency to the first level is typically shorter than its latency to the second or third level, and its access latency to the second level is typically shorter than its latency to the third level.
The computing system 100 may be coupled with a chipset 140, which may comprise a memory controller 150.
The memory controller 150 may comprise push logic 152, a prefetch data buffer 154, and prefetch prediction logic 156. The prefetch prediction logic 156 may analyze the memory access patterns of the processor 110 (both temporally and spatially) and predict the processor's future data requests based on those patterns. Based on the prediction by the prefetch prediction logic, the data predicted to be desired by the processor may be moved from the memory 170 and temporarily stored in the prefetch data buffer 154. The push logic may issue a request to the processor to push the data from the prefetch data buffer 154 to the cache 120. A push request may be sent for each cache line of data to be pushed. If the processor 110 accepts the push request, the push logic 152 may put the data on the bus 130 so that the processor may claim the data from the bus; otherwise, the push logic 152 may retry issuing the push request to the processor.
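The disclosure does not specify how the prefetch prediction logic 156 forms its predictions. As one hypothetical illustration only, a simple stride-based predictor could decide when a cache line is worth staging in the prefetch data buffer 154; the following C sketch, with invented names such as predict_next_addr, shows the idea and is not taken from the disclosure:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-processor stride predictor: if two consecutive
 * accesses are separated by the same stride, predict the next address. */
typedef struct {
    uint64_t last_addr;   /* most recent access address observed  */
    int64_t  stride;      /* delta between the last two accesses  */
    bool     confident;   /* same stride seen twice in a row       */
} stride_predictor_t;

/* Record a memory access and return true (with *next filled in)
 * when staging a line in the prefetch data buffer looks worthwhile. */
static bool predict_next_addr(stride_predictor_t *p,
                              uint64_t addr, uint64_t *next)
{
    int64_t delta = (int64_t)(addr - p->last_addr);
    p->confident  = (delta != 0 && delta == p->stride);
    p->stride     = delta;
    p->last_addr  = addr;
    if (p->confident) {
        *next = addr + (uint64_t)delta;  /* candidate line to stage */
        return true;
    }
    return false;
}
```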
The computing system 100 may run a cache coherency protocol. In one embodiment, a 4-state cache coherency protocol, the MESI protocol, may be used. Under the MESI protocol, a cache line may be marked as one of four states: M (Modified), E (Exclusive), S (Shared), and I (Invalid). The M state of a cache line indicates that the cache line has been modified and that the underlying data (e.g., the corresponding data in the memory) is older than the cache line and thus no longer valid. The E state of a cache line indicates that the cache line is stored only in this cache and has not yet been changed by a write access. The S state of a cache line indicates that the cache line may be stored in other caches of the system. The I state of a cache line indicates that the cache line is invalid. In another embodiment, a 5-state cache coherency protocol, the MOESI protocol, may be used. The MOESI protocol has one more state, O (Owned), than the MESI protocol. However, an S state in the MOESI protocol differs from an S state in the MESI protocol. Under the MOESI protocol, a cache line in the S state may be stored in other caches of the system and may have been modified, so that it is not consistent with the underlying data in the memory. Such a cache line can have been modified by only one processor; it has the O state in that processor's cache and the S state in the other processors' caches. In the description that follows, the MOESI protocol will be used as an example cache coherency protocol. However, those skilled in the art will appreciate that the same principles can be applied to other cache coherency protocols such as the MESI and MSI (Modified, Shared, and Invalid) protocols.
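For reference, the states described above may be captured in code. The following C enumeration is a minimal sketch of the MOESI states (the MESI protocol uses the same set without O):

```c
/* Cache line states under the MOESI protocol described above. */
typedef enum {
    STATE_MODIFIED,   /* dirty; only valid copy in the system         */
    STATE_OWNED,      /* dirty; other caches may hold SHARED copies   */
    STATE_EXCLUSIVE,  /* clean; only cached copy in the system        */
    STATE_SHARED,     /* possibly one of several copies; under MOESI
                         it may be dirty if another cache is OWNED    */
    STATE_INVALID     /* holds no valid data                          */
} line_state_t;
```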
The bus 130 in the computing system may be a front side bus (FSB) or any other type of system interconnection bus. When the push logic 152 in the memory controller 150 puts data on the bus 130, it also includes a destination identification of the data ("target ID"). A processor (e.g., the processor 110) that is connected to the bus 130 and whose ID matches the target ID of the pushed data may claim the data from the bus. In one embodiment, the bus may have a "push" function, under which the address portion of a bus transaction may include a field indicating whether the "push" function is enabled (e.g., a value of "1" means enabled and a value of "0" means disabled); if the "push" function is enabled, a field or a portion of a field may be used to indicate the destination identification of the pushed data (the target ID). The bus with the "push" function may also provide a command (e.g., Write_Line) to perform cache line writes on the bus. Thus, when the "push" field is set during a Write_Line transaction, a processor on the bus will claim the transaction if the target ID provided with the transaction matches the processor's own ID. Once the transaction is claimed by the targeted processor, the push logic 152 of the memory controller 150 may provide data from the prefetch data buffer 154 to the cache 120.
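As a rough software model of the address phase just described, the sketch below encodes the "push" field and target ID and shows the claim check a snooping processor might perform; the structure and field names are illustrative assumptions, not taken from the disclosure:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical encoding of the address phase of a Write_Line bus
 * transaction carrying the "push" field and target ID described above. */
typedef struct {
    uint64_t line_addr;   /* cache-line-aligned address            */
    bool     push;        /* "1": push enabled, "0": ordinary write */
    uint8_t  target_id;   /* bus agent meant to claim the line     */
} write_line_req_t;

/* Each processor snoops the address phase and claims the transaction
 * only when "push" is set and the target ID matches its own ID. */
static bool should_claim(const write_line_req_t *req, uint8_t my_id)
{
    return req->push && req->target_id == my_id;
}
```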
When the processor 110 claims the pushed cache line from the bus 130, the processor may decide whether or not to place the cache line into the cache 120, so that cache coherency is not disrupted. The processor 110 needs to check whether the cache line is already present in the cache (i.e., whether the data is new to the cache). If the cache line is new to the cache 120, the processor may place the cache line into the cache; otherwise, the processor needs to further check the state of the cache line in the cache 120. If the cache line in the cache 120 is in the I state, the processor 110 may replace that cache line with the one claimed from the bus; otherwise, the processor 110 will discard the claimed cache line without writing it into the cache 120.
Although the above description illustrates a single-processor computing system, which may use a memory controller to push data into a processor cache, similar techniques may also be applied in a multiple-processor computing system.
In block 225, a decision whether the processor accepts the push request issued in block 220 may be made. The "push" field of the cache line write transaction may be set (i.e., the "push" function is enabled) and the target ID may be included in the transaction. This cache line write transaction with "push" may be claimed by the processor if the processor's own ID matches the target ID in the transaction. If the processor does not accept the push request, a retry may be initiated in block 230 so that the push request may be reissued in block 220. If the processor accepts the push request, a cache line of data to be pushed may be put on a bus, which connects the memory controller and the processor, as a write data transaction in block 235. The target ID may be included in the write data transaction. Here it is assumed that a write operation with "push" is executed as a split transaction having a request phase and a data phase. However, it is possible to have an interconnect that supports an immediate write operation with "push", where the push data is provided during or immediately after the address (request) phase.
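The request/retry flow of blocks 220 through 235 may be modeled as follows. This is a minimal sketch assuming hypothetical primitives issue_push_request and drive_data_phase; the stubs merely stand in for the actual bus signaling, which is implemented in hardware:

```c
#include <stdint.h>

typedef enum { PUSH_ACCEPTED, PUSH_RETRY } push_resp_t;

/* Stub: request phase of the split transaction (block 220). */
static push_resp_t issue_push_request(uint64_t addr, uint8_t target_id)
{
    (void)addr; (void)target_id;
    return PUSH_ACCEPTED;             /* accept everything, for illustration */
}

/* Stub: data phase placing the line on the bus (block 235). */
static void drive_data_phase(const void *line, uint64_t addr)
{
    (void)line; (void)addr;
}

/* Push one cache line as a split transaction: reissue the request
 * until the target accepts (block 230), then drive the data phase. */
static void push_line(const void *line, uint64_t addr, uint8_t target_id)
{
    while (issue_push_request(addr, target_id) == PUSH_RETRY)
        ;                             /* retry loop for block 230 */
    drive_data_phase(line, addr);
}
```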
In block 245, the cache of the processor may be checked to see if the claimed cache line is present. If the claimed cache line is new to the cache (i.e., not present in the cache), the claimed cache line is placed in the cache with its state set to E in block 260. If the claimed cache line is already present in the cache, the state of the cache line in the cache may be further checked. If the state is I (i.e., invalid), the cache line in the cache is replaced with the claimed cache line, whose state is set to E, in block 250. If the state of the cache line in the cache is M, O, E, or S (i.e., a hit for the processor), the claimed data may be discarded by the processor in block 255, without changing the state of the cache line in the cache.
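The decision in blocks 245 through 260 reduces to a small state check. Continuing the earlier sketches (and reusing line_state_t from the MOESI sketch above), it might be modeled as:

```c
/* Placement decision made by the target processor for a claimed line. */
typedef enum { PLACE_AS_E, REPLACE_AS_E, DISCARD_PUSH } push_action_t;

static push_action_t decide_push(bool present_in_cache, line_state_t state)
{
    if (!present_in_cache)
        return PLACE_AS_E;       /* block 260: new line enters as E   */
    if (state == STATE_INVALID)
        return REPLACE_AS_E;     /* block 250: replace I line, set E  */
    return DISCARD_PUSH;         /* block 255: hit (M/O/E/S), discard */
}
```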
Although a full cache line push is assumed in the above description, a person of ordinary skill in the art will appreciate that the disclosed techniques may be readily applied to any partial cache line push, with or without modifications.
The memory controller 150 may comprise push logic 152, a prefetch data buffer 154, and prefetch prediction logic 156. In the system 300, the prefetch prediction logic 156 may analyze the memory access patterns (both temporally and spatially) of all the processors, 110A through 110N, and may predict each processor's future data requests based on its memory access patterns. Based on such predictions, data that is likely to be requested by each processor may be moved from the memory 170 and temporarily stored in the prefetch data buffer 154. The push logic may issue a request to push the data from the prefetch data buffer 154 to a cache of the predicted requesting processor. One push request may be issued per cache line of data to be pushed. A push request including the identification of a target processor ("target ID") may be sent to all processors via the bus 130, but only the targeted processor, whose identification matches the target ID, needs to respond to the push request. If the targeted processor accepts the push request, the push logic 152 may put the cache line on the bus 130 so that the targeted processor may claim the cache line from the bus; otherwise, the push logic 152 may retry issuing the push request to the targeted processor. When multiple processors are collaborating with each other and performing the same task, the prefetch prediction logic may make a global prediction of what data is likely to be needed by all the processors. Based on such a global prediction, data that is likely to be needed by all the processors may be pushed to the caches of all the processors (e.g., the data is broadcast to all the processors) by the push logic 152.
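A broadcast based on such a global prediction could be modeled as one push request per processor ID, reusing the hypothetical push_line from the earlier sketch:

```c
/* Broadcast sketch: issue one push per collaborating processor ID. */
static void broadcast_push(const void *line, uint64_t addr,
                           const uint8_t *proc_ids, int nprocs)
{
    for (int i = 0; i < nprocs; i++)
        push_line(line, addr, proc_ids[i]);  /* per-target push request */
}
```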
Similar to what is described above for the single-processor system, analogous push operations may be performed in the computing system 300.
In block 420, a decision whether a targeted processor accepts the push request issued in block 416 may be made. The "push" field of the cache line write transaction may be set (i.e., the "push" function is enabled) and the target ID may be included in the transaction. This cache line write transaction with "push" may be claimed by the processor if the processor's own ID matches the target ID in the transaction. If the targeted processor does not accept the push request, a retry may be initiated in block 424 so that the push request may be reissued in block 416. If the targeted processor accepts the push request, the cache line of data to be pushed may be put on a bus, which connects the memory controller and the processor, as a write data transaction in block 428. Here it is assumed that a write operation with "push" is executed as a split transaction having a request phase and a data phase. However, it is possible to have an interconnect that supports an immediate write operation with "push", where the push data is provided during or immediately after the address (request) phase. Before deciding to place the claimed cache line into a cache of the targeted processor, measures need to be taken to ensure cache coherency among all caches of the targeted and non-targeted processors.
In block 436, the cache of the targeted processor may be checked to see if the pushed cache line claimed from the bus is present. If the claimed cache line is present in the cache, the state of the cache line in the cache may be further checked. If the state of the cache line is M, O, E, or S (i.e., a hit for the processor), the claimed cache line may be discarded by the targeted processor in block 440, and the state of the cache line in the cache remains unchanged. If the claimed cache line is new to the cache, or if it is present but in the I state, further actions are performed in block 444.
If the claimed cache line is new to the caches of all the non-targeted processors, the claimed cache line may be placed in the cache of the targeted processor with its state set to E in block 480.
If the claimed cache line is present in a non-targeted processor's cache in the E or S state, and none of the non-targeted processors holds the cache line in the M or O state, the claimed cache line may be used to replace the corresponding cache line in the targeted processor's cache, with the state of the newly placed cache line set to S, in block 452. In block 456, a cache line held in the E state in a non-targeted processor's cache is changed to the S state.
If the claimed cache line is present in the M or O state in a non-targeted processor's cache, at least one non-targeted processor cache has a more up-to-date version of the cache line than the memory. In this case, a request to retry the push request may be sent out in block 460. In block 464, the corresponding cache line in the M/O state may be written back from the non-targeted processor's cache to a buffer in the memory controller (e.g., the prefetch data buffer 154).
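Blocks 436 through 464 combine a check of the targeted cache with a snoop of the non-targeted caches. The following sketch (reusing line_state_t from the MOESI sketch above) collapses the combined snoop result into three cases and returns the action described in the text; the enum and function names are illustrative:

```c
/* Combined snoop result across all non-targeted caches. */
typedef enum { SNOOP_MISS, SNOOP_CLEAN, SNOOP_DIRTY } snoop_t;

typedef enum {
    MP_DISCARD,          /* block 440: valid hit in targeted cache    */
    MP_PLACE_E,          /* block 480: new to all non-targeted caches */
    MP_PLACE_S,          /* blocks 452/456: clean copies elsewhere    */
    MP_WRITEBACK_RETRY   /* blocks 460/464: dirty (M/O) copy found    */
} mp_action_t;

static mp_action_t decide_mp(bool hit_in_target, line_state_t target_state,
                             snoop_t others)
{
    /* Any valid hit in the targeted cache wins; discard the push. */
    if (hit_in_target && target_state != STATE_INVALID)
        return MP_DISCARD;

    switch (others) {
    case SNOOP_MISS:  return MP_PLACE_E;         /* only cached copy     */
    case SNOOP_CLEAN: return MP_PLACE_S;         /* also demote E to S   */
    default:          return MP_WRITEBACK_RETRY; /* write back, retry    */
    }
}
```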
Although a full cache line push is assumed in the above description, a person of ordinary skill in the art will appreciate that the disclosed techniques may be readily applied to any partial cache line push.
Although the systems described above use a bus to interconnect the processors and the memory controller, a centralized pushing mechanism may also be used in a computing system 600 in which the components are coupled by point-to-point connections.
The memories 620A and 620B both store data needed by the processors and any other devices included in the system 600. The IOH 650 provides an interface to input/output (I/O) devices in the system. The IOH may be coupled to a Peripheral Component Interconnect (PCI) bus 660. The I/O device 670 may be connected to the PCI bus. Although not shown, other devices may also be coupled to the PCI bus and the IOH.
The centralized pushing mechanism 630 may comprise push logic 632, a prefetch data buffer 634, and prefetch prediction logic 636. In the system 600, the prefetch prediction logic 636 may analyze the memory access patterns (both temporally and spatially) of all processing cores (e.g., 611A through 611M) in each processor (e.g., 610A and 610B), and may predict each processing core's future data requests based on its memory access patterns. Based on such predictions, data that is likely to be requested by each processing core may be moved from a memory (e.g., 620A or 620B) and temporarily stored in the prefetch data buffer 634. The push logic 632 may issue a request to push the data from the prefetch data buffer 634 to a cache of the predicted requesting processing core. One push request may be issued per cache line of data to be pushed. A push request including the identification of a target processing core ("target ID") may be sent to all processing cores via the point-to-point connections (e.g., 640A or 640B), but only the targeted processing core, whose identification matches the target ID, needs to respond to the push request. If the targeted processing core accepts the push request, the push logic 632 may put the cache line on the point-to-point connections, from which the targeted processing core may claim the cache line; otherwise, the push logic 632 may retry issuing the push request to the targeted processing core. When multiple processing cores are collaborating with each other and performing the same task, the prefetch prediction logic may make a global prediction of what data is likely to be needed by those processing cores. Based on such a global prediction, data that is likely to be needed by those processing cores may be pushed to their caches by the push logic 632. Although the centralized pushing mechanism 630 is shown as separate from the IOH 650, the centralized pushing mechanism may, in another embodiment, be integrated into the IOH 650.
Similar to what is described above for the bus-based systems, analogous push and cache coherency operations may be performed in the system 600.
Although an example embodiment of the disclosed techniques is described with reference to the diagrams above, persons of ordinary skill in the art will readily appreciate that many other ways of implementing the disclosed techniques may alternatively be used.
In the preceding description, various aspects of the present disclosure have been described. For purposes of explanation, specific numbers, systems and configurations were set forth in order to provide a thorough understanding of the present disclosure. However, it is apparent to one skilled in the art having the benefit of this disclosure that the present disclosure may be practiced without the specific details. In other instances, well-known features, components, or modules were omitted, simplified, combined, or split in order not to obscure the present disclosure.
The disclosed techniques may have various design representations or formats for simulation, emulation, and fabrication of a design. Data representing a design may represent the design in a number of manners. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language which essentially provides a computerized model of how the designed hardware is expected to perform. The hardware model may be stored in a storage medium such as a computer memory so that the model may be simulated using simulation software that applies a particular test suite to the hardware model to determine if it indeed functions as intended. In some embodiments, the simulation software is not recorded, captured, or contained in the medium.
Additionally, a circuit level model with logic and/or transistor gates may be produced at some stages of the design process. This model may be similarly simulated, sometimes by dedicated hardware simulators that form the model using programmable logic. This type of simulation, taken a degree further, may be an emulation technique. In any case, reconfigurable hardware is another embodiment that may involve a machine-readable medium storing a model employing the disclosed techniques.
Furthermore, most designs, at some stage, reach a level of data representing the physical placement of various devices in the hardware model. In the case where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be the data specifying the presence or absence of various features on different mask layers for masks used to produce the integrated circuit. Again, this data representing the integrated circuit embodies the techniques disclosed in that the circuitry or logic in the data can be simulated or fabricated to perform these techniques.
In any representation of the design, the data may be stored in any form of a computer readable medium or device (e.g., hard disk drive, floppy disk drive, read only memory (ROM), CD-ROM device, flash memory device, digital versatile disk (DVD), or other storage device). Embodiments of the disclosed techniques may also be considered to be implemented as a machine-readable storage medium storing bits describing the design or the particular part of the design. The storage medium may be sold in and of itself or used by others for further design or fabrication.
While this disclosure has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the illustrative embodiments, as well as other embodiments of the disclosure, which are apparent to persons skilled in the art to which the disclosure pertains are deemed to lie within the spirit and scope of the disclosure.