1. Field of Invention
The present invention relates generally to design structures, and more specifically, design structures for processing systems and circuits, and more particularly to caching data in a multiprocessor system.
Processor systems typically include caches to reduce latency associated with memory accesses. A cache is generally a smaller, faster memory (relative to a main memory) that is used to store copies of data from the most frequently used main memory locations. In operation, once a cache becomes full (or in the case of a set-associative cache, once a set becomes full), subsequent references to cacheable data (in a main memory) will typically result in eviction of data previously stored in the cache (or the set) in order to make room for storage of the newly referenced data in the cache (or the set). In conventional processor systems, the eviction of previously stored data from a cache typically occurs even if the newly referenced data is not important, e.g., the newly referenced data will not be referenced again in subsequent processor operations. Consequently, in such processor systems, if the evicted data is referenced in subsequent processor operations, cache misses will occur, which generally result in performance slowdowns of the processor system.
Frequent references to data that may only be used once in a processor operation lead to cache pollution, in which important data is evicted to make room for transient data. One approach to address the problem of cache pollution is to increase the size of the cache. This approach, however, results in increases in cost, power, and design complexity of a processor system. Another solution to the problem of cache pollution is to mark (or tag) transient data as being non-cacheable. Such a technique, however, requires prior identification of the areas in a main memory that store transient (or infrequently used) data. Also, such a rigid demarcation of data may not be possible in all cases.
In general, in one aspect, this specification describes a method for caching data in a multiprocessor system including a first processor and a second processor. The method includes generating a memory access request for data, in which the data is required for a processor operation associated with the first processor. The method further includes, responsive to the data not being cached within a first cache associated with the first processor, snooping a second cache associated with the second processor to determine whether the data has previously been cached in the second cache as a result of an access to that data from the first processor. Responsive to the data being cached within the second cache associated with the second processor, the method further includes passing the data from the second cache to the first processor.
In general, in one aspect, this specification describes a multiprocessor system including a first processor including a first cache associated therewith, a second processor including a second cache associated therewith, and a main memory to store data required by the first processor and the second processor. The main memory is controlled by a memory controller that is in communication with each of the first processor and the second processor through a bus, and the second cache associated with the second processor is operable to cache data from the main memory corresponding to a memory access request of the first processor.
In general, in one aspect, this specification describes a computer program product, tangibly stored on a computer readable medium, for caching data in a multiprocessor system, in which the multiprocessor system includes a first processor and a second processor. The computer program product comprises instructions to cause a programmable processor to monitor a cache miss rate of the first processor, and cache data requested by the second processor within a first cache associated with the first processor responsive to the cache miss rate of the first processor being low.
In another aspect, a design structure embodied in a machine readable storage medium for designing, manufacturing, and/or testing a design for caching data in a multiprocessor system is provided. The design structure includes a multiprocessor system, which includes a first processor including a first cache associated therewith, a second processor including a second cache associated therewith, and a main memory to store data required by the first processor and the second processor, the main memory being controlled by a memory controller that is in communication with each of the first processor and the second processor through a bus, wherein the second cache associated with the second processor is operable to cache data from the main memory corresponding to a memory access request of the first processor.
Implementations can provide one or more of the following advantages. The techniques for caching data in a multiprocessor system provide a way to extend the available caches in which data (required by a given processor in a multiprocessor system) may be stored. For example, in one implementation, unused portions of a cache associated with a first processor (in the multiprocessor system) are used to store data that is requested by a second processor. Further, the techniques described herein permit more aggressive software and hardware prefetches, in that data corresponding to a speculatively executed path can be cached within a cache of an adjacent processor to reduce cache pollution should the predicted path turn out to result from a mispredicted branch. This also provides a way to cache data for the alternate path. As another example where prefetching can be made more aggressive, the hardware prefetcher can be enhanced to recognize eviction of cache lines that are used later. In these cases, the hardware prefetcher can indicate that prefetch data should be stored in a cache associated with a different processor. Similarly, when there is a likelihood of cache pollution, software prefetches placed by a compiler can indicate via special instruction fields that the prefetched data should be placed in a cache associated with a different processor. In addition, the techniques are scalable according to the number of processors within a multiprocessor system. The techniques can also be used in conjunction with conventional techniques such as victim caches and cache snarfing to increase performance of a multiprocessor system. The implementation can be controlled by the operating system and hence be made transparent to user applications.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features and advantages will be apparent from the description and drawings.
Like reference symbols in the various drawings indicate like elements.
The present invention relates generally to processing systems and circuits and more particularly to caching data in a multiprocessor system. The following description is presented to enable one of ordinary skill in the art to make and use the invention and is provided in the context of a patent application and its requirements. The present invention is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features described herein.
The multiprocessor system 100 further includes a main memory 108 that stores data required by the processor 102 and the processor 104. The processor 102 includes a cache 110, and the processor 104 includes a cache 112. In one implementation, the cache 110 is operable to cache data (from the main memory 108) that is to be processed by the processor 102, as well as cache data that is to be processed by the processor 104. In like manner, (in one implementation) the cache 112 is operable to cache data that is to be processed by the processor 104, as well as cache data that is to be processed by the processor 102. The cache 110 and/or the cache 112 can be an L1 (level 1) cache, an L2 (level 2) cache, or a hierarchy of cache levels. In one implementation, the decision of whether to store data from main memory 108 within the cache 110 or the cache 112 is determined by a controller 114. In one implementation, the controller 114 is a cache coherency controller (e.g., in the North Bridge) operable to manage conflicts and maintain consistency between the caches 110, 112 and the main memory 108.
If, however, the data requested by the first processor is not cached in a cache associated with the first processor (i.e., there is a cache miss), then a determination is made (e.g., by controller 114) using conventional snooping mechanisms whether the data requested by the first processor is cached in a cache (e.g., cache 112) associated with a second processor (e.g., processor 104) (step 208). If the data requested by the first processor is cached in a cache associated with the second processor, then the memory access request is satisfied (step 210). The difference from conventional techniques is that the cache associated with the second processor might have data in it that the second processor did not request using a load instruction or prefetch. The memory access request can be satisfied by the cache (associated with the second processor) forwarding the data to the pipelines and/or register file of the first processor. In one implementation, the data stored in the cache associated with the second processor is moved or copied to the cache associated with the first processor. In such an implementation, an access threshold can be set (e.g., through the controller 114) that indicates the number of accesses of the data that is required prior to the data being moved from the cache associated with the second processor to the cache associated with the first processor. For example, if the access threshold is set at “1”, then the very first access of the data in the cache associated with the second processor will prompt the controller to move the data to the cache associated with the first processor. If in step 208 the data requested by the first processor is not cached in a cache associated with the second processor (or any other processor in the multiprocessor system), the data is retrieved from a main memory (e.g., main memory 108) (step 212).
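The lookup order described above (local cache, then a snoop of the other processor's cache, then main memory) can be illustrated with a minimal sketch. It assumes each cache is modeled as a plain dictionary keyed by address; the function name `access` and the returned source labels are hypothetical and serve only for illustration.

```python
def access(addr, local_cache, remote_cache, main_memory):
    """Return (data, source) for a request issued by the first processor.

    All three stores are modeled as dicts keyed by address (an assumption
    made for this sketch, not a description of the actual hardware).
    """
    if addr in local_cache:
        # Hit in the cache associated with the requesting processor.
        return local_cache[addr], "local"
    if addr in remote_cache:
        # Step 208: the snoop finds the data in the second processor's cache.
        # Step 210: the request is satisfied by forwarding that data.
        return remote_cache[addr], "remote"
    # Step 212: no cache holds the data, so it is retrieved from main memory.
    return main_memory[addr], "memory"
```

In this sketch the remote hit only forwards the data; moving or copying the line into the local cache is a separate policy decision governed by the access threshold.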
The data retrieved from the main memory is dynamically stored in a cache associated with the first processor or a cache associated with the second processor based on a type (or classification) of the memory access request (step 214). In one implementation, the data retrieved from the main memory is stored in a cache of a given processor based on a type of priority associated with the memory access request. For example, (in one implementation) data corresponding to low priority requests of the first processor is stored in a cache associated with a second processor. Accordingly, in this implementation, cache pollution of the first processor is avoided. A memory access request from a given processor can be set as a low priority request through a variety of suitable techniques. More generally, the memory access requests (from a given processor) can be classified (or assigned a type) in accordance with any pre-determined criteria.
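The placement decision of step 214 can be sketched as follows. This is a simplified illustration that assumes only two priority classes, represented by the hypothetical strings "low" and "normal", and caches modeled as dictionaries; the function name `place` is likewise an assumption.

```python
def place(addr, data, priority, own_cache, other_cache):
    """Store data fetched from main memory according to request priority.

    Low priority data goes to the other processor's cache so that it does
    not pollute the requesting processor's own cache. The two priority
    labels used here are hypothetical.
    """
    target = other_cache if priority == "low" else own_cache
    target[addr] = data
    return target
```

For example, `place(0x40, "x", "low", own, other)` leaves `own` unchanged and stores the line in `other`.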
In one implementation, a (software) compiler examines code and/or an execution profile to determine whether software prefetch (cache or stream touch) instructions will benefit from particular prefetch requests being designated as low priority requests - e.g., the compiler can designate a prefetch request as a low priority request if the returned data is not likely to be used again by the processor in a subsequent processor operation or if the returned data will likely cause cache pollution. In one implementation, the compiler sets bits in a software prefetch instruction, which indicate that the returned data (or line) should be placed in a cache associated with another processor (e.g., an L2 cache of an adjacent processor). The returned data can be directed to the cache associated with the other processor by the controller 114 (
In one implementation, hardware prefetch logic associated with a given processor is designed to recognize when data (associated with a prefetch request) returned from main memory evicts important data from a cache. The recognition of the eviction of important data can serve as a trigger for the hardware prefetch logic to set bits to designate subsequent prefetch requests as low priority requests. Thus, returned data associated with the subsequent prefetch requests will be placed in a cache associated with another processor. In one implementation, speculatively executed prefetches and memory accesses (e.g., as a result of a branch prediction) are designated as low priority requests. Such a designation prevents cache pollution in the case of incorrectly speculated executions which are not cancelled prior to data being returned from a main memory. Thus, data corresponding to an alternate path (i.e., a path that is eventually determined to have been incorrectly predicted) can be cached (in the second processor's cache). Such caching of data corresponding to the alternate path can, in some cases, reduce data access times on a subsequent visit to the branch, if the alternate path is taken at that time.
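One possible way to model the trigger described above is sketched below. It assumes that "important" data is recognized after the fact: a line evicted by a prefetch fill is treated as important if a later demand access misses on it. That heuristic, the class name `PrefetchMonitor`, and its method names are assumptions for illustration, not details given in this description.

```python
class PrefetchMonitor:
    """Toy model of prefetch logic that demotes prefetches after harmful evictions."""

    def __init__(self):
        self.evicted_by_prefetch = set()  # addresses displaced by prefetch fills
        self.low_priority = False         # models the designation bit

    def on_prefetch_fill(self, victim_addr):
        # Record which line (if any) the returned prefetch data evicted.
        if victim_addr is not None:
            self.evicted_by_prefetch.add(victim_addr)

    def on_demand_miss(self, addr):
        # A miss on a line that a prefetch evicted suggests the line was important.
        if addr in self.evicted_by_prefetch:
            self.low_priority = True

    def tag_prefetch(self, addr):
        # Subsequent prefetch requests carry the current designation.
        return {"addr": addr, "low_priority": self.low_priority}
```

A request tagged with `low_priority` set would then have its returned data placed in a cache associated with another processor.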
Referring first to
Referring to
In one implementation, the cache coherency controller 320 sets bits associated with the data stored in the L2 cache 316 that indicate the number of times that the data has been accessed by the processor 302. Further, in this implementation, a user can set a pre-determined access threshold that indicates the number of accesses of the data (of the processor 302) that is required prior to the data being copied from the L2 cache 316 to a cache associated with the processor 302—i.e., the L1 cache 310 or the L2 cache 312. Thus, for example, if the access threshold is set to 1 for a given line of data stored in the L2 cache 316, then the very first access of the line of data in the L2 cache 316 will prompt the cache coherency controller 320 to move the line of data from the L2 cache 316 to a cache associated with the processor 302. In like manner, if the access threshold is set to 2, then the second access of the line of data in the L2 cache 316 by the processor 302 will prompt the cache coherency controller 320 to copy the line of data from the L2 cache 316 to a cache associated with the processor 302. In this implementation, a user can control an amount of cache pollution by tuning the access threshold. The user can consider factors including cache coherency, inclusiveness, and the desire to keep cache pollution to a minimum when establishing access thresholds for cached data.
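The access threshold behavior can be illustrated with a short sketch. It assumes a per-line access counter and models the move variant (the line is removed from the remote cache once promoted); a copy variant would simply leave the line in place. The names `RemoteLine` and `access_remote` are hypothetical.

```python
class RemoteLine:
    """A line held in the other processor's cache, with an access counter."""

    def __init__(self, data):
        self.data = data
        self.accesses = 0  # models the bits set by the cache coherency controller


def access_remote(addr, remote_cache, local_cache, threshold):
    """Serve a remote hit and promote the line once the threshold is reached."""
    line = remote_cache[addr]
    line.accesses += 1
    if line.accesses >= threshold:
        local_cache[addr] = line.data  # promote into the requester's cache
        del remote_cache[addr]         # "move" variant; omit this line to copy
    return line.data
```

With `threshold=1` the first access promotes the line; with `threshold=2` the line stays in the remote cache after the first access and is promoted on the second.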
In one implementation, an operating system can be used to monitor the load on individual processors within a multiprocessor system and their corresponding cache utilizations and cache miss rates to control whether the cache coherency controller should enable data corresponding to a low priority request of a first processor to be stored within a cache associated with a second processor. For example, if the operating system detects that the cache associated with a second processor is being underutilized—or the cache miss rate of the cache is low—then the operating system can direct the cache coherency controller to store data requested by the first processor within the cache associated with a second processor. In one implementation, the operating system can dynamically enable and disable data corresponding to a low priority request of a first processor to be stored within a cache associated with a second processor in a transparent manner during operation.
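The operating system policy described above can be sketched as a simple predicate. The numeric limits below are placeholder assumptions; the description does not specify what counts as "underutilized" or a "low" miss rate, and the function name `should_share` is hypothetical.

```python
def should_share(utilization, miss_rate, util_limit=0.5, miss_limit=0.05):
    """Decide whether a second processor's cache may hold another processor's data.

    utilization and miss_rate are fractions in [0, 1] measured for the
    second processor's cache. The default limits are illustrative only.
    """
    underutilized = utilization < util_limit
    low_miss_rate = miss_rate < miss_limit
    # Either condition is sufficient to enable sharing.
    return underutilized or low_miss_rate
```

An operating system could re-evaluate such a predicate periodically and enable or disable sharing through the cache coherency controller as the load changes.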
One or more of the method steps described above can be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Generally, the techniques described above can take the form of an entirely hardware implementation, or an implementation containing both hardware and software elements. Software elements include, but are not limited to, firmware, resident software, microcode, etc. Furthermore, some techniques described above may take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
Design process 410 may include using a variety of inputs; for example, inputs from library elements 430 which may house a set of commonly used elements, circuits, and devices, including models, layouts, and symbolic representations, for a given manufacturing technology (e.g., different technology nodes, 32 nm, 45 nm, 90 nm, etc.), design specifications 440, characterization data 450, verification data 460, design rules 470, and test data files 485 (which may include test patterns and other testing information). Design process 410 may further include, for example, standard circuit design processes such as timing analysis, verification, design rule checking, place and route operations, etc. One of ordinary skill in the art of integrated circuit design can appreciate the extent of possible electronic design automation tools and applications used in design process 410 without deviating from the scope and spirit of the invention. The design structure of the invention is not limited to any specific design flow.
Design process 410 preferably translates a circuit as described above and shown in
For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk—read only memory (CD-ROM), compact disk—read/write (CD-R/W) and DVD.
Various implementations for caching data in a multiprocessor system have been described. Nevertheless, various modifications may be made to the implementations described above, and those modifications would be within the scope of the present invention. For example, method steps discussed above can be performed in a different order and still achieve desirable results. Also, in general, method steps discussed above can be implemented through hardware logic, or a combination of software and hardware logic. The techniques discussed above can be applied to multiprocessor systems including, for example, in-order execution processors, out-of-order execution processors, both programmable and non-programmable processors, processors with on-chip or off-chip memory controllers and so on. Accordingly, many modifications may be made without departing from the scope of the present invention.
This application is a continuation-in-part of co-pending U.S. patent application Ser. No. 11/566,187, filed Dec. 1, 2006, which is herein incorporated by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 11566187 | Dec 2006 | US |
Child | 12147789 | | US |