The present application is related to and claims the benefit of co-pending U.S. patent application Ser. No. 11/056,673, entitled “CACHE MEMORY DIRECT INTERVENTION,” filed on the same date herewith, which is incorporated herein by reference in its entirety.
1. Technical Field
The present invention relates generally to computer memories, and in particular, to a system and method for implementing direct cache intervention across semi-private cache memory units. The present invention further relates to processing of castouts in a manner enabling victim caching across same-level cache memories deployed from hierarchically distinct cache memories.
2. Description of the Related Art
A conventional symmetric multiprocessor (SMP) computer system, such as a server computer system, includes multiple processing units all coupled to a system interconnect, which typically comprises one or more address, data and control buses. Coupled to the system interconnect is a system memory, which represents the lowest level of volatile memory in the multiprocessor computer system and which generally is accessible for read and write access by all processing units. In order to reduce access latency to instructions and data residing in the system memory, each processing unit is typically further supported by a respective multi-level cache hierarchy, the lower level(s) of which may be shared by one or more processor cores.
Cache memories are commonly utilized to temporarily store values that might be accessed by a processor in order to speed up processing by reducing the access latency introduced by having to load needed values from memory. In some multiprocessor (MP) systems, the cache hierarchy includes at least two levels. The level one (L1), or upper-level, cache is usually a private cache associated with a particular processor core and cannot be accessed by other cores in an MP system. The processor core first looks for data in the upper-level cache. If the requested data is not found in the upper-level cache, the processor core then accesses lower-level caches (e.g., level two (L2) or level three (L3) caches) for the requested data. The lowest level cache (e.g., L3) is often shared among several processor cores.
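The lookup order described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each cache level is modeled as a Python dictionary keyed by address; the function name `lookup` and the sample addresses are hypothetical and do not appear in the specification.

```python
def lookup(address, levels, memory):
    """Search the cache levels from upper (L1) to lower, then system memory.

    `levels` is an ordered list of (name, cache) pairs, upper level first.
    Returns (value, source), where source names the level that held the value.
    """
    for name, cache in levels:
        if address in cache:
            return cache[address], name
    # No cache level held the line, so the value comes from system memory.
    return memory[address], "memory"


# Hypothetical contents: each level holds a different line.
l1 = {0x10: "a"}
l2 = {0x20: "b"}
l3 = {0x30: "c"}
memory = {0x10: "a", 0x20: "b", 0x30: "c", 0x40: "d"}
levels = [("L1", l1), ("L2", l2), ("L3", l3)]
```

In this toy model a request for address 0x40 misses every cache level and is satisfied from system memory, which is the high-latency case that a cache hierarchy is meant to make rare.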
At the not fully shared levels of memory (typically one or more of the upper levels such as L1, L2, and L3 cache levels within a given cache hierarchy), the cache memory is directly accessible by its processor core and other cache units that are part of the same hierarchy. For upper level cache units outside the given hierarchy and system memory, the given cache is not directly accessible but must instead be accessed by a shared bus transaction in which read and write requests are placed on a shared bus and retrieved and responded to by lower level memory or intervention snooping.
There is a need for a more intelligent system and method for managing a multi-level memory hierarchy to reduce unnecessary memory bus traffic and latency. There is also a need to improve utilization of cache memories included in hierarchies having non-utilized processors.
The present invention addresses these and other needs unresolved by the prior art.
It is therefore one object of the invention to provide an improved method for handling cache operations in a multiprocessor computer system.
It is another object of the present invention to provide such a method that enables direct cache intervention across multiple same-level caches that reside in different cache hierarchies.
It is yet another object of the present invention to provide a computer system that leverages the direct intervention method to provide fully accessible victim caching across caches residing in different cache hierarchies.
The foregoing objectives are achieved in a method, system, and device for enabling intervention across same-level cache memories as disclosed herein. In a preferred embodiment, a direct intervention request is sent from the first cache memory to a second cache memory requesting a direct intervention that satisfies a data access request sent from a processor core to the first cache memory. In another embodiment, the present invention provides a direct castin technique combined with the direct intervention to enable victim caching across same-level cache memories deployed from hierarchically distinct cache memories.
The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed written description.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself however, as well as a preferred mode of use, further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The present invention is generally directed to cache intervention and, more specifically, to an improved cache intervention technique whereby private cache memories directly access other private caches without the need for shared interconnect request processing.
With reference now to the figures, wherein like reference numerals refer to like and corresponding parts throughout, and in particular with reference to
In the depicted embodiment, each processing node 102 is realized as a multi-chip module (MCM) containing four processing units 104a-104d, each preferably realized as a respective integrated circuit. The processing units 104 within each processing node 102 are coupled for communication to each other and system interconnect 110 by a local interconnect 114, which, like system interconnect 110, may be implemented, for example, with one or more buses and/or switches.
The devices attached to each local interconnect 114 include not only processing units 104, but also one or more memory controllers (not depicted), each providing an interface to a respective system memory 108 (depicted in
Those skilled in the art will appreciate that SMP data processing system 100 can include many additional non-illustrated components, such as interconnect bridges, non-volatile storage, ports for connection to networks or attached devices, etc. Because such additional components are not necessary for an understanding of the present invention, they are not illustrated in
Referring now to
The operation of each processor core 200 is supported by a multi-level volatile memory hierarchy having at its lowest level shared system memory 108, and at its upper levels one or more levels of cache memory, which in the illustrative embodiment include a store-through level one (L1) cache 226 within and private to each processor core 200, and a respective level two (L2) cache 230, which, as explained in further detail below, is semi-private to its respective core and is accessible via the direct intervention technique of the present invention. L2 cache 230 includes an L2 array and directory 234, a master 232 and a snooper 236. Master 232 initiates transactions on local interconnect 114 and system interconnect 110 and accesses L2 array and directory 234 in response to memory access (and other) requests received from the associated processor core 200. Snooper 236 snoops operations on local interconnect 114, provides appropriate responses, and performs any accesses to L2 array and directory 234 required by the operations.
Although the illustrated cache hierarchies include only two levels of cache, those skilled in the art will appreciate that alternative embodiments may include additional levels (L3, L4, etc.) of on-chip or off-chip in-line or lookaside cache, which may be fully inclusive, partially inclusive, or non-inclusive of the contents of the upper levels of cache.
Each processing unit 104 includes an integrated I/O (input/output) controller 214 supporting the attachment of one or more I/O devices. As discussed further below, I/O controller 214 may issue read and write operations on its local interconnect 114 and system interconnect 110, for example, in response to requests by attached I/O devices (not depicted).
As further illustrated in
With reference now to
In its conventional role, arbiter logic 305 arbitrates the order of processing of memory access requests from core 200 and interconnect 114. Memory access requests, including load and store operations, are forwarded in accordance with the arbitration policy implemented by arbiter 305 to a dispatch pipe 306 where each read and write request is processed with respect to directory 308 over a given number of cycles. The direct intervention module 250 depicted in
As further shown in
L2 cache 230 further includes an RC queue 320 and a CPI (castout push intervention) queue 318 that buffer data being inserted and removed from the cache array 302. RC queue 320 includes a number of buffer entries that each individually correspond to a particular one of RC machines such that each RC 312 that is dispatched retrieves data from only the designated buffer entry. Similarly, CPI queue 318 includes a number of buffer entries that each individually correspond to a particular one of the castout machines 310 and snoop machines 236, such that each CO machine 310 and each snooper 236 that is dispatched retrieves data from only the respective designated CPI buffer entry.
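The one-to-one correspondence between buffer entries and state machines can be pictured with the following sketch. The class name `MachineBufferQueue` and the machine identifiers are assumptions introduced only for illustration; the specification does not give an implementation.

```python
class MachineBufferQueue:
    """Queue whose buffer entries map one-to-one onto state machines."""

    def __init__(self, machine_ids):
        # One dedicated entry per machine; None means the entry is empty.
        self._entries = {machine_id: None for machine_id in machine_ids}

    def fill(self, machine_id, data):
        """Place data in the entry reserved for the given machine."""
        if machine_id not in self._entries:
            raise KeyError(f"no buffer entry for machine {machine_id!r}")
        self._entries[machine_id] = data

    def drain(self, machine_id):
        """Return and clear the data held for the given machine only."""
        data = self._entries[machine_id]
        self._entries[machine_id] = None
        return data


# Hypothetical sizing: two RC machines; one CO machine and two snoopers.
rc_queue = MachineBufferQueue(["RC0", "RC1"])
cpi_queue = MachineBufferQueue(["CO0", "SN0", "SN1"])
```

Because a dispatched machine only ever drains its own entry, the identifier of the machine is sufficient to locate the buffered line.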
Referring to
Following release of the load from dispatch pipe 306, continued processing of the command depends on availability of one of RC machines 312 for processing the command. As shown at steps 408, 410, and 422, the processing of the load operation terminates if no RC machine 312 is available. Otherwise, an available RC machine 312 is dispatched to handle the load operation as depicted at steps 408 and 412. A pass indicator signals a successfully dispatched RC (step 414) so that the load is not re-issued. If the requested cache line is in array 302 and is verified by the coherence state read from directory 308 as valid, the RC machine 312 signals the third multiplexer M3 to return the data to core 200 as shown at steps 416 and 418. Processing of the cache hit concludes with the dispatched RC machine 312 being de-allocated or released as shown at steps 420 and 422.
Following successful dispatch of the CO machine (step 442), arbiter 305 reads the victim cache line out of array 302 to a CPI (castout push intervention) queue 318 (step 444) in preparation for the victim line to be placed in a lower level cache or system memory. Responsive to both the victim line being read out to CPI queue 318 at step 444 (if a castout was necessary) and also the read data being forwarded at step 436, the data is transferred from RC queue 320 into the appropriate line in array 302, as shown at steps 437 and 446. After the data is transferred from RC queue 320 into the appropriate line in array 302, the RC machine is deallocated and the read process terminates as depicted at steps 420 and 422.
Returning to castout processing, the CO machine 310 issues a request to fabric controller 316 for the victim line to be pushed from CPI queue 318 to the lower level memory via interconnect 114 (step 448). The victim line push is processed and completed followed by the CO machine being released as shown at steps 450, 451, and 452.
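The conventional load handling described above may be summarized by the following simplified sketch. It is an illustration under stated assumptions rather than the disclosed hardware: the function name, the event strings, the dictionary-based array, and the choice of the oldest-inserted line as victim are all hypothetical, and the RC and CO machines are reduced to log entries.

```python
def conventional_load(address, array, capacity, free_rc_machines, lower_memory):
    """Process one load; return (data, events). Data is None if no RC was free."""
    events = []
    if free_rc_machines == 0:
        events.append("no RC machine available")
        return None, events
    events.append("RC dispatched")
    if address in array:
        events.append("hit: data returned to core")
        events.append("RC released")
        return array[address], events

    # Miss: fetch the line over the shared interconnect.
    events.append("miss: read request issued on interconnect")
    data = lower_memory[address]
    events.append("data forwarded to core")

    cpi_entry = None
    if len(array) >= capacity:
        # Hypothetical victim choice: the oldest-inserted line.
        victim_address = next(iter(array))
        cpi_entry = (victim_address, array.pop(victim_address))
        events.append("victim read out to CPI queue")

    array[address] = data
    events.append("line installed in array")
    events.append("RC released")

    if cpi_entry is not None:
        lower_memory[cpi_entry[0]] = cpi_entry[1]
        events.append("victim pushed to lower-level memory")
        events.append("CO released")
    return data, events
```

Note that every miss in this conventional flow costs at least one shared interconnect transaction, and a second one when a victim must be pushed.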
The present invention provides an improved intervention method by which caches, such as L2 caches 230a and 230b which are otherwise private to their respective cores, can perform what is referred to herein as a “direct intervention” in response to a cache miss. As will be depicted and explained with reference to the following figures such direct intervention is performed speculatively (i.e. in parallel with the memory access operation in the host cache) and reduces the likelihood of having to process a shared bus request responsive to a cache miss.
As explained with reference to
Following release of the load from dispatch pipe 306a, continued processing of the load operation depends on availability of one of RC machines 312a for processing the command. As shown at steps 610, 612, 614, and 628, processing of the load operation terminates if no RC machine 312 is available. Furthermore, an RC dispatch failure results in arbiter 305a issuing a direct intervention cancellation signal (not depicted) to the L2.1 arbiter 305b (step 614) resulting in L2.1 cache 230b canceling further processing of the direct intervention request.
Otherwise, as shown at step 610 and 616, an available RC machine 312 is dispatched to handle the L2.0 load operation. A pass indicator signals a successfully dispatched RC (step 618) so that the load is not re-issued. If the requested cache line is in L2.0 array 302a and is verified by the coherence state read from directory 308a as valid, RC machine 312a signals the third multiplexer M3 to return the data to core 200a as shown at steps 620 and 624. Given the successful load, arbiter 305a issues a direct intervention cancellation signal to the L2.1 arbiter 305b (step 622) to cancel further L2.1 cache 230b processing of the direct intervention request. Processing of the cache hit concludes by deallocating the dispatched RC machine 312a as shown at steps 626 and 628.
Next are described the steps performed by the L2.0 cache 230a responsive to a miss at step 620 in accordance with the direct intervention mechanism and technique of the present invention. As shown on
If a fast or slow NACK has been received by the L2.0 arbiter 305a (step 630) and the L2.0 cache 230a misses at step 620, the load operation processing commences in the conventional manner. Namely, a read request is issued onto interconnect 114 as shown at step 642. The assigned RC machine 312a issues the read request on interconnect 114 and waits for return of the requested data into RC queue 320a that buffers incoming cache lines to be placed in array 302a. Once the data is returned to RC queue 320a (step 644) the data is forwarded to processor core 200a via M3 (step 645).
If a castout was not required (step 646), the castout process ends as shown at step 660. If a castout is required in accordance with congruence class occupancy or otherwise, RC machine 312a issues a castout request via M1 to arbiter 305a and dispatch pipe 306a, which dispatches one of CO machines 310a to handle the castout, as illustrated at steps 646 and 650. RC machine 312a may have to repeat the castout request until a CO machine 310a is available and successfully dispatched (steps 650 and 652).
Following successful dispatch of the CO machine 310 (step 652), arbiter 305a reads the victim cache line out of array 302a to CPI queue 318a (step 654) in preparation for the victim line to be placed in a lower level cache or system memory. Responsive to both the victim line being read out to CPI queue 318a at step 654 (if a castout was required) and also the read data being forwarded at step 645, the data buffered in the RC queue 320a is transferred into the appropriate line in array 302a as shown at steps 647 and 648. Finally, RC machine 312a is released as shown at step 626 and the read process concludes at step 628.
Returning to castout processing, the CO machine 310a issues a request to fabric controller 316 for the victim line to be pushed from CPI queue 318a to the lower level memory via interconnect 114 (step 656). The victim line push is processed and completed and the CO machine 310a released as shown at steps 658, 659, and 660.
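The choice the L2.0 cache makes among its possible next actions can be sketched as a small decision function. The names `Response` and `l20_next_action` and the returned strings are hypothetical; the sketch only assumes that the local hit or miss result and the fast and slow responses from the L2.1 cache are all known when the decision is taken.

```python
from enum import Enum


class Response(Enum):
    ACK = "ack"
    NACK = "nack"


def l20_next_action(local_hit, fast, slow):
    """Pick the next step for the requesting (L2.0) cache.

    local_hit: True if the line was found valid in the L2.0 array.
    fast, slow: the fast and slow responses received from the L2.1 cache.
    """
    if local_hit:
        # The speculative request to the L2.1 cache is no longer needed.
        return "return local data and cancel direct intervention"
    if fast is Response.ACK and slow is Response.ACK:
        # Both acknowledgements are positive: pull the line directly.
        return "send push request to L2.1 CPI queue"
    # Any NACK falls back to the conventional shared-bus read.
    return "issue read request on interconnect"
```

Only the last branch places a request on the shared interconnect, which is the traffic the direct intervention is intended to avoid.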
If, however, both a fast and a slow positive acknowledgement were received by L2.0 arbiter 305a as shown at steps 630 and 632, the process continues with arbiter 305a sending the push request to L2.1 CPI queue 318b (step 634). The request preferably includes the tag or other identifier of the L2.1 snoop machine 236b that was dispatched by arbiter 305b responsive to the direct intervention request (explained further with reference to
Referring to
As shown at step 712, and referring back to blocks 614 and 622 of
In response to an L2.1 cache miss, arbiter 305b sends a SLOW NACK to arbiter 305a to terminate the direct intervention process and signal the L2.0 cache 230a to proceed with a typical shared bus load request, and de-allocates the snoop machine 236b allocated in step 718, as shown at steps 722, 723, 716, and 744. Otherwise, responsive to a cache hit at step 722, the direct intervention process continues with arbiter 305b sending a SLOW ACK to L2.0 arbiter 305a including the tag identifier of the snoop machine 236b dispatched at block 718. Next, as illustrated at step 726, L2.1 arbiter 305b reads the cache line from cache array 302b into the buffer entry of CPI queue 318b corresponding to the dispatched snoop machine 236b.
Proceeding as shown at steps 728 and 730, when CPI queue 318b receives the request sent as shown at block 634 from L2.0 arbiter 305a with the snoop tag identifier, the data is sent to the buffer entry in RCQ 320a corresponding to the L2.0 RC machine 312a handling the load operation. Having thus directly transferred the data without undertaking a shared bus transaction, the direct intervention process ends as shown at steps 732 and 734 with the L2.1 snoop machine 236b being deallocated.
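The responding side of the exchange can be modeled with the following sketch, which covers only the slow response and the tagged data transfer described above. The class `L21DirectIntervention`, its method names, and the use of dictionaries for the array and queues are illustrative assumptions, not the disclosed logic.

```python
class L21DirectIntervention:
    """Toy model of the responding (L2.1) cache; all names are illustrative."""

    def __init__(self, array):
        self.array = array          # address -> cache line data
        self.cpi_queue = {}         # snoop-machine tag -> buffered line
        self.busy_snoopers = set()  # tags of snoop machines in use

    def handle_request(self, address, snoop_tag):
        """Return ('SLOW_ACK', tag) on a hit or ('SLOW_NACK', None) on a miss."""
        self.busy_snoopers.add(snoop_tag)  # snoop machine allocated
        if address not in self.array:
            self.busy_snoopers.discard(snoop_tag)  # released on a miss
            return "SLOW_NACK", None
        # Hit: stage the line in the CPI entry owned by this snoop machine.
        self.cpi_queue[snoop_tag] = self.array[address]
        return "SLOW_ACK", snoop_tag

    def handle_push(self, snoop_tag, rc_queue, rc_id):
        """Move the staged line into the requester's RC queue entry."""
        rc_queue[rc_id] = self.cpi_queue.pop(snoop_tag)
        self.busy_snoopers.discard(snoop_tag)  # snoop machine deallocated
```

The snoop tag returned with the acknowledgement is what lets the requester name the exact CPI buffer entry to drain, so no address lookup is repeated during the transfer.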
The present invention further provides an improved castout processing method and mechanism that enables a cache unit included in a memory hierarchy of a processor core to castout “sideways” to another same-level cache unit that is otherwise within the private memory hierarchy of another core and which may serve as a victim cache under certain circumstances. Referring to
The invention is applicable to castout operations resulting from load or store operations and
Following release of the store from dispatch pipe 306, continued processing of the command depends on availability of one of RC machines 312 for processing the command. As shown at steps 808, 810, and 822, the processing of the store operation terminates if no RC machine 312 is available. Otherwise, an available RC machine 312 is dispatched to handle the store operation as depicted at steps 808 and 812. A pass indicator signals a successfully dispatched RC (step 814) so that the store is not re-issued. If the requested cache line is in array 302 and is verified by the coherence state read from directory 308 as valid and exclusive to the cache, the data is store merged in array 302 as shown at steps 816 and 818. Processing of the cache hit concludes with the dispatched RC machine 312 being de-allocated or released as shown at steps 820 and 822.
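The store-hit condition above, in which the line must be both valid and exclusively held, can be expressed as a small check. The sketch assumes a MESI-style protocol, which the description does not name; the `State` enumeration and the function `store_hits` are hypothetical.

```python
from enum import Enum


class State(Enum):
    # Assumed MESI-style coherence states; the text does not name a protocol.
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def store_hits(directory, address):
    """Return True if a store can be merged directly into the array.

    The line must be present and held exclusively by this cache, which in
    the assumed protocol means the Modified or Exclusive state.
    """
    state = directory.get(address, State.INVALID)
    return state in (State.MODIFIED, State.EXCLUSIVE)
```

Under this assumption a line held in a shared state does not satisfy the hit condition, since another cache may also hold a copy.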
In the case of a true miss, and as depicted at step 832, the assigned RC machine 312 issues a read with intent to modify (RWITM) request on interconnect 114 and awaits return of the requested data into an RC queue 320 that buffers incoming cache lines to be placed in array 302. As shown at step 838, if a castout from the target congruence class in array 302 is not required, the castout process ends as shown at step 852. If a castout is required in accordance with congruence class occupancy or otherwise, RC machine 312 issues a castout request via M1 to arbiter 305 and dispatch pipe 306, which dispatches one of CO machines 310 to handle the castout, as illustrated at steps 838 and 840. The relative instruction processing responsibilities usually dictate that there are a greater number of RC machines 312 than CO machines 310. RC machine 312 therefore repeats the castout request until a CO machine 310 is available and successfully dispatched (steps 840 and 842).
Following successful dispatch of the CO machine (step 842), arbiter 305 reads the victim cache line out of array 302 to a CPI (castout push intervention) queue 318 (step 844) in preparation for the victim line to be placed in a lower level cache or system memory. Responsive to both the victim line being read out to CPI queue 318 at step 844 (if a castout was necessary) and the data being returned to the RCQ at step 834, the data is read from the RC queue 320 to the L2 (step 846) and the store data is merged into the appropriate line in array 302, as shown at step 847.
Returning to castout processing, the CO machine 310 issues a request to fabric controller 316 for the victim line to be pushed from CPI queue 318 to the lower level memory via interconnect 114 (step 848). The victim line push is processed and completed followed by the CO machine being released as shown at steps 850, 851 and 852.
The present invention provides an improved castout/castin method by which caches, such as L2 caches 230a and 230b which are otherwise private to their respective cores, can perform parallel victim caching in response to a cache miss necessitating a castout. In addition to providing a fast and high-capacity victim cache among same-level cache memories (i.e., L2-to-L2) without having to process a shared bus request, the invention facilitates maximum utilization of memory resources in a multiprocessor system in which each core has direct (i.e., non-snooped) access to its respective hierarchy.
Referring to
Following successful dispatch of the CO machine (step 910), L2.0 arbiter 305a reads the victim cache line out of array 302a to CPI queue 318a (step 912) in preparation for the victim line to be selectively placed in a lower level cache or system memory as in conventional castout operations or in the L2.1 cache 230b in accordance with the invention. Responsive to the victim line being read out to CPI queue 318a, the read or write data buffered in the RC queue 320a is placed in the appropriate line in array 302a at step 914, which has been described, and the L2.0 CO machine 310a issues a request to fabric controller 316 for the victim line to be pushed from CPI queue 318a (step 916).
In accordance with the invention, the handling of the push request from L2.0 CO machine 310a depends on whether L2.0 cache 230a and L2.1 cache 230b are presently operating in the parallel victim cache mode of the present invention. For example, the parallel victim cache mode may be prompted by one of the cores (the 200b core in the presently described embodiment) being faulty or otherwise rendered non-functional. In such a case, the memory hierarchy directly associated with the non-functioning core (the L2.1 cache 230b in the presently described embodiment) is available as a victim cache to accept castouts from the same-level cache unit (the L2.0 cache 230a in the present embodiment). In a preferred embodiment, fabric controller 316 may read a flag in a configuration register 332 that indicates whether or not the cache units 230a and 230b are operating in parallel victim cache mode.
If, for example and as depicted at steps 918 and 920, parallel victim cache mode is not enabled in terms of L2.1 cache 230b operating in castin mode as indicated by configuration register 332, the castout is performed in the conventional manner in which the victim data is pushed to lower level memory via interconnect 114 and the castout concludes with the L2.0 castout machine 310a de-allocated (steps 920, 924, 926, and 928). If the configuration register 332 indicates that L2.1 cache 230b is operating in victim castin mode, fabric controller 316 sends a castin request to the L2.1 op select MUX M1. L2.1 cache 230b then processes the castin request as now depicted and described in
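The routing decision made for each castout can be sketched as follows. The bit mask `PARALLEL_VICTIM_MODE`, the function `route_castout`, and the tuple layout of the victim are assumptions made for illustration; the specification states only that a flag in the configuration register selects the mode.

```python
# Hypothetical bit position of the mode flag within the configuration register.
PARALLEL_VICTIM_MODE = 0x1


def route_castout(config_register, victim, l21_castin_requests, lower_memory):
    """Route one victim line according to the configuration flag.

    victim is an (address, data, co_tag) tuple, where co_tag identifies the
    L2.0 castout machine whose CPI queue entry holds the line.
    """
    address, data, co_tag = victim
    if config_register & PARALLEL_VICTIM_MODE:
        # Victim mode: ask the L2.1 cache to cast the line in. The data stays
        # in the L2.0 CPI queue until the L2.1 cache fetches it by tag.
        l21_castin_requests.append((address, co_tag))
        return "castin request sent to L2.1"
    # Conventional mode: push the line down over the interconnect.
    lower_memory[address] = data
    return "victim pushed to lower-level memory"
```

Keeping the mode in a single register flag means the same castout machinery serves both paths, and only the destination of the push changes.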
With reference to
Following release of the castin request from dispatch pipe 306b, continued processing of the command depends on availability of one of the L2.1 RC machines 312b for processing the command. As shown at step 1010, the process continues until an RC machine 312b is available.
Once an available RC machine 312b is dispatched to handle the request (step 1012), the RC machine 312b determines at step 1015 if a CO machine 310b is required to evict the cache block in victim cache 230b chosen to accept the castin. If no such CO machine is necessary, RC machine 312b sends a request to arbiter 305b to retrieve the L2.0 castout data from the L2.0 CPI queue 318a in accordance with the L2.0 CO tag received in the original castin request from fabric controller 316 (step 1014) and arbiter 305b signals CPI queue 318a with the tag to effectuate the transfer (step 1016).
Once the L2.0 castout data is available in the L2.1 RCQ (step 1017), L2 cache array 302b is updated as depicted at step 1018. The castin data process then continues with L2.1 arb 305b signaling the CO data transfer is complete (step 1019), deallocating L2.1 RC 312b (step 1020), and concluding as depicted at step 1032.
Returning to step 1015, if, however, it is determined that a CO machine 310b is required to evict the cache block in victim cache 230b chosen to accept the castin, the process continues to step 1022, which depicts RC 312b issuing a CO request through mux M1 to dispatch a castout machine. Once CO machine 310b is dispatched (step 1024), arbiter 305b reads the selected cache line out of cache array 302b into CPI buffer 318b (step 1026). Once the cache line being cast out of victim cache 230b has been read into CPI buffer 318b, the process continues at step 1016 to complete the castin data transfer as described above.
In addition, the process continues to steps 1028 and 1030 which depict the eviction of the selected line from L2.1 victim cache 230b to system memory via interconnect 114 and the process concludes as shown at step 1032.
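The castin handling on the L2.1 side can be summarized by the following sketch. It is a simplified model under stated assumptions: the function `handle_castin`, the event strings, the capacity-based test for whether an eviction is needed, and the choice of the oldest-inserted line as the evicted block are all hypothetical.

```python
def handle_castin(address, co_tag, l20_cpi_queue, l21_array, capacity, system_memory):
    """Accept one castin into the victim (L2.1) cache; return the event log.

    co_tag names the L2.0 CPI queue entry that holds the castout data.
    """
    events = ["RC dispatched"]
    evicted = None
    if address not in l21_array and len(l21_array) >= capacity:
        # A CO machine is needed to make room for the incoming line.
        evicted_address = next(iter(l21_array))  # hypothetical victim choice
        evicted = (evicted_address, l21_array.pop(evicted_address))
        events.append("CO dispatched; line read into L2.1 CPI buffer")

    # Fetch the castout data from the L2.0 CPI queue by tag and install it.
    l21_array[address] = l20_cpi_queue.pop(co_tag)
    events.append("castin data written to L2.1 array")
    events.append("RC released")

    if evicted is not None:
        system_memory[evicted[0]] = evicted[1]
        events.append("evicted line pushed to system memory")
    return events
```

In this model the only interconnect traffic arises when the victim cache itself must evict a line; the transfer of the castin data between the two L2 caches does not use the shared bus.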
It should be noted that the aforementioned direct intervention embodiments depicted and described with reference to
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention. These alternate implementations all fall within the scope of the invention.
Number | Date | Country | |
---|---|---|---|
20060184742 A1 | Aug 2006 | US |