A multi-core computing system may support many applications, which may be executed as threads by cores associated with one or more processors associated with the computing system. The cores may access local caches and shared caches. The shared caches may be subject to various cache-related policies, including cache replacement policies (also referred to as cache replacement algorithms).
While some of these cache replacement algorithms, such as the least recently used (LRU) algorithm, perform well with applications whose working sets fit within a single cache, they might not perform well in systems with large distributed system level caches accessed by multiple threads. In addition, other cache-related algorithms, such as insertion algorithms and allocation algorithms, may also perform poorly in such systems. Accordingly, there is a need for systems and methods for effective set sampling and set-dueling in large distributed system level caches.
In one example, the present disclosure relates to a method for selecting a cache algorithm in a system having a plurality of cores and a plurality of shared cache instances accessible to any of the plurality of cores, where the system is configurable to execute threads. The method may include a shared cache instance, from among the plurality of shared cache instances, receiving a request associated with a thread, where the request comprises policy information for specifying at least one of two cache algorithms for implementation by the shared cache instance for any requests associated with the thread.
The method may further include the shared cache instance implementing the at least one of the two cache algorithms specified by the policy information received as part of the request associated with the thread unless the shared cache instance is identified as a delegated shared cache instance, from among the shared cache instances, for determining a winner between the two cache algorithms for use with any requests associated with the thread.
In another example, the present disclosure relates to a system having a plurality of cores and a plurality of shared cache instances accessible to any of the plurality of cores, where the system is configurable to execute threads. The system may include a shared cache instance, from among the plurality of shared cache instances, to receive a request associated with a thread, where the request comprises policy information for specifying at least one of two cache algorithms for implementation by the shared cache instance for any requests associated with the thread.
The system may further include shared cache instance circuitry, associated with the shared cache instance, configured to process the policy information received as part of the request associated with the thread. The shared cache instance circuitry may further be configured to instruct the shared cache instance to implement the at least one of the two cache algorithms unless the shared cache instance is identified by the shared cache instance circuitry as a delegated shared cache instance, from among the shared cache instances, for determining a winner between the at least two cache algorithms for use with any requests associated with the thread.
In yet another example, the present disclosure relates to a method for selecting a cache algorithm in a system having a plurality of cores and a plurality of shared cache instances accessible to any of the plurality of cores, where the system is configurable to execute threads. The method may include designating a shared cache instance as a first delegated shared cache instance for determining a winner between at least two cache algorithms for access requests associated with a thread. The method may further include delegating another shared cache instance as a second delegated shared cache instance for determining the winner between the at least two cache algorithms for access requests associated with the thread.
The method may further include communicating policy information specifying the winner between the at least two cache algorithms to each of the plurality of cores. The method may further include a shared cache instance, from among the plurality of shared cache instances, upon receiving a request for cache access associated with the thread, implementing one of the at least two cache algorithms specified by the policy information received as part of the request for the cache access unless the shared cache instance receiving the request is identified as the first delegated shared cache instance or the second delegated shared cache instance.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The present disclosure is illustrated by way of example and is not limited by the accompanying figures, in which like references indicate similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
Examples described in this disclosure relate to systems and methods for effective set sampling and set-dueling in large distributed system level caches. Certain examples relate to systems with multiple cores in a multi-threaded computing system. The multi-threaded computing system may be a standalone computing system or may be part (e.g., a server) of a public cloud, a private cloud, or a hybrid cloud. The public cloud includes a global network of servers that perform a variety of functions, including storing and managing data, running applications, and delivering content or services, such as streaming videos, electronic mail, office productivity software, or social media. The servers and other components may be located in data centers across the world. While the public cloud offers services to the public over the Internet, businesses may use private clouds or hybrid clouds. Both private and hybrid clouds also include a network of servers housed in data centers. Applications may be executed using compute and memory resources of the standalone computing system or a computing system in a data center. As used herein, the term “application” encompasses, but is not limited to, any executable code (in the form of hardware, firmware, software, or in any combination of the foregoing) that implements a functionality, a virtual machine, a client application, a service, a micro-service, a container, or a unikernel for serverless computing. Alternatively, applications may be executing on hardware associated with an edge-compute device, on-premises servers, or other types of systems, including communications systems, such as base stations (e.g., 5G or 6G base stations).
Computing systems contain several types of memories, including caches. Caches help alleviate the long latency associated with access to main memories (e.g., double data rate (DDR) dynamic random access memory (DRAM)) by providing data with low latency. A processor may have access to a cache hierarchy, including L1 caches, L2 caches, and L3 caches, where the L1 caches may be closest to the processing cores and the L3 caches may be the furthest. A data access is made to the caches first; if the data is found in a cache, the access is viewed as a hit. If the data is not found in the cache, the access is viewed as a miss, and the data must be loaded from the main memory (e.g., the DRAM). Managing caches, including implementing the various cache policies, is a difficult problem in systems that include multiple cores and have shared system level caches.
Examples described herein relate to systems and methods for dynamic set sampling (DSS) and set-dueling in distributed system level caches (SLCs) shared by many cores. Certain conventional methods distribute the sampled sets across multiple shared cache instances (SCIs), but this approach reduces accuracy: there is not enough resolution per SCI to capture each thread's performance, and the information gathered at each SCI is not combined per thread.
In a large system hosting multiple users, shared cache resources are contested by numerous applications. These applications might have different access patterns to the system level cache (SLC) and may require the SLC to manage the cache lines owned by one application differently from those owned by another. One example of such cache management is a cache replacement algorithm. Different applications will prefer different cache replacement algorithms. Certain caches may implement dynamic replacement algorithms, such as dynamic re-reference interval prediction (DRRIP) and dynamic insertion policy (DIP), where the cache changes the policy to one that improves the hit-rate of the cache for the application currently being executed by the core(s).
Dynamic replacement algorithms may accomplish this by dynamic set sampling (DSS) and set-dueling. DSS leverages the observation that the behavior of a small portion of the cache is statistically sufficient to approximate the behavior of the entire cache. For example, in a cache with 2048 sets, 32 to 64 sets (referred to as the leader sets) may be sufficient. With DSS, the cache always performs a replacement algorithm “ReplA” on some of the leader sets (e.g., 32 leader sets) and “ReplB” on some other leader sets (e.g., another 32 leader sets) to approximate the cache behavior as if the entire cache were performing ReplA or ReplB. Set-dueling makes ReplA and ReplB compete against each other to identify the winner that provides the highest hit-rate (or equivalently, the lowest miss-rate). A dueling counter may be used to implement set-dueling. A miss in the leader sets of ReplA decrements the dueling counter, while a miss in the leader sets of ReplB increments it. In this manner, the dueling counter indicates which policy generates more misses, and the opposite replacement policy is chosen to maximize hits. The chosen replacement policy is implemented on the rest of the sets in the cache (referred to as the follower sets).
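For illustration, the set-dueling mechanism can be sketched in a few lines of Python. The set counts, the 10-bit saturating counter, and all names below are illustrative assumptions chosen for readability, not details of any particular hardware implementation.

```python
# Minimal sketch of dynamic set sampling (DSS) with set-dueling.
NUM_LEADERS = 32                 # leader sets per policy (of 2048 total sets)
CTR_BITS = 10                    # width of the saturating dueling counter
CTR_MAX = (1 << CTR_BITS) - 1
MIDPOINT = CTR_MAX // 2

dueling_counter = MIDPOINT       # start undecided

def leader_policy(set_index):
    """Leader sets are statically assigned; every other set is a follower."""
    if set_index < NUM_LEADERS:
        return "ReplA"
    if set_index < 2 * NUM_LEADERS:
        return "ReplB"
    return "follower"

def on_miss(set_index):
    """A miss in ReplA's leaders decrements the counter; a miss in
    ReplB's leaders increments it (saturating at both ends)."""
    global dueling_counter
    role = leader_policy(set_index)
    if role == "ReplA":
        dueling_counter = max(0, dueling_counter - 1)
    elif role == "ReplB":
        dueling_counter = min(CTR_MAX, dueling_counter + 1)

def winner():
    """A high counter means ReplB missed more, so ReplA wins; the
    follower sets implement whichever policy is currently winning."""
    return "ReplA" if dueling_counter > MIDPOINT else "ReplB"
```

In hardware, the comparison against the midpoint typically reduces to reading the counter's most significant bit.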
In the case of multi-core systems, different applications could prefer different replacement algorithms. Thread-aware (TA) dynamic replacement algorithms such as TA-DRRIP and TA-DIP can identify the optimum replacement algorithm for a particular thread by having leader sets per thread, per policy. A leader set assigned to thread 0 and replacement algorithm A (ReplA) statically implements ReplA for lines sourced from thread 0, while implementing, for lines sourced from any other thread, the winning policy determined by that thread's own leader sets. Therefore, effective thread-aware replacement would require 64 total leader sets per cache instance for a 1-thread system, 128 leader sets for a 2-thread system, 256 leader sets for a 4-thread system, and all 2048 sets for a 32-thread system. To support thread counts higher than 32, one would need to either reduce the number of leader sets, reducing accuracy, or group threads into clusters and pay a potential performance penalty. The sketch below makes this scaling concrete.
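The following short sketch (using the same illustrative budget of 64 leader sets per thread) tallies the leader sets required per 2048-set cache instance as the thread count grows:

```python
SETS_PER_INSTANCE = 2048
LEADERS_PER_THREAD = 64    # 32 leader sets for ReplA + 32 for ReplB

for threads in (1, 2, 4, 32, 64):
    needed = threads * LEADERS_PER_THREAD
    status = "ok" if needed <= SETS_PER_INSTANCE else "exceeds instance"
    print(f"{threads:>2} threads -> {needed:>4} leader sets ({status})")
# At 32 threads every set is a leader set, leaving no follower sets;
# at 64 threads the budget cannot be met without shrinking or clustering.
```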
In addition, conventional many-core systems cannot implement the traditional TA-DRRIP/TA-DIP algorithms at high core counts while maintaining effective sampling efficiency. One possible solution is to use information from the core's private cache to pre-emptively select the shared system level cache replacement algorithm. While this approach may work in certain situations, it has several disadvantages. First, it discards information that is only present in the shared system level cache. Second, it cannot account for the locality that is visible only at the shared system level cache after the private caches have filtered the access stream. Finally, it ignores the effect of multiple threads competing and co-existing in the shared cache space.
Examples described herein address the high core count sampling problem. These systems and methods also minimize the required dueling hardware per cache instance, while adding only minimal hardware to the cache controllers. Finally, these systems and methods add only one or a few additional bits to the payload of messages being sent in the fabric. As described herein, in certain examples, the problem of having a limited number of sets in a particular shared cache instance (SCI) to identify the best algorithm for all threads is overcome by delegating each SCI to identify the set-dueling winner for only the nearest physical core/thread. Since cache accesses for a large distributed cache are spread out, the number of leader sets per SCI is increased to maintain sufficient dynamic set sampling. In one example, the number of leader sets per SCI is increased from 32-64 per 2048 sets (per thread) to around 256-512 per 2048 sets. The proposed systems and methods allow identification of potentially the best shared system level cache algorithm for a particular thread in the presence of other competing threads.
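A minimal sketch of the delegation idea follows. The identity mapping and the instance count are placeholders; as discussed below, other core/thread-to-DSCI mappings are possible.

```python
NUM_SCIS = 4     # shared cache instances (illustrative count)

def delegated_sci(thread_id):
    """Map each thread to the shared cache instance delegated to run
    set-dueling on its behalf; here, simply the nearest (same-index)
    instance."""
    return thread_id % NUM_SCIS

# Because each delegated instance duels for only one thread, it can
# dedicate far more of its sets to sampling (e.g., 256-512 of 2048)
# than a design that duels for every thread in every instance.
```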
With continued reference to FIG. 1, the delegated shared cache instance (DSCI) for a specific thread is used to decide the replacement algorithm to be implemented across the entire shared cache for all accesses from the thread. Thus, in this example, SCI 0 142 will decide the replacement algorithm to be implemented across SSLC 160 for all accesses from thread 0. SCI 1 144 will decide the replacement algorithm to be implemented across SSLC 160 for all accesses from thread 1. SCI 2 146 will decide the replacement algorithm to be implemented across SSLC 160 for all accesses from thread 2. SCI N 148 will decide the replacement algorithm to be implemented across SSLC 160 for all accesses from thread N. The information for choosing the winning policy for the thread (referred to as the dynamic algorithm bit(s) (DAB)) is communicated to the physical core for thread 0 on a response/return message to the core. On receiving this response message, the core will store and transmit the respective DAB on all future command messages to any shared cache instance. If the shared cache instance that receives a message including the DAB is the DSCI for that particular thread, then the DAB is disregarded. Otherwise, the SCI will read the DAB and perform the algorithm indicated in the message. This delegated dynamic cache arrangement absolves each shared cache instance (SCI) of the responsibility of identifying the best performing algorithm for each thread. Instead, the responsibility for identifying the best performing algorithm for a thread (e.g., thread 0) is delegated to a particular SCI (e.g., SCI 0 142 for thread 0 in this example). However, the delegated dynamic cache arrangement is not restricted to having one DSCI per core; other mappings of core(s)/thread(s) to DSCI(s) can also be used. As one example, core 0 102 may have set-dueling hardware in two different delegated shared cache instances (e.g., SCI 0 142 and SCI 1 144). Similarly, core 1 104 may have set-dueling hardware in two other delegated shared cache instances (e.g., SCI 1 144 and SCI 2 146). In this manner, a single thread may rely upon two delegated shared cache instances (DSCIs) for identifying the best cache replacement algorithm for the thread. Other mappings among cores/threads and DSCIs may also be used. As an example, threads may share a delegated shared cache instance. Thus, both thread 0 and thread 1 may share one SCI (e.g., SCI 0 142).
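The DAB round trip described above might be modeled on the core side as follows. The class, field names, and dictionary-based messages are illustrative stand-ins for the actual response and command message formats.

```python
class CoreDabState:
    """Per-core storage of the winning-policy bit(s) for each thread."""

    def __init__(self, num_threads):
        self.dab = [0] * num_threads    # latched winner per thread

    def on_response(self, thread_id, response):
        # Responses carry the DSCI's current winning-policy bit(s).
        if "dab" in response:
            self.dab[thread_id] = response["dab"]

    def build_command(self, thread_id, addr):
        # Every outbound command carries the latched DAB in its metadata
        # so that non-delegated SCIs can follow the winning policy.
        return {"addr": addr, "thread": thread_id, "dab": self.dab[thread_id]}
```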
The communication of the best algorithm is choreographed by the movement of the DABs across the system. In one example, to minimize the overhead of this communication (e.g., the new policy has to be communicated to the core and then transmitted to the SCIs), the DSCI for a particular thread/core is chosen by closest physical proximity. It is understood in this example, though, that some requests already in flight from the core to other SCIs carrying a stale DAB will be installed sub-optimally.
In one example, the DAB is included as part of the metadata portion of the outbound request. If the answer to the query in step 204 is yes, then in step 208, the DAB is disregarded. Outbound requests, inbound requests, and other messaging may be implemented using a cache protocol, such as the coherent hub interface (CHI) protocol offered by ARM. Other messaging protocols and associated functionality may also be used.
Next, in step 210, the SCI circuitry for the SCI that received the outbound request determines whether the targeted set is a leader set. If the answer is no, then in step 212, the replacement algorithm defined by the dueling counter is implemented for that SCI with respect to the outbound request sent by the core. On the other hand, if the answer is yes, then in step 214, the static policy defined by the leader set is implemented for that SCI. Finally, in step 216, the SCI circuitry sends to the core the response along with the internal DAB policy state. Although this flow chart shows a certain number of steps being performed in a certain order, additional or fewer steps may be performed in a different order.
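The decision logic of steps 204 through 216 can be condensed into a small policy-selection function. The branch in which a non-delegated SCI obeys the DAB carried by the request reflects one plausible reading of the flow described above; all names are illustrative.

```python
def choose_policy(is_dsci, is_leader_set, leader_policy, counter_policy,
                  request_dab):
    """Pick the replacement policy for one request at one SCI."""
    if is_leader_set:
        return leader_policy      # step 214: leader sets stay static
    if is_dsci:
        return counter_policy     # steps 208/212: disregard the DAB and
                                  # trust the local dueling counter
    return request_dab            # follower set at a non-DSCI: obey the DAB

# Example: a non-delegated SCI receiving a follower-set access whose
# DAB indicates policy "B" implements policy "B".
assert choose_policy(False, False, "A", "A", "B") == "B"
```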
Next, in step 414, the logic associated with the SCI determines whether the most significant bit (MSB) of the dueling counter changed. If the answer is no, then there is no update to the DAB, and as part of step 416, the old DAB value is returned to the thread via the response message. If, however, the answer is yes, then the DAB is updated with the new state, and as part of step 418, the new DAB is returned to the thread via the response message. Although this flow chart shows a certain number of steps being performed in a certain order, additional or fewer steps may be performed in a different order.
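A sketch of this MSB check follows, assuming an illustrative 10-bit dueling counter:

```python
CTR_BITS = 10

def dab_for_response(counter_before, counter_after, old_dab):
    """Return the DAB to send back to the thread (steps 414-418)."""
    msb_before = (counter_before >> (CTR_BITS - 1)) & 1
    msb_after = (counter_after >> (CTR_BITS - 1)) & 1
    if msb_before == msb_after:
        return old_dab    # step 416: MSB unchanged, return the old DAB
    return msb_after      # step 418: MSB flipped, return the updated DAB
```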
Layout 510 corresponds to a layout of leader sets and follower sets for a delegated shared cache instance (e.g., SCI 0) from the perspective of thread 0. Layout 520 corresponds to a layout of leader sets and follower sets for a delegated shared cache instance (e.g., SCI 1) from the perspective of thread 1. The legend in FIG. 5 identifies the leader sets assigned to policy A, the leader sets assigned to policy B, and the follower sets.
As shown in layout 510, from thread 0's perspective, any time there is an access to set 0, set 2, set 4, set 6, set 8, set 10, set 12, or set 14, the cache replacement algorithm per policy A will be used. From thread 0's perspective, any time there is an access to set 1, set 3, set 5, set 7, set 9, set 11, set 13, or set 15, the cache replacement algorithm per policy B will be used. Any access to sets 16 to 31 (the follower sets in this example) would result in the implementation of the cache replacement policy determined by the dueling counter. As shown in layout 520, from thread 1's perspective, any time there is an access to set 0, set 2, set 4, set 6, set 8, set 10, set 12, or set 14, the cache replacement algorithm per policy A will be used. From thread 1's perspective, any time there is an access to set 1, set 3, set 5, set 7, set 9, set 11, set 13, or set 15, the cache replacement algorithm per policy B will be used. Any access to sets 16 to 31 (the follower sets in this example) would result in the implementation of the cache replacement policy determined by the dueling counter. The winning cache replacement policy is sent as part of the DAB included in the command messages to the shared cache instances. As explained earlier, when core0/thread0 accesses SCI 0, the winning policy bit (DAB) is collected and stored. Similarly, when core0/thread0 accesses a different SCI, the DAB with the winning policy information is sent to that SCI to implement it.
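Layouts 510 and 520 amount to a simple set-classification rule, sketched below with the 32-set example sizes (illustrative only):

```python
def classify_set(set_idx):
    """Classify a set in a thread's own DSCI per layouts 510/520."""
    if set_idx < 16:
        return "A" if set_idx % 2 == 0 else "B"   # leader sets 0-15
    return "follower"   # sets 16-31 use the dueling counter's winner
```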
As before, policies A and B may be the two component policies of a dynamic replacement algorithm, such as dynamic re-reference interval prediction (DRRIP). Thus, policy A may be a static RRIP (SRRIP) policy that uses a fixed value for the re-reference interval across the shared cache instance. Policy B may be a bimodal RRIP (BRRIP) policy that inserts certain cache blocks with a distant re-reference interval prediction and inserts certain other cache blocks with a long re-reference interval prediction. The choice between the two insertion predictions may be made probabilistically, such that one of the two is chosen less frequently than the other.
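For illustration, the two candidate insertion behaviors might look as follows, assuming 2-bit re-reference prediction values (RRPV) and an illustrative 1/32 bimodal probability:

```python
import random

RRPV_DISTANT = 3    # distant re-reference prediction (evicted soonest)
RRPV_LONG = 2       # long re-reference prediction

def srrip_insertion_rrpv():
    """Policy A: static RRIP always inserts with the same fixed value."""
    return RRPV_LONG

def brrip_insertion_rrpv(long_probability=1 / 32):
    """Policy B: bimodal RRIP usually inserts distant, occasionally long."""
    if random.random() < long_probability:
        return RRPV_LONG
    return RRPV_DISTANT
```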
As shown in layout 540, from thread 0's perspective, any time there is an access to set 0, set 4, set 8, set 12, set 16, set 20, set 24, or set 28 of SCI 0, the cache replacement algorithm per policy A will be used. From thread 0's perspective, any time there is an access to set 1, set 5, set 9, set 13, set 17, set 21, set 25, or set 29 of SCI 0, the cache replacement algorithm per policy B will be used. As shown in layout 540, from thread 1's perspective, any time there is an access to set 2, set 6, set 10, set 14, set 18, set 22, set 26, or set 30 of SCI 0, the cache replacement algorithm per policy A will be used. From thread 1's perspective, any time there is an access to set 3, set 7, set 11, set 15, set 19, set 23, set 27, or set 31 of SCI 0, the cache replacement algorithm per policy B will be used. Any access to the sets beyond set 31 (the follower sets in this example) would result in the implementation of the cache replacement policy determined by a respective dueling counter. Since there is a dueling counter per thread, in this example two dueling counters (one for thread 0 and another for thread 1) are being used.
As shown in layout 550, from thread 2's perspective, any time there is an access to set 0, set 4, set 8, set 12, set 16, set 20, set 24, or set 28 of SCI 1, the cache replacement algorithm per policy A will be used. From thread 2's perspective, any time there is an access to set 1, set 5, set 9, set 13, set 17, set 21, set 25, or set 29 of SCI 1, the cache replacement algorithm per policy B will be used. As shown in layout 550, from thread 3's perspective, any time there is an access to set 2, set 6, set 10, set 14, set 18, set 22, set 26, or set 30 of SCI 1, the cache replacement algorithm per policy A will be used. From thread 3's perspective, any time there is an access to set 3, set 7, set 11, set 15, set 19, set 23, set 27, or set 31 of SCI 1, the cache replacement algorithm per policy B will be used. Any access to the sets beyond set 31 (the follower sets in this example) would result in the implementation of the cache replacement policy determined by a respective dueling counter. Since there is a dueling counter per thread, in this example two dueling counters (one for thread 2 and another for thread 3) are being used. The winning cache replacement policy is sent as part of the DAB included in the command messages to the shared cache instances. As explained earlier, when core0/thread0 accesses SCI 0, the winning policy bit (DAB) is collected and stored. Similarly, when core0/thread0 accesses a different SCI, the DAB with the winning policy information is sent to that SCI to implement it.
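Layouts 540 and 550 interleave the leader sets of the two threads sharing a DSCI with a period of four, which can be expressed as a classification rule (sizes illustrative):

```python
def classify_set(set_idx, local_thread):
    """Classify a set in a shared DSCI per layouts 540/550.
    local_thread is 0 or 1: the thread's index within the sharing pair."""
    if set_idx >= 32:
        return "follower"            # policy from that thread's counter
    owner = (set_idx >> 1) & 1       # sets 0,1 (mod 4) -> thread 0; 2,3 -> thread 1
    if owner != local_thread:
        return "other-thread-leader" # static duel set for the other thread
    return "A" if set_idx % 2 == 0 else "B"
```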
As before, policies A and B may be the two component policies of a dynamic replacement algorithm, such as dynamic re-reference interval prediction (DRRIP). Thus, policy A may be a static RRIP (SRRIP) policy that uses a fixed value for the re-reference interval across the shared cache instance. Policy B may be a bimodal RRIP (BRRIP) policy that inserts certain cache blocks with a distant re-reference interval prediction and inserts certain other cache blocks with a long re-reference interval prediction. The choice between the two insertion predictions may be made probabilistically, such that one of the two is chosen less frequently than the other.
As shown in layouts 560 and 570, from thread 0's perspective, any time there is an access to set 0, set 4, set 8, or set 12 of SCI 0 or SCI 1, the cache replacement algorithm per policy A will be used. From thread 0's perspective, any time there is an access to set 1, set 5, set 9, or set 13 of SCI 0 or SCI 1, the cache replacement algorithm per policy B will be used. As shown in layouts 560 and 570, from thread 1's perspective, any time there is an access to set 2, set 6, set 10, or set 14 of SCI 0 or SCI 1, the cache replacement algorithm per policy A will be used. From thread 1's perspective, any time there is an access to set 3, set 7, set 11, or set 15 of SCI 0 or SCI 1, the cache replacement algorithm per policy B will be used. Any access to the sets beyond set 31 (the follower sets in this example, which are not shown) would result in the implementation of the cache replacement policy determined by a respective dueling counter. Since there is a dueling counter per thread, in this example four dueling counters (one for thread 0, one for thread 1, one for thread 2, and one for thread 3) are being used.
As shown in layouts 560 and 570, from thread 2's perspective, any time there is an access to set 16, set 20, set 24, or set 28 of SCI 0 or SCI 1, the cache replacement algorithm per policy A will be used. From thread 2's perspective, any time there is an access to set 17, set 21, set 25, or set 29 of SCI 0 or SCI 1, the cache replacement algorithm per policy B will be used. As shown in layouts 560 and 570, from thread 3's perspective, any time there is an access to set 18, set 22, set 26, or set 30 of SCI 0 or SCI 1, the cache replacement algorithm per policy A will be used. From thread 3's perspective, any time there is an access to set 19, set 23, set 27, or set 31 of SCI 0 or SCI 1, the cache replacement algorithm per policy B will be used. Any access to the sets beyond set 31 (the follower sets in this example, which are not shown) would result in the implementation of the cache replacement policy determined by a respective dueling counter. Since there is a dueling counter per thread, in this example four dueling counters (one for thread 0, one for thread 1, one for thread 2, and one for thread 3) are being used. The winning cache replacement policy is sent as part of the DAB included in the command messages to the shared cache instances. As explained earlier, when core0/thread0 accesses SCI 0 or SCI 1, the winning policy bit (DAB) is collected and stored. Similarly, when core0/thread0 accesses a different SCI, the DAB with the winning policy information is sent to that SCI to implement it.
In this example, the inbound request message is also in a packet form and includes a number of bits. Those bits include bits labeled as: REQ.ADDR/ATTR, REQ.DAB[N:0], and REQ.THREADINFO.
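The named fields might be modeled as follows; the Python representation and field widths are assumptions rather than the actual packet format:

```python
from dataclasses import dataclass

@dataclass
class InboundRequest:
    addr_attr: int     # REQ.ADDR/ATTR: address and attribute bits
    dab: int           # REQ.DAB[N:0]: winning-policy bit(s) for the thread
    thread_info: int   # REQ.THREADINFO: identifies the requesting thread
```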
Step 720 includes the shared cache instance implementing the at least one of the two cache algorithms specified by the policy information received as part of the request associated with the thread unless the shared cache instance is identified as a delegated shared cache instance, from among the shared cache instances, for determining a winner between the at least two cache algorithms for use with any requests associated with the thread.
Step 820 includes delegating another shared cache instance as a second delegated shared cache instance for determining the winner between the at least two cache algorithms for any access requests associated with the thread. In one example, the winner is determined using a second set-dueling counter associated with the second delegated shared cache instance. As explained earlier, the set-dueling counter is incremented or decremented if the request associated with the thread is accessing a leader set for the delegated shared cache instance and the request results in a cache miss. As an example, steps 404, 408, and 412 described earlier with respect to FIG. 4 may be used as part of this step.
Step 830 includes communicating policy information specifying the winner between the at least two cache algorithms to each of the plurality of cores. In one example, as described earlier, the winner is communicated as the dynamic algorithm bit(s) (DAB) carried on a response/return message to the core; the core then stores and transmits the DAB on future command messages to the shared cache instances.
Step 840 includes a shared cache instance, from among the plurality of shared cache instances, upon receiving a request for cache access associated with the thread, implementing one of the at least two cache algorithms specified by the policy information received as part of the request for the cache access unless the shared cache instance receiving the request is identified as the first delegated shared cache instance or the second delegated shared cache instance.
In conclusion, the present disclosure relates to a method for selecting a cache algorithm in a system having a plurality of cores and a plurality of shared cache instances accessible to any of the plurality of cores, where the system is configurable to execute threads. The method may include a shared cache instance, from among the plurality of shared cache instances, receiving a request associated with a thread, where the request comprises policy information for specifying at least one of two cache algorithms for implementation by the shared cache instance for any requests associated with the thread.
The method may further include the shared cache instance implementing the at least one of the two cache algorithms specified by the policy information received as part of the request associated with the thread unless the shared cache instance is identified as a delegated shared cache instance, from among the shared cache instances, for determining a winner between the two cache algorithms for use with any requests associated with the thread.
The method may further comprise disregarding the policy information when the shared cache instance is identified as the delegated shared cache instance. The method may further comprise implementing the winner between the at least two cache algorithms for the delegated shared cache instance if the request associated with the thread is not accessing a leader set.
The method may further comprise implementing a policy specified by leader sets for the delegated shared cache instance if the request associated with the thread is accessing a leader set. The winner is determined using a set-dueling counter, and the method may further comprise, based on one of a cache hit determination or a cache miss determination, incrementing or decrementing the set-dueling counter if the request associated with the thread is accessing a leader set.
The method may further comprise updating the policy information received as part of the request associated with the thread if the set-dueling counter reaches a predetermined state. The method may further comprise returning, as part of a response message from the shared cache instance, updated policy information to each of the plurality of cores to ensure any future requests from the thread comprise the updated policy information for specifying the at least one of the two cache algorithms for implementation by the shared cache instance.
In another example, the present disclosure relates to a system having a plurality of cores and a plurality of shared cache instances accessible to any of the plurality of cores, where the system is configurable to execute threads. The system may include a shared cache instance, from among the plurality of shared cache instances, to receive a request associated with a thread, where the request comprises policy information for specifying at least one of two cache algorithms for implementation by the shared cache instance for any requests associated with the thread.
The system may further include shared cache instance circuitry, associated with the shared cache instance, configured to process the policy information received as part of the request associated with the thread. The shared cache instance circuitry may further be configured to instruct the shared cache instance to implement the at least one of the two cache algorithms unless the shared cache instance is identified by the shared cache instance circuitry as a delegated shared cache instance, from among the shared cache instances, for determining a winner between the at least two cache algorithms for use with any requests associated with the thread.
The system may further be configured to disregard the policy information when the shared cache instance is identified as the delegated shared cache instance. The system may further be configured to implement the winner between the two cache algorithms for the delegated shared cache instance if the request associated with the thread is not accessing a leader set.
The system may further be configured to implement a policy specified by leader sets for the delegated shared cache instance if the request associated with the thread is accessing a leader set. The winner is determined using a set-dueling counter, and the system may further be configured to, based on one of a cache hit determination or a cache miss determination, increment or decrement the set-dueling counter if the request associated with the thread is accessing a leader set.
The system may further be configured to update the policy information received as part of the request associated with the thread if the set-dueling counter reaches a predetermined state. The system may further be configured to return, as part of a response message from the shared cache instance, updated policy information to each of the plurality of cores to ensure any future requests associated with the thread comprise the updated policy information for specifying the at least one of the two cache algorithms for implementation by the shared cache instance.
In yet another example, the present disclosure relates to a method for selecting a cache algorithm in a system having a plurality of cores and a plurality of shared cache instances accessible to any of the plurality of cores, where the system is configurable to execute threads. The method may include designating a shared cache instance as a first delegated shared cache instance for determining a winner between at least two cache algorithms for access requests associated with a thread. The method may further include delegating another shared cache instance as a second delegated shared cache instance for determining the winner between the at least two cache algorithms for access requests associated with the thread.
The method may further include communicating policy information specifying the winner between the at least two cache algorithms to each of the plurality of cores. The method may further include a shared cache instance, from among the plurality of shared cache instances, upon receiving a request for cache access associated with the thread, implementing one of the at least two cache algorithms specified by the policy information received as part of the request for the cache access unless the shared cache instance receiving the request is identified as the first delegated shared cache instance or the second delegated shared cache instance.
The winner is determined using a first set-dueling counter associated with the first delegated shared cache instance or using a second set-dueling counter associated with the second delegated shared cache instance. The winner is determined using a first set-dueling counter associated with the first delegated shared cache instance, and the method may further comprise, based on one of a cache hit determination or a cache miss determination, incrementing or decrementing the first set-dueling counter if the request associated with the thread is accessing a leader set for the first delegated shared cache instance.
The method may further comprise implementing a policy specified by leader sets for the first delegated shared cache instance if the request associated with the thread is accessing a leader set. The winner is determined using a second set-dueling counter associated with the second delegated shared cache instance, and the method may further comprise, based on one of a cache hit determination or a cache miss determination, incrementing or decrementing the second set-dueling counter if the request associated with the thread is accessing a leader set for the second delegated shared cache instance. The method may further comprise implementing a policy specified by leader sets for the second delegated shared cache instance if the request associated with the thread is accessing a leader set.
It is to be understood that the methods, modules, and components depicted herein are merely exemplary. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), and Complex Programmable Logic Devices (CPLDs). In an abstract, but still definite sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or inter-medial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “coupled,” to each other to achieve the desired functionality. Merely because a component, which may be an apparatus, a structure, a system, or any other implementation of a functionality, is described herein as being coupled to another component does not mean that the components are necessarily separate components. As an example, a component A described as being coupled to another component B may be a sub-component of the component B, the component B may be a sub-component of the component A, or components A and B may be a combined sub-component of another component C.
The functionality associated with some examples described in this disclosure can also include instructions stored in non-transitory media. The term “non-transitory media” as used herein refers to any media storing data and/or instructions that cause a machine to operate in a specific manner. Exemplary non-transitory media include non-volatile media and/or volatile media. Non-volatile media include, for example, a hard disk, a solid-state drive, a magnetic disk or tape, an optical disk or tape, a flash memory, an EPROM, NVRAM, PRAM, or other such media, or networked versions of such media. Volatile media include, for example, dynamic memory such as DRAM, SRAM, a cache, or other such media. Non-transitory media is distinct from, but can be used in conjunction with, transmission media. Transmission media is used for transferring data and/or instructions to or from a machine. Exemplary transmission media include coaxial cables, fiber-optic cables, copper wires, and wireless media, such as radio waves.
Furthermore, those skilled in the art will recognize that boundaries between the functionality of the above-described operations are merely illustrative. The functionality of multiple operations may be combined into a single operation, and/or the functionality of a single operation may be distributed in additional operations. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
Although the disclosure provides specific examples, various modifications and changes can be made without departing from the scope of the disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure. Any benefits, advantages, or solutions to problems that are described herein with regard to a specific example are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.
Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles.
Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements.