This application relates generally to microprocessor technology including, but not limited to, methods, systems, and devices for controlling cache prefetching in a processor cluster having multiple processors based on congestion levels of the processor cluster.
Cache prefetching is applied in a microprocessor of a computer system to fetch instructions and data to be used from a slower memory or cache to a faster local cache to enhance execution performance of the microprocessor. Aggressive cache prefetching may provide a significant performance uplift for the microprocessor at a risk of causing cache pollution in the faster local cache that often has a limited capacity. In the context of a processor cluster (i.e., a multicore microprocessor), a large amount of traffic exists to facilitate regular memory accesses required by operations of individual processor units, which makes it difficult for the processor cluster to spare additional bandwidth to manage cache prefetching for the processor units. Cache prefetching can easily conflict with the regular memory accesses required by the operations of the processors. As such, it would be highly desirable to provide an electronic device or system that manages cache prefetching efficiently for a processor cluster having multiple processors.
Various implementations of systems, methods and devices within the scope of the appended claims each have several aspects, no single one of which is solely responsible for the attributes described herein. Without limiting the scope of the appended claims, after considering this disclosure, and particularly after considering the section entitled “Detailed Description” one will understand how the aspects of some implementations are used to monitor multiple cluster and system congestion levels and control cache prefetching in a processor cluster based on the monitored congestion levels. In some implementations, an electronic device is provided with a cache, a processing cluster having one or more processors, and prefetch throttling circuitry that is configured to determine a cluster congestion level of the processing cluster based on an extent to which data retrieval requests sent from the processors to the cache are not satisfied by the cache and control prefetch requests to the cache in accordance with a determination whether the cluster congestion level of the processing cluster satisfies predefined congestion criteria. In some implementations, an electronic device is provided with first memory, second memory, a plurality of processing clusters, and prefetch throttling circuitry that is configured to cause a respective processing cluster to limit prefetch requests from the respective processing cluster based on a system congestion level associated with the first memory and/or the second memory.
In one aspect, an electronic device includes a first processing cluster, a cache, and prefetch throttling circuitry. The first processing cluster further includes one or more processors. The cache is coupled to the one or more processors in the first processing cluster, and is configured to receive, from the one or more processors in the first processing cluster, a plurality of data retrieval requests including demand requests and prefetch requests. The prefetch throttling circuitry is coupled to the one or more processors in the first processing cluster, and is configured to determine a congestion level of the first processing cluster based on an extent to which the plurality of data retrieval requests sent from the one or more processors in the first processing cluster to the cache are not satisfied by the cache. The prefetch throttling circuitry is further configured to in accordance with a determination that the congestion level of the first processing cluster satisfies first congestion criteria that require that the congestion level of the first processing cluster is above a first cluster congestion threshold, cause a first respective processor of the one or more processors to limit prefetch requests to the cache to prefetch requests of at least a first threshold quality. The prefetch throttling circuitry is further configured to in accordance with a determination that the congestion level of the first processing cluster does not satisfy the first congestion criteria, forgo causing the one or more processors to limit prefetch requests to the cache to prefetch requests of at least the first threshold quality.
Further, in another aspect of the invention, an electronic device includes a plurality of processing clusters, first memory (e.g., a system cache coupled to the processing clusters), second memory (e.g., DRAM memory coupled to the system cache), and prefetch throttling circuitry. Each processing cluster further includes one or more respective processors. The first memory is coupled to the plurality of processing clusters, and the second memory is coupled to the plurality of processing clusters. The second memory is configured to receive data retrieval requests sent from the plurality of processing clusters to the first memory that are not satisfied by the first memory. The prefetch throttling circuitry is coupled to the one or more respective processors in each of the plurality of processing clusters. The electronic device is configured to obtain a current congestion level of the first memory based on a number of outstanding in-flight requests received by the first memory, and maintain a first congestion level history that includes the obtained current congestion level of the first memory. The electronic device is also configured to obtain a current congestion level of the second memory based on a number of outstanding in-flight requests received by the second memory, and maintain a second congestion level history that includes the obtained current congestion level of the second memory. The prefetch throttling circuitry is configured to cause a respective processing cluster to limit prefetch requests from the respective processing cluster based on at least one of the obtained current congestion level of the first memory and the obtained current congestion level of the second memory.
These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there. Other implementations and advantages may be apparent to those skilled in the art in light of the descriptions and drawings in this specification.
For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details.
In some implementations, memory modules 104 (e.g., memory 104 in
In some implementations, system module 100 further includes one or more components selected from:
It is noted that communication buses 140 also interconnect and control communications among various system components including components 110-122.
Further, one skilled in the art knows that other non-transitory computer readable storage media can be used, as new data storage technologies are developed for storing information in the non-transitory computer readable storage media in the memory modules 104 and in SSDs 112. These new non-transitory computer readable storage media include, but are not limited to, those manufactured from biological materials, nanowires, carbon nanotubes and individual molecules, even though the respective data storage technologies are currently under development and yet to be commercialized.
In some implementations, SoC 102 is implemented on an integrated circuit that integrates one or more microprocessors or central processing units, memory, input/output ports and secondary storage on a single substrate. SoC 102 is configured to receive one or more internal supply voltages provided by PMIC 118. In some implementations, both the SoC 102 and PMIC 118 are mounted on a main logic board, e.g., on two distinct areas of the main logic board, and electrically coupled to each other via conductive wires formed in the main logic board. As explained above, this arrangement introduces parasitic effects and electrical noise that could compromise performance of the SoC, e.g., cause a voltage drop at an internal voltage supply. Alternatively, in some implementations, SoC 102 and PMIC 118 are vertically arranged in an integrated semiconductor device, such that they are electrically coupled to each other via electrical connections that are not formed in the main logic board. Such vertical arrangement of SoC 102 and PMIC 118 can reduce a length of electrical connections between SoC 102 and PMIC 118 and avoid performance degradation caused by the conductive wires of the main logic board. In some implementations, vertical arrangement of SoC 102 and PMIC 118 is facilitated in part by integration of thin film inductors in a limited space between SoC 102 and PMIC 118.
In an example, first processing cluster 202-1 includes first processor 204-1, ..., N-th processor 204-N, first cluster cache 212-1, and first throttler 216-1, where N is an integer greater than 1. First cluster cache 212-1 has one or more first request queues 214-1, and each first request queue includes a queue of demand requests and prefetch requests received from a subset of processors 204 of first processing cluster 202-1. In some embodiments, SoC 102 only includes a single processing cluster 202-1. Alternatively, in some embodiments, SoC 102 includes at least an additional processing cluster 202, e.g., M-th processing cluster 202-M. M-th processing cluster 202-M includes first processor 206-1, ..., N′-th processor 206-N′, M-th cluster cache 212-M, and M-th throttler 216-M, where N′ is an integer greater than 1 and M-th cluster cache 212-M has one or more M-th request queues 214-M.
In some implementations, the one or more processing clusters 202 are configured to provide a central processing unit for an electronic device and are associated with a hierarchy of caches. For example, the hierarchy of caches includes three levels that are distinguished based on their distinct operational speeds and sizes. For the purposes of this application, a reference to “the speed” of a memory (including a cache memory) relates to the time required to write data to or read data from the memory (e.g., a faster memory has shorter write and/or read times than a slower memory), and a reference to “the size” of a memory relates to the storage capacity of the memory (e.g., a smaller memory provides less storage space than a larger memory). The core cache 218, cluster cache 212, and cache 220 correspond to a first level (L1) cache, a second level (L2) cache, and a third level (L3) cache, respectively. Each core cache 218 holds instructions and data to be executed directly by a respective processor 204, and has the fastest operational speed and smallest size among the three levels of memory. For each processing cluster 202, the cluster cache 212 is slower operationally than the core cache 218 and bigger in size, and holds data that is more likely to be accessed by processors 204 of respective processing cluster 202. The cache 220 is shared by the plurality of processing clusters 202, and bigger in size and slower in speed than each core cache 218 and cluster cache 212. In each processing cluster 202, respective throttler 216 monitors a system congestion level associated with memory accesses to cache 220 and memory 104 and a local cluster congestion level associated with cluster cache 212, and controls prefetches of instructions and data to core caches 218 and/or cluster cache 212 based on the system and/or cluster congestion levels. 
Each individual processor 204 further monitors a processor congestion level to control prefetches of instructions and data from respective cluster cache 212 into respective individual core cache 218.
In some implementations, first cluster cache 212-1 of first processing cluster 202-1 is coupled to a single processor 204-1 in the same processing cluster, and not to any other processors (e.g., 204-N). In some implementations, first cluster cache 212-1 of first processing cluster 202-1 is coupled to multiple processors 204-1 and 204-N in the same processing cluster. In some implementations, first cluster cache 212-1 of first processing cluster 202-1 is coupled to the one or more processors 204 in the same processing cluster 202-1, and not to processors in any cluster other than the first processing cluster 202-1 (e.g., processors 206 in cluster 202-M). In such cases, first cluster cache 212-1 of first processing cluster 202-1 is sometimes referred to as a second-level cache.
In each processing cluster 202, each request queue 214 optionally includes a queue of demand requests and prefetch requests received from a subset of processors 204 of respective processing cluster 202. Each data retrieval request received from respective processor 204 is distributed to one of request queues 214. In some implementations, a request queue 214 receives only requests received from a specific processor 204. In some implementations, a request queue 214 receives requests from more than one processor 204 in processing cluster 202, allowing a request load to be balanced among the plurality of request queues 214. Specifically, in some situations, a request queue 214 receives only one type of data retrieval requests (e.g., prefetch requests) from different processors 204 in the same processing cluster 202.
Each processing cluster 202 includes or is coupled to one or more prefetchers 208 in processors 204, and the prefetch requests are generated and processed by one or more prefetchers 208. In some implementations, each processor 204 in processing cluster 202 includes or is coupled to a respective prefetcher 208. In some implementations, two or more of processors 204 in processing cluster 202 share the same prefetcher 208.
In each processing cluster 202, cluster cache 212 further includes a throttler 216 (also called prefetch throttling circuitry) that is coupled to an output of cluster cache 212, request queues 214 in cluster cache 212, and one or more processors 204 of processing cluster 202. On a cluster level, throttler 216 monitors a local cluster congestion level of corresponding processing cluster 202 based on signals received from request queues 214. Specifically, throttler 216 determines a congestion level of processing cluster 202 based on an extent to which the plurality of data retrieval requests sent from one or more processors 204 in processing cluster 202 to cluster cache 212 are not satisfied by cluster cache 212. In accordance with a determination that the congestion level of processing cluster 202 satisfies first congestion criteria that require that the congestion level of processing cluster 202 is above a first cluster congestion threshold, throttler 216 causes a first respective processor (e.g., processor 204-1) of one or more processors 204 to limit prefetch requests to cluster cache 212 to prefetch requests of at least a first threshold quality (i.e., to limit the prefetch requests to high quality prefetches). Specifically, in an example, throttler 216 transmits a signal or other information to processors 204 (e.g., prefetcher 208-1 in processor 204-1) to enable prefetch throttling, so that only prefetch requests of at least the first threshold quality are sent to cluster cache 212. This optionally corresponds to a second prefetch throttling mode M2, which is different from a first prefetch throttling mode M1 and limits prefetching by processors 204 from cluster cache 212 to prefetch requests of at least the first threshold quality 304 in
Alternatively, in accordance with a determination that the congestion level of processing cluster 202 does not satisfy the first congestion criteria (e.g., the congestion level of processing cluster 202 is below the first cluster congestion threshold), throttler 216 forgoes causing the one or more processors to limit prefetch requests to cluster cache 212 to prefetch requests of at least the first threshold quality. For example, throttler 216 forgoes causing processors 204 to limit prefetch requests to cluster cache 212 entirely, such that no prefetch requests, of any quality, are limited. This optionally corresponds to the first prefetch throttling mode M1, in which prefetching of processors 204 from cluster cache 212 is not limited by throttler 216 as explained with reference to
In some implementations, a congestion level below the first cluster congestion threshold indicates a low degree of congestion in cluster cache 212, and a congestion level above the first cluster congestion threshold indicates one or more higher degrees of congestion. If the one or more higher degrees of congestion correspond to a single high degree of congestion, the congestion level above the first cluster congestion threshold indicates this high degree of congestion. In contrast, if the one or more higher degrees of congestion correspond to a set of degrees of congestion (e.g., medium, high, and very high), the congestion level above the first cluster congestion threshold is associated with any degree in the set of degrees of congestion. More details on cluster congestion thresholds are discussed below with reference to
Further, in some implementations, on a system level, throttler 216 monitors a system congestion level of a memory system coupled to processing cluster 202 based on a system busy level signal received from the output of cluster cache 212. The system busy level signal includes information of outstanding in-flight requests that are received and not satisfied by cache 220 or memory 104. Specifically, throttler 216 obtains a current congestion level of cache 220 based on a number of outstanding in-flight requests received by cache 220, and maintains a first congestion level history (e.g., a history 402 in
In some implementations, in accordance with a determination that the congestion level of processing cluster 202 satisfies second congestion criteria, different from the first congestion criteria, that require that the congestion level of processing cluster 202 is above a second cluster congestion threshold 308 that is above the first cluster congestion threshold 302, throttler 216 causes the first respective processor 204-1 to limit prefetch requests to prefetch requests of at least a second threshold quality 310 that is higher than the first threshold quality 304. In some implementations, if the congestion level of processing cluster 202 is above second cluster congestion threshold 308 (e.g., indicating high congestion as opposed to low or medium congestion), throttler 216 causes at least a respective processor 204 (e.g., first respective processor 204-1) of processing cluster 202 to operate in a third prefetch throttling mode M3 in which prefetching is limited to prefetches of at least the second threshold quality 310 (e.g., allowing only prefetches that are at least very high quality prefetches). In contrast, in the first prefetch throttling mode M1, prefetching is not limited, and in the second prefetch throttling mode M2, prefetching is limited to prefetches of at least the first threshold quality 304 (e.g., allowing prefetches that are at least high quality prefetches).
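By way of non-limiting illustration, the selection among prefetch throttling modes M1, M2, and M3 described above may be sketched in software as follows. This is a minimal sketch rather than the throttling circuitry itself: the names `Mode`, `select_mode`, and `admit_prefetch` are hypothetical, and the numeric encodings of the congestion level and of prefetch quality are assumptions made only for this example.

```python
from enum import Enum


class Mode(Enum):
    M1 = 1  # prefetching is not limited
    M2 = 2  # only prefetches of at least the first threshold quality (304)
    M3 = 3  # only prefetches of at least the second threshold quality (310)


def select_mode(congestion, first_threshold, second_threshold):
    """Map a cluster congestion level to a throttling mode.

    first_threshold and second_threshold stand in for cluster congestion
    thresholds 302 and 308; the numeric scale is an assumption.
    """
    if congestion > second_threshold:
        return Mode.M3
    if congestion > first_threshold:
        return Mode.M2
    return Mode.M1


def admit_prefetch(mode, quality, first_quality, second_quality):
    """Return True if a prefetch of the given quality may be sent to the cache."""
    if mode is Mode.M1:
        return True
    if mode is Mode.M2:
        return quality >= first_quality
    return quality >= second_quality
```

For example, with thresholds of 10 and 20, a congestion level of 15 selects M2, in which a prefetch of quality 0.6 is admitted when the first threshold quality is 0.5 but would be rejected in M3 when the second threshold quality is 0.9.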
In some implementations, in accordance with a determination that the congestion level of processing cluster 202 satisfies third congestion criteria, throttler 216 causes the first respective processor 204-1 to forgo transmitting (312) prefetch requests to the cache entirely, e.g., without regard to a quality of a requested prefetch. Stated another way, if the third congestion criteria are satisfied, throttler 216 causes at least a respective processor 204 of processing cluster 202 to operate in a fourth prefetch throttling mode M4 (also called a throttle all mode). In some implementations, in the fourth prefetch throttling mode M4, all prefetching is disabled, i.e., no prefetching is implemented for cluster cache 212 or corresponding core caches 218.
Additionally, in some implementations, the third congestion criteria include (1) a first requirement that the congestion level of processing cluster 202 is above the second cluster congestion threshold 308 and (2) a second requirement that a system congestion level history 310 of electronic device 200 satisfies a first system congestion condition 316 (e.g., 75% of a system congestion level history is high). The system congestion level history 310 is monitored by throttler 216 based on a system busy level signal received from cache 220, thereby indicating a congestion level of cache 220. For example, the system congestion level history 310 is filled with “H” or “L” based on a plurality of sampled values of the system busy level signal. The first system congestion condition 316 requires that 75% or more of the system congestion level history 310 is filled with “H” to enable the fourth prefetch throttling mode M4 (i.e., the throttle all mode). Conversely, in some embodiments, throttler 216 disables and resets the fourth prefetch throttling mode M4 when a second system congestion condition is satisfied, e.g., when 25% or less of the system congestion level history 310 is filled with “H”.
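By way of non-limiting illustration, the enabling and disabling of the fourth prefetch throttling mode M4 described above behaves as a latch with hysteresis, which may be sketched as follows. The class name `ThrottleAllLatch`, the history depth, and the treatment of not-yet-filled history entries as “L” are assumptions for this example only; the 75% and 25% fractions are the example values given above.

```python
from collections import deque


class ThrottleAllLatch:
    """Tracks sampled system busy levels and latches mode M4 on and off."""

    def __init__(self, depth=8, enable_fraction=0.75, disable_fraction=0.25):
        self.depth = depth
        self.history = deque(maxlen=depth)  # system congestion level history
        self.enable_fraction = enable_fraction
        self.disable_fraction = disable_fraction
        self.active = False  # True while M4 (throttle all) is in effect

    def sample(self, busy_is_high, cluster_above_second_threshold):
        """Record one sample of the system busy level and update M4."""
        # A full deque drops its oldest entry on append.
        self.history.append("H" if busy_is_high else "L")
        # Assumption: unfilled history slots count as "L".
        high_fraction = self.history.count("H") / self.depth
        if self.active:
            # Second system congestion condition: disable and reset M4.
            if high_fraction <= self.disable_fraction:
                self.active = False
        elif cluster_above_second_threshold and high_fraction >= self.enable_fraction:
            # Both requirements of the third congestion criteria are met.
            self.active = True
        return self.active
```

Because the enable and disable fractions differ, M4 in this sketch does not toggle on every sample when the proportion of “H” entries hovers near a single value.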
In some implementations, the extent to which the plurality of data retrieval requests, sent from processors 204 in processing cluster 202 to cluster cache 212, are not satisfied by cluster cache 212 is represented by one or more historical congestion levels for processing cluster 202. The one or more historical congestion levels are maintained in a congestion level history 318 for processing cluster 202. The congestion level of processing cluster 202 is determined based on a portion or all of the one or more historical congestion levels in the congestion level history 318. In an example, each historical congestion level in congestion level history 318 corresponds to a distinct respective period of time and represents the extent to which data retrieval requests were not satisfied by the cache during the respective period of time. The historical congestion level of processing cluster 202 may have been periodically sampled and stored in the congestion level history 318. In some implementations, a respective historical congestion level (or each respective historical congestion level) has a value selected from a predetermined set of congestion level values. For example, where two congestion levels are used, a respective historical congestion level has a first congestion level value (e.g., “low”) or a second congestion level value (e.g., “high”), e.g., defined based on first cluster congestion threshold 302. In another example, where three congestion levels are used, a respective historical congestion level has a first congestion level value (e.g., “low”), or a second congestion level value (e.g., “medium”), or a third congestion level value (e.g., “high”), e.g., defined based on cluster congestion thresholds 302 and 308. One of ordinary skill in the art will recognize that any number of congestion levels may be used, and any number of distinct congestion level values used accordingly.
In some implementations, a current cluster congestion level 318A of processing cluster 202 is determined based on a comparison with congestion level thresholds 302 and 308, and stored into congestion level history 318, e.g., in place of the oldest historic congestion level stored therein. The congestion level of processing cluster 202 is determined based on a portion or all of the congestion level history 318 including the current cluster congestion level 318A of processing cluster 202. For example, in accordance with a determination that the current cluster congestion level (e.g., equal to “high”) 318A is greater than the congestion level of processing cluster 202 (e.g., equal to “medium”), the congestion level of the processing cluster 202 is increased by one level or to the current cluster congestion level 318A. In accordance with a determination that all existing historic congestion levels (e.g., equal to “medium” or “low”) in history 318 are lower than the congestion level of the processing cluster 202 (e.g., equal to “high”), the congestion level of the processing cluster 202 is reduced by one level. Otherwise, the congestion level of the processing cluster 202 does not change. The current cluster congestion level 318A is the most recent cluster congestion level measured based on cluster congestion thresholds 302 and 308. Alternatively, in some embodiments, the first and second cluster congestion thresholds 302 and 308 are applied in conjunction with a historical congestion threshold (e.g., 10% of congestion level history 318). For example, the congestion level of processing cluster 202 satisfies the first congestion criteria if a portion (e.g., 75%) of the congestion level history 318 is above the first cluster congestion threshold 302 (i.e., has a value of “medium” or “high”) and exceeds the historical congestion threshold (e.g., 10%).
It is noted that in some implementations, the congestion level of processing cluster 202 is determined based on an extent to which the plurality of data retrieval requests sent from the one or more processors 204 in processing cluster 202 to cluster cache 212 are not satisfied by the cache 212, without regard to which of the one or more processors 204 sent the plurality of data retrieval requests. That is, the congestion level of processing cluster 202 is determined without regard to an extent to which data retrieval request(s) from a specific processor of the one or more processors 204 are not satisfied by cluster cache 212.
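By way of non-limiting illustration, the rule for updating the congestion level of the processing cluster from the congestion level history may be sketched as follows. The class name `ClusterCongestionTracker` and the history depth are hypothetical. Where the description permits increasing “by one level or to the current cluster congestion level,” this sketch assumes the one-level increase.

```python
from collections import deque

LEVELS = ("low", "medium", "high")


class ClusterCongestionTracker:
    """Keeps a bounded history of sampled levels and a filtered level."""

    def __init__(self, depth=4):
        # Congestion level history, stored as indices into LEVELS.
        self.history = deque(maxlen=depth)
        self.level = 0  # filtered congestion level, starts at "low"

    def record(self, current):
        """Store the current sampled level and update the filtered level."""
        cur = LEVELS.index(current)
        # A full deque drops its oldest entry, i.e. the new sample is
        # stored in place of the oldest historic congestion level.
        self.history.append(cur)
        if cur > self.level:
            self.level += 1  # assumption: raise by one level per sample
        elif all(h < self.level for h in self.history):
            self.level -= 1  # every stored sample is lower: step down once
        return LEVELS[self.level]
```

In this sketch the level rises as soon as a higher sample arrives but falls only after every higher sample has aged out of the history, so a brief lull does not immediately relax throttling.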
In some implementations, determining the congestion level of processing cluster 202 includes comparing the number of data retrieval requests, sent from the one or more processors 204 in processing cluster 202 to cluster cache 212, that are not satisfied by cluster cache 212 (e.g., also called cache misses) to one or more cache miss thresholds. Each cluster congestion threshold 302 and 308 includes a respective cache miss threshold 302′ or 308′. In some implementations, the number of cache misses by processing cluster 202 is compared to the one or more cache miss thresholds 302′ or 308′ to determine a cache miss value (e.g., low, medium, high, etc.), which is taken into account when determining the congestion level of processing cluster 202. For example, if the number of cache misses by processing cluster 202 is below a first cache miss threshold 302′, a first cache miss value (e.g., a low value) is taken into account when determining the congestion level of processing cluster 202. In another example, if the number of cache misses by processing cluster 202 is above the first cache miss threshold 302′, a second cache miss value (e.g., a medium or high value) is taken into account when determining the congestion level of processing cluster 202. In yet another example, if the number of cache misses by processing cluster 202 is above a second cache miss threshold 308′, a third cache miss value (e.g., a high value) is taken into account when determining the congestion level of processing cluster 202. In some implementations, the cache miss value is taken into account in the context of one or more historical congestion levels in a congestion level history 318 for processing cluster 202. In an example, the cache miss value defines the historical congestion levels stored in the congestion level history 318 for processing cluster 202.
Further, in some implementations, the one or more cache miss thresholds (i.e., cache miss thresholds 302′ and 308′) are determined based on a system congestion level (e.g., 410 in
In some implementations, the plurality of data retrieval requests include all data retrieval requests sent from the one or more processors 204 to cluster cache 212 within a predefined period of time, i.e., include all demand requests and all prefetch requests.
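By way of non-limiting illustration, the comparison of cluster cache misses against the cache miss thresholds may be sketched as follows. The `Request` record and the function names are hypothetical, and the threshold values in the usage note are arbitrary. Consistent with the description above, misses are counted over all demand and prefetch requests in a sampling window without regard to which processor sent them.

```python
from typing import Iterable, NamedTuple


class Request(NamedTuple):
    processor_id: int   # which processor sent the request (ignored here)
    is_prefetch: bool   # demand or prefetch request (both are counted)
    hit: bool           # True if the cluster cache satisfied the request


def count_cluster_misses(requests: Iterable[Request]) -> int:
    """Count requests in the window that the cluster cache did not satisfy."""
    return sum(1 for r in requests if not r.hit)


def classify_misses(miss_count, first_miss_threshold, second_miss_threshold):
    """Map a miss count to a cache miss value using two thresholds."""
    if miss_count > second_miss_threshold:
        return "high"
    if miss_count > first_miss_threshold:
        return "medium"
    return "low"
```

For example, a window holding one hit and three misses spread across two processors yields a miss count of 3, which classifies as “medium” against thresholds of 2 and 5.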
In some implementations, throttler 216 determines that a congestion level of a respective processor 204-1 or 204-N is below a processor congestion threshold 336 that is different from the congestion threshold 302 or 308 used for cluster cache 212, regardless of the congestion level of processing cluster 202, and forgoes limiting prefetch requests from respective processor 204-1 or 204-N to cluster cache 212. That is, in these embodiments, the prefetch requests from respective processor 204-1 or 204-N are not limited based on the cluster congestion level and system congestion level, when the congestion level of the respective processor is below the processor congestion threshold 336 (e.g., equal to “L”). Conversely, if the congestion level of respective processor 204-1 or 204-N is above processor congestion threshold 336 (e.g., equal to “H”), the prefetch requests from respective processor 204-1 or 204-N to cluster cache 212 are limited or throttled based on the congestion levels of the processing cluster and system. The congestion level of respective processor 204-1 or 204-N is determined based on an extent to which data retrieval requests sent from the respective processor 204-1 or 204-N to cluster cache 212 are not satisfied by cluster cache 212, e.g., independently of whether data retrieval requests sent to cluster cache 212 from any processors other than the respective processor 204-1 or 204-N are satisfied by cluster cache 212.
Stated another way, in some implementations, the first congestion criteria further require that the congestion level of a respective processor 204 be above processor congestion threshold 336 in order for throttler 216 to limit prefetch requests from the respective processor. In some implementations, the determination whether to limit prefetch requests from a respective processor based on whether the congestion level of the respective processor is above the processor congestion threshold 336 takes priority over other determinations regarding whether to limit prefetch requests (e.g., with respect to the first congestion criteria, second congestion criteria, and/or third congestion criteria concerning the congestion level of processing cluster 202).
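By way of non-limiting illustration, the priority of the per-processor determination over the cluster-level determinations may be sketched as follows. The function name `effective_mode` and the string encodings of the processor congestion level and of the modes are assumptions for this example.

```python
def effective_mode(processor_level, cluster_mode):
    """Return the throttling mode actually applied to one processor.

    processor_level: "L" if the processor's congestion level is below the
        processor congestion threshold, "H" otherwise.
    cluster_mode: the mode ("M1" to "M4") chosen from the cluster and
        system congestion levels.
    """
    if processor_level == "L":
        # The per-processor check takes priority: prefetching from this
        # processor is not limited, whatever the cluster-level mode.
        return "M1"
    return cluster_mode
```

Under this sketch, a lightly loaded processor keeps prefetching freely even while a congested neighbor in the same cluster is throttled.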
In some implementations, throttler 216 maintains a processor congestion level history 334 to store historical congestion levels of each processor 204. The prefetch requests from the respective processor is limited based on the congestion level of processor 204 that is determined based on at least a portion of congestion level history 334 of this processor 204. A current congestion level of processor 204 is recorded and compared with processor congestion threshold 336, and one of a plurality of values (e.g., “L” and “H”) is determined based on a comparison result and stored as a current congestion level 334A in congestion level history 334 of this processor 204 (e.g., in place of the oldest cache miss level in history 334). In accordance with a determination that the current congestion level 334A of processor 204 indicates a higher congestion level than the congestion level of processor 202, the congestion level of processor 202 is increased by one level or to the current congestion level 334A. In accordance with a determination that the entire congestion level history 334 of processor 204 is lower than the congestion level of processor 202, the congestion level of processor 202 is reduced by one level or to the lower congestion level, e.g., from “H” to “L”.
Further, in some implementations, processor congestion threshold 336 includes a processor cache miss threshold 336′. Determining the congestion level of processor 204 includes comparing a number of data retrieval requests, sent from respective processor 204 to cluster cache 212, that are not satisfied by cluster cache 212 (i.e., cache misses) to the processor cache miss threshold 336′. For example, if the number of cache misses for processor 204 is below cache miss threshold 336′, a first cache miss value (e.g., a low value) is taken into account when determining the congestion level of processor 204; if the number of cache misses for processor 204 is above cache miss threshold 336′, a second cache miss value (e.g., a medium or high value) is taken into account when determining the congestion level of processor 204. Specifically, in some implementations, a current cache miss count is determined as the number of data retrieval requests that are not satisfied by cluster cache 212 during a sample duration of time. The current cache miss count is compared with cache miss threshold 336′, and one of a plurality of cache miss values (e.g., “L” and “H”) is determined based on a comparison result and stored as a current cache miss level 334A in congestion level history 334 of this processor 204 (e.g., in place of the oldest cache miss level in history 334). In accordance with a determination that the current cache miss level 334A of processor 204 indicates a higher congestion level than the congestion level of processor 204, the congestion level of processor 204 is increased by one level or to the current cache miss level 334A.
In accordance with a determination that congestion level history 334 of processor 204 indicates a lower congestion level than the congestion level of processor 204 (e.g., all cache miss levels in the congestion level history 334 are lower than the congestion level of processor 204), the congestion level of processor 204 is reduced by one level or to the lower congestion level, e.g., from “H” to “L”.
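The history-based update described above can be illustrated with a minimal software sketch. It is not the disclosed circuitry: the class name, the history length, the two-level scale, and the numeric threshold are assumptions chosen only for illustration, and the sketch uses the "by one level" variant of the increase and decrease.

```python
from collections import deque

LEVELS = ["L", "H"]  # ordered low to high; a two-level scale is assumed here


class ProcessorCongestionTracker:
    """Hypothetical software model of congestion level history 334 for one processor."""

    def __init__(self, miss_threshold, history_len=4):
        self.miss_threshold = miss_threshold      # stands in for cache miss threshold 336'
        self.history = deque(maxlen=history_len)  # a full deque drops its oldest entry
        self.level = "L"                          # tracked congestion level of the processor

    def sample(self, cache_misses):
        """Record one sample interval and return the updated congestion level."""
        current = "H" if cache_misses > self.miss_threshold else "L"
        self.history.append(current)              # stored as the newest entry (334A)
        cur_i = LEVELS.index(current)
        lvl_i = LEVELS.index(self.level)
        if cur_i > lvl_i:
            # The newest sample is more congested than the tracked level: raise by one.
            self.level = LEVELS[lvl_i + 1]
        elif all(LEVELS.index(h) < lvl_i for h in self.history):
            # Every recorded sample is below the tracked level: lower by one.
            self.level = LEVELS[lvl_i - 1]
        return self.level
```

In this sketch the level rises as soon as one congested sample is seen but falls only after the congested sample has aged out of the history, so a single quiet interval does not immediately release throttling.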
In some implementations, the electronic device 200 includes a second processing cluster 202-M having one or more second processors 206 different from the one or more processors 204 of processing cluster 202-1. Throttler 216-1 limits prefetch requests by processing cluster 202-1, independently of whether prefetch requests from one or more second processors 206 of second processing cluster 202-M are limited. In some implementations, prefetching by second processing cluster 202-M is controlled in accordance with any of the methods for controlling prefetching described herein with respect to processing cluster 202-1. In some implementations, prefetching by second processing cluster 202-M may indirectly affect prefetching by processing cluster 202-1 by indirectly affecting system congestion; however, prefetching or prefetch throttling of second processing cluster 202-M is not directly taken into account in determining whether to limit prefetching by processing cluster 202-1.
The current congestion levels of cache 220 and memory 104 are monitored with respective sampling rates that are optionally equal to or different from each other. First and second congestion level histories 402 and 404 can store up to respective limited numbers of historical congestion levels, and the respective limited numbers are optionally equal to or different from each other. In an example, the first and second congestion level histories 402 and 404 track a first integer number of historical congestion levels of cache 220 and a second integer number of historical congestion levels of memory 104. The first and second integer numbers are optionally equal to or distinct from each other.
In some implementations, throttler 216 is configured to cause processing cluster 202 to limit prefetch requests from processing cluster 202 in accordance with a highest throttling level 420 based on first congestion level history 402 of cache 220 including the obtained current congestion level 402A of cache 220. In some situations, highest throttling level 420 is determined without regard to the obtained current congestion level of memory 104. In some implementations, whether prefetch requests from processing cluster 202 are limited in accordance with highest throttling level 420 is based on the obtained current congestion level of cache 220, on first congestion level history 402 of cache 220, and/or on a first congestion level of cache 220 that is determined based on at least a portion of first congestion level history 402 of cache 220. For example, highest throttling level 420 may be determined with reference to a first system congestion condition 316 (e.g., at least a predefined percentage of first congestion level history 402 is equal to “H”). In some implementations, congestion of cache 220, but not congestion of memory 104, determines whether prefetch requests from processing cluster 202 are limited in accordance with highest throttling level 420. Additionally, in some implementations, throttler 216 is configured to cause processing cluster 202 to limit prefetch requests in accordance with highest throttling level 420 based on the congestion levels of both processing cluster 202 and cache 220. For example, highest throttling level 420 is applied to limit prefetching, when the congestion level of processing cluster 202 is above the cluster congestion threshold 308 and first congestion level history 402 of cache 220 satisfies first system congestion condition 316. In some implementations, highest throttling level 420 corresponds to a throttle all mode M4 in which no prefetching is permitted (312).
Further, in some implementations, throttler 216 is configured to cause processing cluster 202 to limit prefetch requests from processing cluster 202 in accordance with highest throttling level 420 based on first congestion level history 402 of cache 220, e.g., based on a subset of first congestion level history 402 and/or second congestion level history 404. The subset of first congestion level history 402 includes less than all or all congestion levels stored in history 402. In an example, throttler 216 causes processing cluster 202 to limit prefetch requests from processing cluster 202 based on one or more most-recently determined and recorded congestion levels of cache 220. In some implementations, the subset of first congestion level history 402 has the same number of recorded historical congestion levels (e.g., the same number of samples or entries) as second congestion level history 404.
In some implementations, throttler 216 is configured to cause processing cluster 202 to limit prefetch requests from processing cluster 202 in accordance with highest throttling level 420, e.g., to activate highest throttling level 420, based on a determination that first congestion level history 402 includes more than a first threshold number of determined congestion levels indicating a respective congestion level of cache 220 (e.g., a high congestion level “H” that is above a system congestion threshold). For example, highest throttling level 420 is activated if first congestion level history 402 (or the subset of first congestion level history 402) includes greater than a first threshold number (or alternatively, first threshold percentage) of instances where the high congestion level (e.g., “H”) was recorded for cache 220.
In some implementations, throttler 216 is configured to cause processing cluster 202 to forgo limiting prefetch requests from processing cluster 202 in accordance with highest throttling level 420, e.g., to deactivate highest throttling level 420, based on a determination that first congestion level history 402 includes less than a second threshold number of determined congestion levels indicating the respective congestion level of cache 220 (e.g., the high congestion level “H” that is above the system congestion threshold). For example, highest throttling level 420 is deactivated if first congestion level history 402 (or the subset of first congestion level history 402) includes less than a second threshold number (or alternatively, second threshold percentage) of instances where a high congestion level (e.g., “H”) was recorded for cache 220. In some implementations, the first threshold number is the same as the second threshold number (or alternatively, the first threshold percentage is the same as the second threshold percentage). In some implementations, the first threshold number is different from (e.g., greater than) the second threshold number (or alternatively, the first threshold percentage is different from the second threshold percentage). In an example, both the first and second threshold percentages are 50%. In another example, the first threshold percentage is 75%, and the second threshold percentage is 25%.
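As a rough illustration of the activation and deactivation thresholds described above, the following sketch applies the 75% and 25% example percentages to a history of “L”/“H” entries. The function name, its parameters, and the choice to hold the previous state between the two thresholds are assumptions made for the sketch.

```python
def update_throttle_all(history, active, on_fraction=0.75, off_fraction=0.25):
    """Return whether the highest throttling level should be active.

    history: sequence of "L"/"H" entries standing in for history 402.
    active:  whether the highest throttling level is currently active.
    """
    if not history:
        return active  # nothing recorded yet; keep the current state
    high_fraction = sum(1 for h in history if h == "H") / len(history)
    if high_fraction > on_fraction:
        return True    # more than the first threshold percentage is "H"
    if high_fraction < off_fraction:
        return False   # less than the second threshold percentage is "H"
    return active      # between the thresholds: hold state (assumed behavior)
```

When the two percentages are equal (e.g., both 50%), the hold band shrinks to a single value and the function behaves as a plain threshold comparison.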
In some implementations, limiting prefetch requests from processing cluster 202 in accordance with highest throttling level 420 includes limiting all prefetch requests from processing cluster 202, e.g., in a throttle all mode M4. In accordance with highest throttling level 420, no prefetch requests from processing cluster 202 are permitted.
In some implementations, throttler 216 determines a first congestion level of cache 220 and a second congestion level of memory 104. In accordance with a determination that the obtained current congestion level 402A of cache 220 indicates a higher congestion level than the first congestion level, throttler 216 increases the first congestion level, e.g., to a next-higher level in a set of possible congestion levels. Conversely, in accordance with a determination that first congestion level history 402 indicates a lower congestion level than the first congestion level (e.g., the entire first congestion level history 402 is lower than the first congestion level), throttler 216 decreases the first congestion level. For example, in accordance with a determination that no entry in first congestion level history 402 indicates a congestion level higher than the current value of the first congestion level, throttler 216 decreases the first congestion level, e.g., to a next-lower level in the set of possible congestion levels. Similarly, in some implementations, in accordance with a determination that the obtained current congestion level 404A of memory 104 indicates a higher congestion level than (e.g., a current value of) the second congestion level, throttler 216 increases the second congestion level, e.g., to a next-higher level in the set of possible congestion levels. In accordance with a determination that second congestion level history 404 indicates a lower congestion level than the second congestion level (e.g., the entire second congestion level history 404 is lower than the second congestion level), throttler 216 decreases the second congestion level. 
For example, in some implementations, in accordance with a determination that no entry in second congestion level history 404 indicates a congestion level higher than the current value of the second congestion level, throttler 216 decreases the second congestion level, e.g., to a next-lower level in the set of possible congestion levels. As such, throttler 216 causes processing cluster 202 to limit prefetch requests from processing cluster 202 based on the first congestion level and the second congestion level, and the first congestion level and the second congestion level are taken into account in determining whether to limit prefetch requests in accordance with a respective throttling level that is below a highest throttling level.
In some implementations, first system congestion level 406 is determined based on the obtained current congestion level 402A of cache 220, on first congestion level history 402 of cache 220, and/or on the first congestion level of cache 220 that is determined based on at least a portion of first congestion level history 402 of cache 220. A second system congestion level 408 is determined based on the obtained current congestion level 404A of memory 104, on second congestion level history 404 of memory 104, and/or on a second congestion level of memory 104 that is determined based on at least a portion of second congestion level history 404 of memory 104. Congestion levels 406 and 408 are combined to generate a combined system congestion level 410 having two or more congestion values, such as first congestion value 326 and second congestion value 328, which are applied to determine different cache miss thresholds (i.e., cache miss thresholds 302′ and 308′). In some embodiments, the combined system congestion level 410 is equal to a greater one of congestion level 406 of cache 220 and congestion level 408 of memory 104. For example, if congestion level 406 is “L” and congestion level 408 is “H”, the combined system congestion level 410 is “H”. If congestion level 406 is “H” and congestion level 408 is “L”, the combined system congestion level 410 is still “H”.
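The "greater one of" combination in the example above can be sketched as follows. The level ordering table and the inclusion of an intermediate “M” level are assumptions for illustration; the function name is hypothetical.

```python
# Assumed ordering of congestion levels, lowest to highest.
LEVEL_ORDER = {"L": 0, "M": 1, "H": 2}


def combined_system_congestion(cache_level, memory_level):
    """Return the greater of the two levels (standing in for levels 406 and 408)."""
    return max(cache_level, memory_level, key=LEVEL_ORDER.__getitem__)
```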
In some implementations, a threshold quality for prefetch requests is dependent on a local cluster congestion level of cluster cache 212, in addition to the system congestion level 410 of cache 220 and/or memory 104. In accordance with a determination that the congestion level of processing cluster 202 satisfies second congestion criteria, different from the first congestion criteria, that require that the congestion level of processing cluster 202 is above a second cluster congestion threshold 308 that is above the first cluster congestion threshold 302, throttler 216 causes the first respective processor 204 to limit prefetch requests to cluster cache 212 to prefetch requests of at least a second threshold quality 310 that is higher than the first threshold quality 304. In some implementations, a first threshold quality 304 (e.g., high-quality prefetch) is selected from a first set of quality thresholds 502 based on the system congestion level 410, and a second threshold quality 310 (e.g., very high-quality prefetch) is selected from a second set of quality thresholds 510 based on the system congestion level 410. In the second set of quality thresholds 510, first system congestion level 504 is higher than third system congestion level 508 and lower than second system congestion level 506, and a first value (QVHM) of second threshold quality 310 corresponding to first system congestion level 504 is less than a second value (QVHH) of second threshold quality 310 corresponding to second system congestion level 506 and greater than a third value (QVHL) of second threshold quality 310 corresponding to third system congestion level 508. For the same system congestion level, e.g., 504, first value (QVHM) of second threshold quality 310 is also higher than first value (QHM) of first threshold quality 304 because the local cluster congestion level of cluster cache 212 is higher in association with second threshold quality 310.
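The selection of a threshold quality from the two sets can be sketched as a table lookup. The numeric scores are invented for illustration; only the orderings within the second set and between QVHM and QHM come from the description, and the ordering within the first set is assumed to be analogous. The function names and the "first"/"second" labels are hypothetical.

```python
# Hypothetical quality scores keyed by system congestion level.
FIRST_SET = {"L": 0.50, "M": 0.60, "H": 0.70}   # stands in for set 502: QHL, QHM, QHH
SECOND_SET = {"L": 0.75, "M": 0.85, "H": 0.95}  # stands in for set 510: QVHL, QVHM, QVHH


def threshold_quality(cluster_criteria, system_level):
    """Pick the threshold quality for the satisfied cluster criteria and system level."""
    table = SECOND_SET if cluster_criteria == "second" else FIRST_SET
    return table[system_level]


def permit_prefetch(quality, cluster_criteria, system_level):
    """A prefetch request is permitted only if it meets the threshold quality."""
    return quality >= threshold_quality(cluster_criteria, system_level)
```

For instance, with these assumed scores a prefetch of quality 0.8 passes under the first criteria even at system level “H”, but is rejected under the second criteria at system level “M”.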
Additionally, in each processor 204, respective prefetcher 208 is associated with a subset of or all of the following data:
Prefetch throttling circuitry determines (704) a congestion level of first processing cluster 202-1 based on an extent to which the plurality of data retrieval requests sent from one or more processors 204 in first processing cluster 202-1 to cache 212-1 are not satisfied by cache 212-1. The plurality of data retrieval requests optionally include all data retrieval requests sent from one or more processors 204 to cache 212-1 within a predefined period of time. In some implementations, the congestion level of first processing cluster 202-1 is determined based on an extent to which the plurality of data retrieval requests sent from one or more processors 204 in first processing cluster 202-1 to cache 212-1 are not satisfied by cache 212-1, without regard to which of one or more processors 204 sent the plurality of data retrieval requests.
In some implementations, determining the congestion level of first processing cluster 202-1 includes comparing the number of the plurality of data retrieval requests, sent from one or more processors 204 in first processing cluster 202-1 to cache 212-1, that are not satisfied by cache 212-1 to one or more cache miss thresholds (e.g., thresholds 302′ and 308′ in
In accordance with a determination that the congestion level of first processing cluster 202-1 satisfies first congestion criteria that require that the congestion level of first processing cluster 202-1 is above a first cluster congestion threshold 302, the prefetch throttling circuitry causes (706) a first respective processor 204-1 of one or more processors 204 to limit prefetch requests to cache 212-1 to prefetch requests of at least a first threshold quality 304. Conversely, in accordance with a determination that the congestion level of first processing cluster 202-1 does not satisfy the first congestion criteria, the prefetch throttling circuitry forgoes (708) causing one or more processors 204 to limit prefetch requests to cache 212-1 to prefetch requests of at least the first threshold quality 304.
In some implementations, the first threshold quality 304 is selected from a set of quality thresholds based on a system congestion level of the device (e.g., a combined system congestion level 410 in
In some implementations, in accordance with a determination that the congestion level of first processing cluster 202-1 satisfies second congestion criteria, different from the first congestion criteria, that require that the congestion level of first processing cluster 202-1 is above a second cluster congestion threshold 308 that is above the first cluster congestion threshold 302, the prefetch throttling circuitry causes first respective processor 204-1 to limit prefetch requests to cache 212-1 to prefetch requests of at least a second threshold quality 310 that is higher than the first threshold quality 304. Further, in some implementations, in accordance with a determination that the congestion level of first processing cluster 202-1 satisfies third congestion criteria, different from the first congestion criteria, the prefetch throttling circuitry causes the first respective processor to forgo transmitting prefetch requests to cache 212-1, e.g., in a throttle all mode M4. Further, in some implementations, the third congestion criteria include a requirement that a system congestion level of the device (e.g., first congestion level history 402 of cache 220) satisfies a system congestion condition 316.
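The tiered criteria above can be summarized in a short decision sketch. The returned mode names other than the throttle all mode are hypothetical labels, the cluster congestion level is modeled as a plain number, and the third criteria are assumed, following the example given elsewhere in this description, to combine the second cluster congestion threshold with the system congestion condition.

```python
def select_throttling_mode(cluster_level, threshold_302, threshold_308,
                           system_condition_met):
    """Map a cluster congestion level to a prefetch throttling mode.

    threshold_302 < threshold_308 is assumed. system_condition_met stands in
    for system congestion condition 316 being satisfied.
    """
    if cluster_level > threshold_308 and system_condition_met:
        return "throttle_all"             # third criteria: no prefetch requests sent (M4)
    if cluster_level > threshold_308:
        return "very_high_quality_only"   # second criteria: threshold quality 310
    if cluster_level > threshold_302:
        return "high_quality_only"        # first criteria: threshold quality 304
    return "no_throttle"                  # no criteria satisfied
```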
In some implementations, in accordance with a determination that a congestion level of a second respective processor 204-M is below a processor congestion threshold 336, regardless of the congestion level of first processing cluster 202-1, the prefetch throttling circuitry forgoes limiting prefetch requests from the second respective processor 204-M to cache 212-1, wherein the congestion level of second respective processor 204-M is determined based on an extent to which data retrieval requests sent from second respective processor 204-M to cache 212-1 are not satisfied by cache 212-1.
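The per-processor exemption described above can be sketched as a gate applied on top of the cluster-level decision. The function and parameter names are hypothetical, and treating a level exactly equal to the threshold as "not above" is an assumption.

```python
def processor_is_throttled(processor_level, processor_threshold, cluster_throttled):
    """Limit a processor's prefetches only if it is itself congested.

    processor_level stands in for the congestion level of one processor 204,
    and processor_threshold for processor congestion threshold 336.
    """
    if processor_level <= processor_threshold:
        return False              # exempt regardless of cluster congestion
    return cluster_throttled      # otherwise follow the cluster-level decision
```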
It is noted that in some embodiments, the first respective processor 204-1 of the one or more processors is caused to limit prefetch requests to cache 212-1 to prefetch requests of at least the first threshold quality, in accordance with a determination that a congestion level of the first respective processor 204-1 is above a processor congestion threshold 336. That said, in an example, if the congestion level of the first respective processor 204-1 is “H”, the prefetch requests from the first respective processor 204-1 are limited to at least the first threshold quality, and if the congestion level of the first respective processor 204-1 is “L”, the prefetch requests from the first respective processor 204-1 are not limited. In some embodiments, the congestion level of the first respective processor 204-1 is determined based on one or more historical congestion levels (e.g., in history 334 in
In some implementations, a second processing cluster 202-M includes one or more second processors 206 different from one or more processors 204 of first processing cluster 202-1. The prefetch throttling circuitry limits prefetch requests by first processing cluster 202-1 independently of whether prefetch requests from one or more second processors 206 of second processing cluster 202-M are limited.
The prefetch throttling circuitry causes (812) a respective processing cluster to limit prefetch requests from the respective processing cluster 202 based on at least one of the obtained current congestion level of the first memory and the obtained current congestion level of the second memory.
In some implementations, the prefetch throttling circuitry determines a respective throttling level, of a plurality of throttling levels, for respective processing cluster 202 based on a congestion level of respective processing cluster 202. Further, in some implementations, a combined system congestion level 410 is determined based on the obtained current congestion level of the first memory and the obtained current congestion level of the second memory. In an example, the combined system congestion level 410 is equal to a greater one of the obtained current congestion level of the first memory and the obtained current congestion level of the second memory. The prefetch throttling circuitry determines the respective throttling level for respective processing cluster 202 based on comparing the congestion level of respective processing cluster 202 to one or more cluster congestion thresholds 302 and 308 that vary based on the combined system congestion level 410. Further, in some implementations, the prefetch throttling circuitry causes respective processing cluster 202 to limit prefetch requests to prefetch requests of at least a respective threshold quality 304 or 310, and the respective threshold quality 304 or 310 corresponds to the respective throttling level for the respective processing cluster 202 and is determined based on the combined congestion level 410. More details on determining the threshold quality 304 or 310 are discussed above with reference to
In some implementations, the prefetch throttling circuitry causes respective processing cluster 202 to limit prefetch requests from respective processing cluster 202 in accordance with a highest throttling level 420 based on the first congestion level history 402 of the first memory including the obtained current congestion level of the first memory, e.g., in a throttle all mode M4. Further, in some implementations, the prefetch throttling circuitry causes respective processing cluster 202 to limit prefetch requests from respective processing cluster 202 based on a subset of the first congestion level history 402 and on second congestion level history 404. Additionally, in some implementations, the prefetch throttling circuitry causes respective processing cluster 202 to limit prefetch requests from respective processing cluster 202 in accordance with highest throttling level 420 based on a determination that first congestion level history 402 includes more than a first threshold number of determined congestion levels (e.g., “H”) indicating a respective congestion level of the first memory. Further, in some implementations, the prefetch throttling circuitry causes respective processing cluster 202 to forgo limiting prefetch requests from respective processing cluster 202 in accordance with highest throttling level 420 based on a determination that the first congestion level history 402 includes less than a second threshold number of determined congestion levels indicating the respective congestion level of the first memory. Further, in some implementations, limiting prefetch requests from respective processing cluster 202 in accordance with highest throttling level 420 includes limiting all prefetch requests from respective processing cluster 202, e.g., in a throttle all mode M4.
It is noted that in some implementations, limiting prefetch requests from respective processing cluster 202 according to highest throttling level 420 is also implemented based on a combination of (1) the congestion level of respective processing cluster 202 and (2) the obtained current congestion level, first congestion level history 402, or a subset of first congestion level history 402 of the first memory (e.g., cache 220). For example, highest throttling level 420 is applied to limit prefetching when the congestion level of processing cluster 202 is above cluster congestion threshold 308 and the first congestion level history 402 of cache 220 satisfies a first system congestion condition 316 (e.g., in which first congestion level history 402 of cache 220 includes more than a first threshold number of determined congestion levels (e.g., “H”) indicating a respective congestion level of the first memory).
In some implementations, the electronic device determines a first congestion level of the first memory (e.g., congestion level 406 of cache 220 in
It should be understood that the particular order in which the operations in
Implementation examples are described in at least the following numbered clauses:
The above description has been provided with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to be limiting to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles disclosed and their practical applications, to thereby enable others to best utilize the disclosure and various embodiments with various modifications as are suited to the particular use contemplated.
The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art to best utilize the embodiments with various modifications as are suited to the particular use contemplated.
Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.
The present application claims priority to and is a continuation of U.S. Pat. Application Serial No. 17/591,134, filed Feb. 2, 2022 and entitled “THROTTLING SCHEMES IN MULTICORE MICROPROCESSORS,” which is incorporated herein by reference in its entirety. The ‘134 application claims priority to U.S. Provisional Pat. Application No. 63/187,232, filed May 11, 2021 and entitled “Throttling Schemes in Multicore Microprocessors,” and U.S. Provisional Pat. Application No. 63/187,241, filed May 11, 2021 and entitled “Throttling Schemes in Multicore Microprocessors,” each of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63187232 | May 2021 | US
63187241 | May 2021 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 17591134 | Feb 2022 | US
Child | 18155555 | | US