This disclosure relates to computing systems and related devices and methods, and, more particularly, to a method and apparatus for dynamically adapting cache size based on estimated cache performance.
The following Summary and the Abstract set forth at the end of this application are provided herein to introduce some concepts discussed in the Detailed Description below. The Summary and Abstract sections are not comprehensive and are not intended to delineate the scope of protectable subject matter, which is set forth by the claims presented below.
All examples and features mentioned below can be combined in any technically possible way.
In some embodiments, a method of dynamically adjusting sizes of cache partitions includes, for each cache partition, estimating a number of hits that would occur on the cache partition for a set of potential size increases of the cache partition and a set of potential size decreases of the cache partition. Based on these estimates, a determination is made for each cache partition, whether to increase the size of the cache partition, maintain a current size of the cache partition, or decrease the size of the cache partition. Cache partition size increases are balanced with cache partition size decreases to allocate the entirety of the cache to the set of cache partitions without over allocating cache resources and while optimizing a sum of total cache hit rates of the set of cache partitions. A set of data structures is used to efficiently determine the estimated hit increases and decreases for each cache partition.
Storage array 112 may be implemented using numerous physical drives using different types of memory technologies. In some embodiments the drives used to implement storage array 112 are implemented using Non-Volatile Memory (NVM) media technologies, such as NAND-based flash, or higher-performing Storage Class Memory (SCM) media technologies such as 3D XPoint and Resistive RAM (ReRAM). Storage array 112 may be directly connected to the other components of the storage system 100 or may be connected to the other components of the storage system 100, for example, by an InfiniBand (IB) bus or fabric.
Data clients 110 act as hosts and provide access to the storage resources provided by storage array 112. In some embodiments, data clients 110 execute in emulations 120 instantiated in the context of the storage system 100. In some embodiments, a hypervisor 122 abstracts the physical resources of the storage system 100 from emulations 120, and allocates physical resources of storage system 100 for use by the emulations 120. Each emulation 120 has an emulation operating system 122 and one or more application processes running in the context of the emulation operating system 122.
Resources meant for caching are usually shared among several beneficiaries. Workloads from distinct applications or assigned to different LUNs have different Service Level Agreements (SLAs). Example service levels may include the expected average response time for an IO operation on the TLU, the number of IO operations that may be performed on a given TLU, and other similar parameters. One manner in which the storage system 100 seeks to meet the SLAs for the various data clients 110 is to optimize use of the cache 118.
Cache mechanisms are crucial to computer systems such as storage arrays and compute clusters. Correctly placing data with a high probability of being requested on fast memory media can substantially reduce the response times of input/output (I/O) requests. However, the diversity and the unpredictability of the I/O stream commonly nurture the allocation of large memory areas for caching purposes. Since dynamic random-access memory (DRAM) hardware is expensive, it is important to properly assess cache sizes to improve resource utilization.
Unfortunately, physical cache resources are limited, and the optimal cache area allocation may not be apparent, which may lead to inadequate resource utilization and SLA infringement. According to some embodiments, a method is provided that identifies how the cache hit rate would be expected to change if the relative sizes of the cache partitions are changed. By determining how the cache hit rate would be affected, it is possible to adapt the cache partition sizes dynamically. In some embodiments, this method is implemented by cache management system 128.
There is no general rule that specifies the amount of cache required to obtain a given cache hit ratio. The intrinsic dynamics of the Least Recently Used (LRU) eviction policy and the stochastic behavior of the workload make the functioning of the cache difficult to predict. Often, cache sizes are set by experimenting with different sizes to assess performance, which may be too costly or not viable in certain situations.
Additionally, even when the cache sizes are properly established initially, cache performance tends to be affected as the workload changes over time. The frequency of requests, their sizes, and how sequential the reads are will drastically affect how much benefit the cache can provide. For example, if a workload includes primarily sequential requests, then increasing the cache size will cause an increase in overall performance. If the current workload primarily contains a series of random reads, however, increasing the cache size will not lead to significantly greater performance. Even so, it is not prudent to reduce the cache size without information about how much the reduction would degrade cache performance. Thus, most systems currently adopt conservative heuristics that can cause the cache sizes to be larger than necessary.
According to some embodiments, a method of estimating cache performance is provided that enables the cache sizes to be dynamically adapted over time. The technique is designed for LRU caches that share the same physical memory, e.g. when cache memory is segmented by application or LUNs. However, the technique is also applicable to any cache system that uses a Least Recently Used (LRU) policy for data eviction. For simplicity, some embodiments will be described in which the LRU policy that is being used to control operation of the cache is that a single page of memory is evicted from the cache every time a new page of memory is added to the cache. Other LRU policies may be used as well, since the particular techniques and data structures described herein can be used in connection with an LRU cache regardless of the LRU cache policy being used to determine which pages of memory are to be evicted from the cache over time.
When the physical memory is shared by several LRU caches, the optimal resource segmentation might not be obvious. Different workloads will have different requirements in terms of volume and speed (response time). Thus, the system that manages the cache (cache management system 128) needs to search for a robust strategy to split the physical memory of cache 118 to maximize overall performance and comply with the SLAs. This strategy may include decisions such as reducing the size of the cache associated with LUN A (cache A) to free space to expand the size of the cache associated with LUN B (cache B). To make this decision, the cache management system needs to estimate how much it will lose by reducing the size of cache A by a first amount, and how much it will gain by expanding the size of cache B by a second amount. In an environment where the cache is partitioned into hundreds of individual cache allocations, the estimation should be done in a computationally efficient manner and should enable simultaneous modeling of different sized increases and decreases for each of the cache partitions.
One primary measurement of performance in cache systems is the hit probability. When a piece of data residing at a given address of the storage system is requested and found in the cache, it is considered a hit. Higher hit rates are strongly correlated with lower response times. Thus, the cache management system typically aims to maximize hit probability.
According to some embodiments, a method is described which enables the cache partition sizes to be optimally adjusted based on estimates of cache performance under different cache partition size values. In some embodiments, the cache performance is estimated by determining potential hit gains (PHGa) and potential hit losses (PHLo) associated with cache size changes. The technique is designed to run in real-time cache systems based on LRU eviction policy, and is capable of determining the cache performance for multiple cache partition size changes.
Initial Cache Hit Rate Total = HT1 + HT2
Other metrics, such as a weighted sum, may also be used to determine the overall hit ratio of the cache.
Adjusted Cache Hit Total = HT1′ + HT2′
If the goal of the cache management system 128 is to optimize the overall cache hit rate of the cache 118, it would be beneficial to adjust the cache from the initial set of partition sizes to the adjusted set of partition sizes whenever the adjusted cache hit total exceeds the initial cache hit rate total.
A physical memory can be considered to have M slots to be used for caching purposes. Each cache Ci (1≤i≤n) has |Ci| slots, which are physically stored in memory such that Σ∀i|Ci|≤M. A slot is an arbitrary unit of data, the size of which may vary depending on the particular implementation of the storage system. In some embodiments, the slot of memory is also referred to herein as a page. Throughout the description, the notation Ci[j] will be used to refer to the slot in the jth position of Ci such that 1≤j≤|Ci|.
A request (x, s) is considered to have a starting address x and a size s≥1. When a request (x, s) is directed to a cache Ci, it will necessarily occupy all the slots between Ci[1] and Ci[s]. In other words, the first s slots of the cache will store the data from addresses x+s−1, x+s−2, . . . , x+1, x (s is greater than 4 in this example).
According to the LRU policy, if a cache Ci is full (i.e. all its slots are occupied) and receives a new request (x, s), one of the two following scenarios occurs for each address x, x+1, . . . , x+s−1. If the address is already in the cache (i.e. a hit), the address is promoted to slot Ci[1] of the cache. If the address is not already in the cache (i.e. a miss), all the data already in the cache is shifted right to the next slot of the cache, and the new address is pushed into slot Ci[1]. Therefore, in the event of a cache miss in an LRU cache, the data originally occupying the last slot Ci[|Ci|] of the cache 118 is evicted from the cache 118.
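For reference, the LRU behavior described above can be illustrated with a minimal sketch. This is not the storage system's actual cache implementation; the class name and the use of Python's OrderedDict to track recency are assumptions made for clarity.

```python
from collections import OrderedDict

class LRUCacheSketch:
    """Minimal LRU cache sketch: a hit promotes an address to the head slot,
    a miss inserts the address at the head and evicts the least recently used
    address from the tail slot."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = OrderedDict()  # ordered newest-first; keys are addresses

    def access(self, start_address, size=1):
        """Process a request (x, s); returns (hits, evicted_addresses)."""
        hits = 0
        evicted = []
        for address in range(start_address, start_address + size):
            if address in self.slots:
                hits += 1
                self.slots.move_to_end(address, last=False)  # promote to C[1]
            else:
                if len(self.slots) >= self.num_slots:
                    victim, _ = self.slots.popitem(last=True)  # evict C[|C|]
                    evicted.append(victim)
                self.slots[address] = None
                self.slots.move_to_end(address, last=False)    # push to C[1]
        return hits, evicted
```

For example, processing the request (11, 4) against a full cache that already holds addresses 11, 12, and 13 would report three hits and a single eviction, matching the behavior walked through later in the examples.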
According to some embodiments, an estimate of how many additional cache hits would occur by increasing the size of the cache is determined by looking at how many hits occur on data that has just been evicted from the cache. The estimate of how many additional hits a cache will have if the cache 118 is increased in size is referred to herein as Potential Hit Gains (PHGa).
The purpose of PHGa is to assess how many additional hits a cache Ci will have if its size increases by δ slots. This can be measured by monitoring which addresses are evicted from cache and requested again after a short period (referred to herein as the reuse distance). Only the addresses that were evicted during the last δ misses need to be maintained, because those are the ones that would cause a hit if the cache were δ slots bigger.
According to some embodiments, the cache management system maintains a global eviction counter E that sums the total number of addresses evicted from cache on its history.
The evicted addresses are kept in a double-linked list that is naturally sorted by the value of the eviction counter at the time each address was evicted. Additionally, each address in the double-linked list has an entry in an auxiliary hash map to speed up the search for an address in the double-linked list. More precisely, this auxiliary data structure maps every page address in the double-linked list to its corresponding element in the list. Whenever the eviction value associated with the oldest address in the list falls below E−δ, that address is flushed from both the list and the hash map. This is possible because future requests for this address would not increase the number of hits for a cache with δ extra slots.
To compute the potential hit gains, every time an address is requested and is not in the cache (i.e. a cache miss occurs), a check is performed to determine whether the address is contained in the eviction list data structure. If it is, the PHGa counter is incremented. The PHGa counter is evaluated periodically (e.g. at the end of each predetermined time period) to determine how much it changed during the time period of interest. The amount by which the PHGa counter increased during the time period is the number of additional hits that the cache would have experienced during that time period if the cache 118 had been δ slots larger. Thus, the PHGa counter provides an estimate of the increase in cache hit rate associated with increasing the cache by δ slots. In some embodiments, multiple values of δ can be evaluated simultaneously, so that the cache management system 128 is able to determine how different cache size increases would affect the cache hit rate of the cache partition.
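The following is a sketch of how the eviction counter E, the eviction list, and the PHGa counter described above might be combined for a single value of δ. The class and method names are hypothetical, and Python's OrderedDict stands in for the double-linked list plus its auxiliary hash map.

```python
from collections import OrderedDict

class PHGaTracker:
    """Potential Hit Gain estimator for a single delta. Retains the addresses
    evicted during the last `delta` misses, each tagged with the value of the
    global eviction counter E at the time it was evicted."""

    def __init__(self, delta):
        self.delta = delta
        self.eviction_counter = 0           # global eviction counter E
        self.phga = 0                       # PHGa counter
        self.eviction_list = OrderedDict()  # address -> eviction value, oldest first

    def record_eviction(self, address):
        """Called whenever the cache evicts `address`."""
        self.eviction_counter += 1
        self.eviction_list[address] = self.eviction_counter
        self.eviction_list.move_to_end(address)
        # Flush addresses whose eviction value is below E - delta; a request for
        # them would not be a hit even if the cache were delta slots larger.
        while self.eviction_list:
            oldest_addr, oldest_value = next(iter(self.eviction_list.items()))
            if oldest_value < self.eviction_counter - self.delta:
                self.eviction_list.pop(oldest_addr)
            else:
                break

    def record_miss(self, address):
        """Called on a cache miss for `address`."""
        if address in self.eviction_list:
            self.phga += 1  # a cache delta slots larger would have scored a hit
```

In use, record_miss would be called from the cache's miss path and record_eviction from its eviction path; the change in the phga counter over an evaluation period is the estimated number of extra hits for a cache δ slots larger.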
In the first stage (at time T=1), the cache 118 is full, but no eviction has ever been made in the cache's history. Accordingly, at time T=1, the eviction counter 500 is set to zero (E=0) and the PHGa counter 505 is also set to zero (PHGa=0). The eviction list data structure 510 is also empty because no evictions have yet occurred from cache 118.
In the example shown in the figures, δ is set to 2, so the eviction list data structure 510 retains only the addresses evicted during the last two misses.
In some embodiments, each entry in the eviction list data structure 510 includes the evicted address 520 and the eviction value 515, i.e. the value of the eviction counter 500 at the time the address was evicted.
When the request (11, 4) arrives at the cache 118 at time T=2, addresses 11, 12, and 13 are already in the cache. Thus, no eviction needs to occur in connection with these three addresses, and the cache hit on these three addresses simply causes addresses 11, 12, and 13 to be moved to the head of the cache, as shown in the state of the cache at time T=3. However, at time T=2, address 14 is not currently in the cache 118 and, accordingly, another eviction is necessary to open a slot for address 14. Thus, the eviction counter 500 is incremented to E=3, and the eviction list is updated to add address 3 and remove address 20. In some embodiments, the process computes the difference between the eviction value 515 of an address 520 and the value of the eviction counter E 500, and if the difference is greater than δ, the address is removed from the eviction list. This is shown in the third stage, which shows the state of the cache 118 at time T=3.
At time T=3, a new request for address 53 arrives. In this situation, there is a cache miss, because address 53 is not contained in the cache 118 at time T=3. However, address 53 is on the eviction list, which indicates that if the cache were δ slots larger, a cache hit would have occurred. Thus, the PHGa counter 505 is incremented to PHGa=1 as shown in the state of the cache at time T=4. The eviction necessary to enable address 53 to be added to the cache 118 causes address 9 to be evicted. The eviction list is updated so that it includes addresses 9 and 3, and the eviction counter 500 is updated to E=4, as shown in the state of the cache at time T=4.
Although an example was described in which δ was set to 2 for ease of explanation, other values of δ may be used. Likewise, as discussed in greater detail below, it is possible to implement multiple eviction lists or multiple PHGa counters 505 for different δ values, to simultaneously evaluate different cache size changes to identify an optimum value of δ given current IO patterns for that partition of the cache. Additionally, each partition may simultaneously be evaluated using an eviction list data structure 510, eviction counter 500, and PHGa counter 505, to determine which partitions of the cache would benefit from being increased in size and to characterize the increased hit rate associated with increasing the partition sizes of the cache.
In addition to determining the potential hit gain PHGa if a first partition of the cache is increased by δ, it is also necessary to determine the Potential Hit Loss (PHLo) of a second partition of the cache 118. This is because increasing one partition size allocation results in a reduction of another partition size allocation. Reducing the other partition size allocation can result in fewer hits to the other partition of the cache.
The purpose of PHLo is to assess how many fewer hits a cache partition Ci will have if its size decreases by δ slots. In some embodiments, this is measured by monitoring where in the cache the cache hits occur. If a cache hit occurs in the final δ slots of the cache partition, reducing the cache by δ slots would cause a cache miss to occur for that IO. For example, if the current total number of hits of a cache Ci is 1000, and 200 of those hits occur between slots Ci[|Ci|−δ+1] and Ci[|Ci|], the PHLo is 200. In other words, if the cache were δ slots smaller, the cache would have 200 fewer hits.
To compute PHLo, in some embodiments a PHLo data structure 600 in the form of a PHLo hit counter is used to sum the hits that occur on the last δ slots of the cache.
At time T=2, a request (11, 4) arrives. Three of the addresses of this request (11, 12, and 13) are contained in the cache 118. Two of these addresses (12 and 13) are within the last δ=2 slots of the cache 118. Accordingly, if the cache partition were δ slots smaller (i.e. 2 slots smaller), these two addresses (12 and 13) would not have been found in the cache 118. To capture the amount of potential loss associated with decreasing the cache size by δ slots, the PHLo hit counter 600 is incremented for each of these two hits. It is important to note that only the two hits that fall within δ slots of the end of the cache are used to update the PHLo hit counter 600, since only those hits would have become cache misses if the cache were reduced by δ slots.
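A corresponding sketch of the PHLo hit counter is shown below. It assumes the 1-based slot position of each hit (before the hit address is promoted to the head) is available to the caller; the class and method names are hypothetical.

```python
class PHLoTracker:
    """Potential Hit Loss estimator for a single delta. Counts hits that land
    in the last `delta` slots of a cache with `cache_size` slots; those hits
    would become misses if the cache were delta slots smaller."""

    def __init__(self, cache_size, delta):
        self.cache_size = cache_size
        self.delta = delta
        self.phlo = 0

    def record_hit(self, slot_position):
        """Called on a cache hit at 1-based slot position j, before promotion."""
        if slot_position >= self.cache_size - self.delta + 1:
            self.phlo += 1  # this hit would be lost with delta fewer slots
```

In the walkthrough above, the hits on addresses 12 and 13 fall within the last δ=2 slots, so each would increment the phlo counter, while the hit on address 11 would not.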
For the sake of simplicity, an example cache evolution was provided using a single, constant value of δ. It is also possible to measure PHGa and PHLo for different values of δ at once for a single cache partition Ci. To do so, the process uses an ordered set D={δ1, δ2, . . . , δk} of increments (or decrements) such that δb<δb+1 for all 1≤b<k.
To compute PHGa, instead of storing a single counter, a separate counter is used for each different value of δ in the set D. To keep track of addresses that are evicted from the cache, a single double-linked list and hash map is used, but the length of the double-linked list is set to δk (the largest δ under consideration). This enables a single eviction list data structure 510 to be used to keep track of all addresses for each of the δ lengths under consideration. Whenever an address is not found in the cache but is found in the eviction list data structure 510, a difference d is computed between its respective eviction value 515 and the current value of the global eviction counter E. This observed difference d increments the counter PHGa[δb] such that δb=min(D′), where D′={δ ∈ D | δ ≥ d}. The total hit gain for a given δx ∈ D is then defined as:
Σb=1…x PHGa[δb]
For the PHLo, a separate counter is used for each decrement δ ∈ D. Whenever a hit occurs in a slot Ci[j], the counter PHLo[δb] is incremented, where δb=min(D′) and D′={δ ∈ D | δ ≥ |Ci|−j+1}. The total hit loss for a given δx ∈ D is then defined as:
Σb=1…x PHLo[δb]
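The bucketed counters and cumulative totals described above might be sketched as follows. For simplicity the sketch assumes the same ordered set D is used for increments and decrements; the class name, helper names, and the use of the bisect module are assumptions.

```python
import bisect

class MultiDeltaEstimator:
    """Bucketed PHGa/PHLo counters for an ordered set of deltas D. Each
    observation increments the bucket of the smallest delta in D that covers
    it, and the total for a given delta is the cumulative sum of buckets."""

    def __init__(self, deltas, cache_size):
        self.deltas = sorted(deltas)     # D = {d1 < d2 < ... < dk}
        self.cache_size = cache_size     # |Ci|
        self.phga = [0] * len(self.deltas)
        self.phlo = [0] * len(self.deltas)

    def record_phga(self, eviction_distance):
        """eviction_distance d = E minus the eviction value of the re-requested address."""
        b = bisect.bisect_left(self.deltas, eviction_distance)  # smallest delta >= d
        if b < len(self.deltas):
            self.phga[b] += 1

    def record_phlo(self, slot_position):
        """slot_position is the 1-based slot j of a cache hit."""
        distance_from_tail = self.cache_size - slot_position + 1
        b = bisect.bisect_left(self.deltas, distance_from_tail)  # smallest delta >= |Ci|-j+1
        if b < len(self.deltas):
            self.phlo[b] += 1

    def total_phga(self, x):
        """Estimated extra hits if the cache grew by self.deltas[x] slots (0-based x)."""
        return sum(self.phga[: x + 1])

    def total_phlo(self, x):
        """Estimated lost hits if the cache shrank by self.deltas[x] slots (0-based x)."""
        return sum(self.phlo[: x + 1])
```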
The usefulness of a computational process depends on whether it can be implemented in real time for an actual system. For a large-scale storage system, computing PHGa and PHLo using the method described has constant time complexity, O(1), for each new page requested. More precisely, for every eviction all the data structures are updated in O(1) by consulting the hash map, and for every hit only the counter associated with the cache hit position is updated.
Where multiple values of δ are included in the set D, the time complexity of computing PHGa and PHLo for each request processed against the cache is O(|D|), because the estimated cache performance must be maintained for each of the |D| candidate sizes from these data structures.
In some embodiments, the cache management system 128 maintains the data structures described above and implements the process of determining PHGa and PHLo for multiple or all partitions of cache 118. To enable the storage system to optimize throughput while maintaining SLA compliance for each partition, in some embodiments, if the PHGa associated with increasing a first cache partition exceeds the PHLo associated with decreasing a second cache partition, the cache management system 128 adjusts the partition sizes to increase overall performance of the cache 118.
To determine how the partition sizes of the shared cache should be adjusted to increase the overall cache hit rate, the cache management system 128 determines an adjusted cache hit rate assuming the cache partition for LUN-1 is decreased by δ and the cache partition for LUN-2 is increased by δ.
If the adjusted cache hit rate is not greater than the initial cache hit rate (a determination of NO at block 710), the evaluated changes are not made to the cache partitions for LUN-1 and LUN-2.
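The comparison described above can be sketched as a simple helper, assuming the per-partition hit counts and the PHGa/PHLo estimates for a common δ are available; the function and parameter names are illustrative only.

```python
def should_rebalance(hits_1, phlo_1, hits_2, phga_2):
    """Return True if shrinking partition 1 by delta slots and growing
    partition 2 by delta slots is expected to raise the overall hit rate.

    hits_1, hits_2: observed hits for each partition at the current sizes
    phlo_1: estimated hits partition 1 would lose if it were delta slots smaller
    phga_2: estimated hits partition 2 would gain if it were delta slots larger
    """
    initial_total = hits_1 + hits_2                          # HT1 + HT2
    adjusted_total = (hits_1 - phlo_1) + (hits_2 + phga_2)   # HT1' + HT2'
    return adjusted_total > initial_total
```

Note that the check reduces to comparing the potential gain of one partition against the potential loss of the other (phga_2 > phlo_1), which matches the decision rule described earlier.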
Although the process has been described in connection with adjusting the relative sizes of two cache partitions, the same evaluation may be performed across a larger number of cache partitions that share the cache 118.
Although the preceding examples have focused on an environment where multiple cache partitions are sharing a cache, there are situations where the cache itself may be implemented as a single partition that has a variable size. For example, in a cloud storage environment, storage resources including cache resources may be obtained and paid for as needed. Likewise, it may be desirable to consider a single cache partition in isolation, without consideration of the other cache partitions. Accordingly, as explained below, the PHGa and PHLo estimates may also be used to adjust the size of a single cache considered in isolation.
Accordingly, in some embodiments, the potential hit rate increase associated with increasing the cache size by a first amount, and the potential hit rate decrease associated with decreasing the cache size by a second amount, are determined and compared against thresholds.
In some embodiments, if the potential hit rate increase is greater than a first threshold (a determination of YES at block 1065), the cache size is increased by the first amount (block 1075). If the potential hit rate decrease is less than a second threshold (a determination of YES at block 1070), the cache size is decreased by the second amount (block 1085). If the potential hit rate increase is not above the first threshold (a determination of NO at block 1065) and the potential hit rate decrease is not less than the second threshold (a determination of NO at block 1070), no adjustment is made to the cache size (block 1080).
Additionally, there may be instances where both the potential hit rate increase is above the first threshold (a determination of YES at block 1065) and the potential hit rate decrease is below the second threshold (a determination of YES at block 1070). In this instance the cache size may be increased by the first amount (block 1075), or another rule may be used to determine how the cache size should be adjusted, depending on the cache policy being used to regulate the use and allocation of cache resources.
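A minimal sketch of this single-partition decision is shown below. The threshold values, adjustment amounts, and function name are assumptions; the sketch simply gives growth precedence when both conditions hold, as described above.

```python
def adjust_single_cache(current_size, phga, phlo,
                        grow_amount, shrink_amount,
                        gain_threshold, loss_threshold):
    """Return the new cache size: grow when the potential hit gain is large
    enough, shrink when the potential hit loss of shrinking is small enough,
    otherwise leave the size unchanged. Growth takes precedence if both hold."""
    if phga > gain_threshold:
        return current_size + grow_amount      # block 1075
    if phlo < loss_threshold:
        return current_size - shrink_amount    # block 1085
    return current_size                        # block 1080
```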
An experiment was conducted to determine computationally whether PHGa and PHLo computed using this process validly predict changes in cache hit rates.
Using the process described herein, it is therefore possible to estimate the cache hit rates that would occur under varying cache sizes, which can be used to intelligently and dynamically adapt cache sizes to changing traffic patterns. The method monitors cache hits and cache evictions to estimate potential gains and losses when the cache size is updated. The proposed technique applies to any cache system, although it is especially well suited to those that segment the memory area into separate regions to serve different beneficiaries.
In some embodiments, PHGa and PHLo are continuously monitored over time, and the PHGa and PHLo values for the various cache partitions are periodically collected and used by the cache management system 128 to determine how the cache partition sizes should be adjusted. This evaluation process may iterate every evaluation period, for example every five minutes, to determine how the cache partition sizes should be dynamically adjusted over time. By monitoring how alternative cache partition sizes would have affected the cache hit rate under the current IO operational conditions, the cache management system 128 is able to directly determine which cache partition size adjustments would have increased the current cache hit rate, and precisely by how much. By making these cache partition size adjustments in real time, the cache management system 128 can dynamically adapt the cache partition sizes to traffic characteristics to increase the performance of the storage system 100 as a whole.
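One possible shape for such a periodic evaluation cycle is sketched below. The greedy pairing of the partition with the largest potential gain against the partition with the smallest potential loss is only one illustrative strategy, not the specific policy of the cache management system 128; the partition object with phga, phlo, grow(), shrink(), and reset_counters() attributes is hypothetical.

```python
import time

def evaluation_loop(partitions, delta, evaluation_period_seconds=300):
    """Illustrative periodic evaluation: every period, move delta slots from
    the partition that would lose the fewest hits to the partition that would
    gain the most hits, when the trade increases the overall hit count."""
    while True:
        time.sleep(evaluation_period_seconds)
        best_gainer = max(partitions, key=lambda p: p.phga)  # most to gain from delta extra slots
        best_loser = min(partitions, key=lambda p: p.phlo)   # least to lose from delta fewer slots
        if best_gainer is not best_loser and best_gainer.phga > best_loser.phlo:
            best_loser.shrink(delta)
            best_gainer.grow(delta)
        for p in partitions:
            p.reset_counters()  # start a fresh measurement window
```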
The methods described herein may be implemented as software configured to be executed in control logic such as contained in a Central Processing Unit (CPU) or Graphics Processing Unit (GPU) of an electronic device such as a computer. In particular, the functions described herein may be implemented as sets of program instructions stored on a non-transitory tangible computer readable storage medium. The program instructions may be implemented utilizing programming techniques known to those of ordinary skill in the art. Program instructions may be stored in a computer readable memory within the computer or loaded onto the computer and executed on the computer's microprocessor. However, it will be apparent to a skilled artisan that all logic described herein can be embodied using discrete components, integrated circuitry, programmable logic used in conjunction with a programmable logic device such as a Field Programmable Gate Array (FPGA) or microprocessor, or any other device including any combination thereof. Programmable logic can be fixed temporarily or permanently in a tangible computer readable medium such as random-access memory, a computer memory, a disk, or other storage medium. All such embodiments are intended to fall within the scope of the present invention.
Throughout the entirety of the present disclosure, use of the articles “a” or “an” to modify a noun may be understood to be used for convenience and to include one, or more than one of the modified noun, unless otherwise specifically stated.
Elements, components, modules, and/or parts thereof that are described and/or otherwise portrayed through the figures to communicate with, be associated with, and/or be based on, something else, may be understood to so communicate, be associated with, and/or be based on in a direct and/or indirect manner, unless otherwise stipulated herein.
Various changes and modifications of the embodiments shown in the drawings and described in the specification may be made within the spirit and scope of the present invention. Accordingly, it is intended that all matter contained in the above description and shown in the accompanying drawings be interpreted in an illustrative and not in a limiting sense. The invention is limited only as defined in the following claims and the equivalents thereto.