The concept of using a data cache for significantly faster data access is well known in computing technology. For example, current computer-related storage systems tend to use one or two layers of caching in a storage hierarchy, e.g., a RAM cache above a hard disk, and a hard disk cache above a network connection. Most caching solutions are thus based upon techniques for RAM or hard disk caching.
Solid-state nonvolatile memory (e.g., flash memory) is a relatively new type of storage media whose peculiarities with respect to handling data writes make it particularly inappropriate for use with caching algorithms that were designed for RAM or hard disk caches. For example, the cost of solid-state device memory per storage unit is considerably higher than hard disk cost. Further, solid-state storage class devices are not configured for byte-level access, but rather for page-based access, with page-based copy-on-write and garbage collection techniques used, which slows down access and causes increased device wear.
With respect to solid-state device wear, solid-state device media wearing from data writes makes a solid-state device-based cache unsuitable for use as a cache with a high amount of churn, as the correspondingly large amount of data writes results in a significantly decreased solid-state device lifetime. As a result, a solid-state device may not last up to the next hardware replacement cycle in a data center, for example. Note that multi-level cell (MLC) flash devices are on the order of three to four times less expensive than single-level cell (SLC) flash devices, and while this may mitigate some cost considerations, MLC flash devices have a shorter device lifetime than SLC flash devices.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which objects are written to a multi-tiered cache having cache tiers with different access properties. Objects are written to a selected tier of the cache based upon object-related properties and/or cache-related properties.
In one aspect, a cache hierarchy includes a plurality of tiers having different access properties, including a volatile cache tier and at least one non-volatile cache tier. Objects are associated with object scores and each tier is associated with a minimum access scoring threshold. Each object is stored in a highest priority level tier having a minimum access scoring threshold that allows storing the object based on the object score.
In one aspect, there is described maintaining a plurality of logs, including writing objects to an active log. The active log is sealed upon reaching a target size, with a new active log opened. Garbage collecting is performed on a sealed log, which may be selected based on an amount of garbage therein relative to the amount of garbage in at least one other sealed log.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards a cache management technology that is aware of object properties and storage media characteristics, including solid-state device properties. The technology is based upon organizing storage in a tiered cache of layers across RAM, solid-state device memory, and (possibly) hard disk based upon the storage media characteristics and/or object properties. In one aspect, the storage is managed to make the solid-state device media last a desired lifetime.
In one aspect, the technology considers media-specific properties, e.g., cost, capacity, access properties such as set up time (seek time, control setup time), and throughput of each type of media (RAM, solid-state device memory, hard disk), and/or object properties (size, access patterns), to optimize placement of cached data in an appropriate tier of the cache. In other words, the technology attempts to optimize for the properties of each type of storage media when caching data in a tiered caching hierarchy. For example, in one embodiment, only frequently used (“hot”) objects are written to the solid-state device, the solid-state device is organized into a multiple log structured organization, and a configurable write budget policy is enforced to control device lifetime.
It should be understood that any of the examples herein are non-limiting. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and data caching in general.
It should be noted that rather than using a write-through cache with respect to an object being written to ground truth storage, a write-back caching model may be used, in which any “dirty” (modified or newly created) object need only be maintained in some non-volatile storage. This leverages the non-volatility of the solid-state device cache and any hard disk cache, as no dirtied object is lost in the event of a power failure. A clean object, which is unchanged with respect to the copy backed in non-volatile storage, may reside in RAM, as the loss of such an object in a power failure does not matter.
As represented in
As represented in
Turning to a description of the flow of objects through the cache hierarchy, initially, as the system starts out, objects are cached in RAM 112. The RAM metadata cache 116 may be used to track the hotness (the hit count) of objects, whether or not they are cached. When evicted, objects in the RAM that are sufficiently hot (having established their hotness as tracked in the RAM metadata cache) are candidates for copying into the solid-state device cache.
More particularly, eventually the RAM cache will fill up to some threshold level (e.g., ninety percent full), whereby one or more objects need to be evicted from the RAM 112 to bring the RAM 112 to a less full state (e.g., eighty percent full). Objects to be evicted are selected by the corresponding tier's eviction expert as described below.
Note that in a typical system there are likely too many objects for a metadata cache to track the hit count/score of each object. Instead, as described herein, the metadata cache may be sized to handle a limited number of objects, which, for example, may be based upon some multiple of the solid-state device memory's capacity, e.g., two times or four times. For example, if the solid-state device capacity is generally sufficient to cache ten million objects, the metadata cache may be sized to contain twenty (or forty) million entries. The metadata cache may also be adaptively resized (e.g., based on thrashing rate and/or hit rate).
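By way of a non-limiting illustration, the bounded metadata cache described above may be sketched in Python as follows; the `MetadataCache` class, its parameters, and the least-recently-seen replacement behavior are hypothetical details chosen for the sketch, not mandated by this description:

```python
from collections import OrderedDict

class MetadataCache:
    """Bounded hit-count tracker, sized as a multiple of the number of
    objects the solid-state tier is estimated to hold (a hypothetical
    sketch; the eviction of tracking entries here is least-recently-seen)."""

    def __init__(self, ssd_object_capacity, multiple=2):
        self.capacity = ssd_object_capacity * multiple
        self.hits = OrderedDict()  # object key -> hit count

    def record_access(self, key):
        # Re-inserting moves the key to the most-recently-seen position.
        count = self.hits.pop(key, 0) + 1
        self.hits[key] = count
        if len(self.hits) > self.capacity:
            self.hits.popitem(last=False)  # drop least-recently-seen entry
        return count

    def hotness(self, key):
        # Untracked objects have not established any hotness.
        return self.hits.get(key, 0)
```

Adaptive resizing (e.g., on thrashing or hit rate) could simply adjust `capacity` and trim the `OrderedDict` accordingly.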
Steps 308 and 310 represent considering whether to evict objects from the hard drive cache. Note that unlike the example represented in
Step 312 represents writing the object into the RAM cache, with step 314 representing the updating of the metadata cache to reflect that the object that now exists in the RAM cache has been accessed.
For any cache it is feasible to use selective allocation that is integrated with eviction. For example, space may be allocated to a new object only if its score is higher than that of the worst-scored object currently in the cache (example scoring is described below). Thus, if such a mechanism is in place, an attempt to write the object into the solid-state device or the hard drive may be made instead.
Steps 316 and 318 are directed towards determining whether the RAM cache is full and an eviction operation needs to be performed. Note that step 316 need not be performed on each write to the RAM cache, but rather based upon some other triggering mechanism.
In general, eviction from any cache is based upon finding the objects having the lowest scores. A score can be a hit count, or a more complex value based upon other factors.
More particularly, instead of valuing an object based on a score that is only a hit count, more complex object scoring may be used to determine whether to evict an object from a given cache. One general score formula for an object is based on its access-likelihood (e.g., a hit count with some aging adjustment) times the retrieval cost of the object (e.g., a delta between solid-state device access time and hard disk access time), divided by the benefit of evicting the object from the cache (e.g., the larger the size of the object, the more space freed up by eviction).
This may be represented as:
Score(p)=L(p)*C(p,c)/B(p),
in which L(p) is the access-likelihood of object p, C(p, c) is the retrieval cost of object p for cache tier c, and B(p) is the benefit from evicting object p from cache.
The retrieval cost may be computed as:
C(p,c)=seek_cl−seek_c+size(p)/throughput_cl−size(p)/throughput_c,
where C(p, c) is the difference in retrieval cost, c is the cache tier to be evaluated, cl is the next lower priority cache tier, seek_c is the set up time of the cache tier to be evaluated, seek_cl is the set up time of the lower cache tier, size(p) is the size of the object, and throughput_c is the throughput of cache device c. The benefit B(p) from evicting the object p from cache may be the size of the object, B(p)=size(p).
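The scoring formulas above may be sketched directly in Python; this is a non-limiting illustration (function names, parameter names, and the example numbers in the comments are hypothetical):

```python
def retrieval_cost(seek_lower, seek_tier, size, throughput_lower, throughput_tier):
    """C(p, c): extra time to fetch the object from the next-lower tier
    instead of tier c (set-up time delta plus transfer time delta)."""
    return (seek_lower - seek_tier
            + size / throughput_lower - size / throughput_tier)

def score(access_likelihood, size, seek_lower, seek_tier,
          throughput_lower, throughput_tier):
    """Score(p) = L(p) * C(p, c) / B(p), taking B(p) = size(p)."""
    c = retrieval_cost(seek_lower, seek_tier, size,
                       throughput_lower, throughput_tier)
    return access_likelihood * c / size

# Hypothetical numbers: hard disk below the solid-state tier, with a
# 10 ms vs. 0.1 ms set-up time and 100 MB/s vs. 500 MB/s throughput,
# for a 1 MB object accessed with likelihood 10.
example = score(10, 1e6, 0.01, 0.0001, 1e8, 5e8)
```

A higher score thus favors keeping small, frequently accessed objects whose retrieval from the lower tier would be comparatively expensive.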
Another suitable scoring algorithm for an object p comprises:
Score(p)=freq(p)*Delta-Access-time(p)/size(p).
Such a score can be computed on demand, which allows for factoring in aging as part of freq(p) into the above scoring algorithm.
Note that frequency values are stored with a limited number of bits (and thus have a maximum frequency value). Each time the object is accessed, the access frequency value is incremented, up to the maximum. Periodically, the access frequency value of all objects may be reduced, e.g., by half.
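The limited-bit frequency counter with periodic halving may be sketched as follows; this is a non-limiting illustration in Python (the `FreqCounter` class and its 4-bit default are hypothetical choices):

```python
class FreqCounter:
    """Access-frequency values stored in a limited number of bits,
    saturating at the maximum, with periodic halving as the aging step."""

    def __init__(self, bits=4):
        self.max_value = (1 << bits) - 1   # e.g., 15 for 4 bits
        self.counts = {}

    def hit(self, key):
        # Increment up to the maximum representable value.
        self.counts[key] = min(self.counts.get(key, 0) + 1, self.max_value)

    def age(self):
        # Periodically halve all counters; entries that reach zero
        # are dropped, freeing tracking space for new objects.
        self.counts = {k: v // 2 for k, v in self.counts.items() if v // 2 > 0}
```

The halving step bounds the influence of old accesses, so a once-hot object that goes cold loses its score within a few aging periods.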
Real-time eviction is one option, e.g., searching the cache metadata with the key being an object score to find the lowest scored object. As a new object comes in, the object having the lowest score is evicted to the solid-state device cache or hard drive as described herein.
Then, starting at a random location and a first object at that location, for example (step 404), objects at or below the eviction threshold score are evicted via step 406. Eviction continues via steps 418 and 420 until the desired level of occupancy is reached; (in the unlikely event that the sampling was such that not enough objects were evicted, a new sample may be taken). Starting at the random location ensures that over time no object will escape an eviction evaluation.
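The random-start scan described above may be sketched as follows; this is a non-limiting Python illustration (the function name and the `(key, score)` list representation are hypothetical):

```python
import random

def evict_from_random_start(objects, threshold, target_count):
    """Scan from a random start position, wrapping around, and evict
    objects whose score is at or below the threshold until target_count
    evictions are made. `objects` is a list of (key, score) pairs;
    returns the keys selected for eviction."""
    n = len(objects)
    start = random.randrange(n)   # random start: no object escapes evaluation over time
    evicted = []
    for i in range(n):
        key, s = objects[(start + i) % n]
        if s <= threshold:
            evicted.append(key)
            if len(evicted) >= target_count:
                break
    return evicted
```

If the sample yields too few evictions, the caller may re-sample with a new threshold, as noted above.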
As described herein, objects that are evicted from the RAM cache 104, e.g., those least recently used or least frequently used, only enter the solid-state device cache 106 if they have established their hotness. Otherwise, such objects enter the hard disk cache 108. Other criteria for entering the solid-state device cache 106 also need to be met, including meeting a solid-state device budget policy (described below) and an object size limitation (step 408). More particularly, objects beyond a certain (configurable) size threshold are not entered into the solid-state device cache tier 106. This is generally because hard disks work well for sequential access of large objects by amortizing seek times over transfer times; entering such objects into the solid-state device cache tier 106 does not provide access times that justify the cost of writing to the solid-state device. The size threshold may be set by an administrator or the like, e.g., via experimentation, analysis, and/or tuning.
Note that hotness may be combined with the solid-state device budget evaluation (step 410), as described below. Thus steps 408 and 410 need to be met before eviction to the solid-state device occurs at step 412; (a “not less than the lowest score” rule may also be enforced). Otherwise eviction is to the hard drive at step 414. Step 416 represents updating the metadata/any indexes as appropriate, e.g., to reflect the eviction.
Turning to write budgeting with respect to solid-state device cache writes, the amount of writes to the solid-state device is controllable to limit wear by enforcing a budget per time period, such as writes per day or writes per hour. In general, when there is no write budget left within a period, writes are denied until the budget is refilled in the next period. Unused budget may roll over into the next time period.
In one implementation, a starting budget value (number of allowed writes) per time period may be based upon a desired lifetime of the solid-state device, based on predicted lifetimes for different solid-state device media, and/or for writes relative to reads. This value may change over time based upon actual usage, e.g., during a transition time. For example, if budget is unused (or tends to be unused over multiple time periods), the threshold score (or hit count) that an object needs to meet for writing to the solid-state device may be decreased. Conversely, if a budget is denying writes (or tends to deny writes over multiple time periods), the threshold score for entering the cache may be increased. Thus, the budget policy adaptively adjusts the volume of writes to naturally maximize performance while not exceeding media durability.
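The budget-with-rollover policy and the adaptive admission threshold may be sketched as follows; this is a non-limiting Python illustration (the `WriteBudget` class and the 10% threshold adjustment factors are hypothetical tuning choices):

```python
class WriteBudget:
    """Per-period write budget for the solid-state tier, with rollover
    of unused budget and adaptive adjustment of the admission threshold."""

    def __init__(self, writes_per_period, threshold):
        self.per_period = writes_per_period
        self.remaining = writes_per_period
        self.threshold = threshold   # minimum score to enter the SSD tier
        self.denied = 0              # budget-exhaustion denials this period

    def try_write(self, score):
        if score < self.threshold:
            return False             # object not hot enough for this tier
        if self.remaining <= 0:
            self.denied += 1         # budget exhausted for this period
            return False
        self.remaining -= 1
        return True

    def next_period(self):
        """Refill the budget (unused budget rolls over) and adapt the threshold."""
        if self.denied > 0:
            self.threshold *= 1.1    # writes were denied: be more selective
        elif self.remaining > 0:
            self.threshold *= 0.9    # budget underused: admit more objects
        self.remaining += self.per_period
        self.denied = 0
```

Lowering the threshold when budget goes unused and raising it when writes are denied converges the write rate toward the budget without exceeding it.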
As represented in
In one implementation, to avoid excessive data movement within the solid-state device (or hard drive), which causes wear, when the target size of the active log is reached as evaluated at step 504, at step 506 the active log is sealed (from new writes) and a new log is opened to be the currently active log.
As sealed logs do not receive writes but can have objects evicted therefrom by marking them as deleted for later garbage collection, over time sealed logs become more and more sparse with respect to useable data. Garbage collection, represented by steps 508 and 510 in
In one implementation as represented by step 510, garbage collection is performed only on the sealed log that has the most garbage (i.e., is the most sparse), by copying its non-evicted data to the active log. The garbage collected sealed log's storage space may then be reclaimed for subsequent allocation. Objects evicted out of solid-state device cache 106 enter the hard disk cache 108. Objects evicted from the hard disk cache may be discarded, as long as a ground truth storage layer exists elsewhere, e.g., elsewhere on the hard disk, in a cloud storage, a server, and the like.
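The multiple-log organization, sealing at a target size, eviction marking, and most-garbage-first collection may be sketched together as follows; this is a non-limiting Python illustration (the `LogStore` class and its in-memory dictionaries stand in for the on-device log layout, which is not specified at this level of detail):

```python
class LogStore:
    """Multiple log-structured organization: one active log receives
    writes; a log is sealed when it reaches a target size; garbage
    collection reclaims the sealed log with the most garbage."""

    def __init__(self, target_size):
        self.target_size = target_size
        self.active = {}     # key -> object size (live data in active log)
        self.sealed = []     # each entry: {"live": {key: size}, "garbage": bytes}

    def write(self, key, size):
        self.active[key] = size
        if sum(self.active.values()) >= self.target_size:
            # Seal the current log (no further writes) and open a new one.
            self.sealed.append({"live": dict(self.active), "garbage": 0})
            self.active = {}

    def evict(self, key):
        # Sealed logs receive no writes, but objects may be marked as
        # deleted therein, accumulating garbage for later collection.
        for log in self.sealed:
            size = log["live"].pop(key, None)
            if size is not None:
                log["garbage"] += size

    def collect(self):
        """Garbage-collect the sealed log with the most garbage,
        copying its non-evicted objects into the active log."""
        if not self.sealed:
            return
        victim = max(self.sealed, key=lambda log: log["garbage"])
        self.sealed.remove(victim)
        for key, size in victim["live"].items():
            self.write(key, size)
```

Because only live data is relocated, collecting the sparsest log first minimizes the wear-inducing copy traffic per byte reclaimed.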
Via the backend storage interface 110, the durability of writes can be guaranteed by always writing objects to the backend store 109 (using a “write-through” policy) and invalidating object copies in the cache hierarchy 102. Alternatively, the nonvolatile nature of some layers of the cache hierarchy 102 allows a “write-back” policy to the backend store 109. Thus, in this alternative, different layers in the multi-tiered cache may contain different versions of the object. Because lookups check the hierarchy in order, from RAM to solid-state device to hard disk, the timeline of versions for the same object in the cache needs to follow this order, with the most recent version in the highest layer (among the layers where it is present).
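The ordered-lookup invariant under write-back may be sketched as follows; this is a non-limiting Python illustration in which plain dictionaries stand in for the RAM, solid-state device, and hard disk tiers (the function names are hypothetical):

```python
def lookup(key, tiers):
    """Check tiers in priority order (RAM, then solid-state device, then
    hard disk). Under write-back, the first copy found is guaranteed to
    be the most recent version, so the search stops at the first hit."""
    for tier in tiers:
        if key in tier:
            return tier[key]
    return None

def write_back(key, value, tiers):
    """Write the new version into the highest tier. Older copies may
    remain in lower tiers without harm, because lookups stop at the
    first (highest) hit, preserving the version-ordering invariant."""
    tiers[0][key] = value
```

This shows why the write-back alternative tolerates multiple versions in the hierarchy: correctness follows from the lookup order, not from eagerly invalidating lower copies.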
As can be seen, there is described a multi-tiered, multi-storage media cache hierarchy that leverages the benefits/peculiarities of each type of storage medium. For example, to achieve a desired solid-state device lifetime, a write budget policy may be built into an object scoring function so that cache churn rate (insertion and eviction) decisions are made without wearing out the device faster than desired.
The cache hierarchy is also aware of an object's properties (e.g., object size and access patterns such as hot or cold, random or sequential, read or write) in deciding where to place the object in the cache hierarchy. For example, a metadata cache is used to track the hotness of some number of objects (e.g., based on a multiple of the cache size), with a write allocation in the cache only made when the object proves itself to be hot. This is particularly applicable to the solid-state device cache because writes reduce solid-state device lifetime.
Also described is a multi-log structured organization of variable sized objects on secondary storage (solid-state device or hard disk); only one log is active at a time for sequential writes. Logs are garbage collected in “highest fraction of garbage content” order first.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 610 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 610 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, solid-state device memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 610. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 630 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 631 and random access memory (RAM) 632. A basic input/output system 633 (BIOS), containing the basic routines that help to transfer information between elements within computer 610, such as during start-up, is typically stored in ROM 631. RAM 632 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 620. By way of example, and not limitation,
The computer 610 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 610 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 680. The remote computer 680 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 610, although only a memory storage device 681 has been illustrated in
When used in a LAN networking environment, the computer 610 is connected to the LAN 671 through a network interface or adapter 670. When used in a WAN networking environment, the computer 610 typically includes a modem 672 or other means for establishing communications over the WAN 673, such as the Internet. The modem 672, which may be internal or external, may be connected to the system bus 621 via the user input interface 660 or other appropriate mechanism. A wireless networking component 674 such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 610, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 699 (e.g., for auxiliary display of content) may be connected via the user input interface 660 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 699 may be connected to the modem 672 and/or network interface 670 to allow communication between these systems while the main processing unit 620 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.