Various embodiments of the present application generally relate to the field of managing data storage systems. More specifically, various embodiments of the present application relate to methods and systems for deduplicating a cached hybrid storage aggregate.
The proliferation of computers and computing systems has resulted in a continually growing need for reliable and efficient storage of electronic data. A storage server is a specialized computer that provides storage services related to the organization and storage of data. The data is typically stored on writable persistent storage media, such as non-volatile memories and disks. The storage server may be configured to operate according to a client/server model of information delivery to enable many clients or applications to access the data served by the system. The storage server can employ a storage architecture that serves the data with both random and streaming access patterns at either a file level, as in network attached storage (NAS) environments, or at the block level, as in a storage area network (SAN).
The various types of non-volatile storage media used by a storage server can have different latencies. Access time (or latency) is the period of time required to retrieve data from the storage media. In many cases, data is stored on hard disk drives (HDDs), which have a relatively high latency. In HDDs, disk access time includes the disk spin-up time, the seek time, rotational delay, and data transfer time. In other cases, data is stored on solid-state drives (SSDs). SSDs generally have lower latencies than HDDs because SSDs do not have the mechanical delays inherent in the operation of the HDD. HDDs generally provide good performance when reading large blocks of data that are stored sequentially on the physical media. However, HDDs do not perform as well for random accesses because the mechanical components of the device must frequently move to different physical locations on the media.
SSDs typically use solid-state memory, such as non-volatile flash memory, to store data. With no moving parts, SSDs typically provide better performance for random and frequent memory accesses because of the relatively low latency. However, SSDs are generally more expensive than HDDs and sometimes have a shorter operational lifetime due to wear and other degradation. These additional upfront and replacement costs can become significant for data centers which have many storage servers using many thousands of storage devices.
Hybrid storage aggregates combine the benefits of HDDs and SSDs. A storage “aggregate” is a logical aggregation of physical storage; i.e., a logical container for a pool of storage, combining one or more physical mass storage devices or parts thereof into a single logical storage object, which contains or provides storage for one or more other logical data sets at a higher level of abstraction (e.g., volumes). In some hybrid storage aggregates, relatively expensive SSDs make up part of the hybrid storage aggregate and provide high performance, while relatively inexpensive HDDs make up the remainder of the storage array. In some cases other combinations of storage devices with various latencies may also be used in place of or in combination with the HDDs and SSDs. These other storage devices include non-volatile random access memory (NVRAM), tape drives, optical disks and micro-electro-mechanical (MEMS) storage devices. Because the low latency (i.e., SSD) storage space in the hybrid storage aggregate is limited, the benefit associated with the low latency storage is maximized by using it for storage of the most frequently accessed (i.e., “hot”) data. The remaining data is stored in the higher latency devices. Because data and data usage change over time, determining which data is hot and should be stored in the lower latency devices is an ongoing process. Moving data between the high and low latency devices is a multi-step process that requires updating of pointers and other information that identifies the location of the data.
In some cases, the lower latency storage is used as a cache for the higher latency storage. In these configurations, copies of the most frequently accessed data are stored in the cache. When a data access is performed, the faster cache may first be checked to determine if the required data is located therein, and, if so, the data may be accessed from the cache. In this manner, the cache reduces overall data access times by reducing the number of times the higher latency devices must be accessed. In some cases, cache space is used for data which is being frequently written (i.e., a write cache). Alternatively, or additionally, cache space is used for data which is being frequently read (i.e., read cache). The policies for management and operation of read caches and write caches are often different.
In order to more efficiently use the available data storage space in a storage system and minimize costs, various techniques are used to compress data and/or minimize the number of instances of duplicate data. Data deduplication is one method of removing duplicate instances of data from the storage system. Data deduplication is a technique for eliminating coarse-grained redundant data. In a deduplication process, blocks of data are compared to other blocks of data stored in the system. When two or more identical blocks of data are identified, the redundant block(s) are deleted or otherwise released from the system. The metadata associated with the deleted block(s) is modified to point to the instance of the data block which was not deleted. In this way, two or more applications or files can utilize the same block of data for different purposes. The deduplication process saves storage space by coalescing the duplicate data blocks and coordinating the sharing of a single instance of the data block. However, performing deduplication in a hybrid storage aggregate without taking the caching statuses of the data blocks into account may inhibit or counteract the performance benefits of using caches.
Methods and apparatuses for performing deduplication in a hybrid storage aggregate are introduced here. These techniques involve deduplicating hybrid storage aggregates in manners which take the caching statuses of the blocks to be deduplicated into account. Data blocks may be deduplicated differently depending on whether they are read cache blocks, read cached blocks, write cache blocks, or blocks which do not have any caching status. Taking these statuses into account enables the system to obtain the space-saving benefits of deduplication while preserving the performance benefits of caching. If deduplication is implemented without taking these statuses into account, those performance benefits may be counteracted.
In one example, such a method includes operating a hybrid storage aggregate that includes a plurality of tiers of different types of physical storage media. The method includes identifying a first storage block and a second storage block of the hybrid storage aggregate that contain identical data and identifying caching statuses of the first storage block and the second storage block. The method also includes deduplicating the first storage block and the second storage block based on the caching statuses of the first storage block and the second storage block. The implementation of the deduplication process may vary for each pair of blocks depending on whether the blocks are read cache blocks, read cached blocks, or write cache blocks. As used herein, a “read cache block” generally refers to a data block in a lower latency tier of the storage system which is serving as a higher performance copy of the “read cached block” which is in a higher latency tier of the storage system. A “write cache” block generally refers to a data block which is located in the lower latency tier for purposes of write performance.
In another example, a storage server system comprises a processor, a hybrid storage aggregate, and a memory. The hybrid storage aggregate includes a first tier of storage and a second tier of storage. The first tier of storage has a lower latency than the second tier of storage. The memory is coupled with the processor and includes a storage manager. The storage manager directs the processor to identify a first storage block and a second storage block in the hybrid storage aggregate that contain duplicate data. The storage manager then identifies caching relationships associated with the first storage block and the second storage block and deduplicates the first and the second storage blocks based on the caching relationships.
If deduplication is performed without taking the caching relationships into account, the performance benefit associated with the caching may be diminished or eliminated. For example, one block of hot data may be cached in a low latency storage tier for performance reasons. Another data block, which is a duplicate of the hot data block, may be stored in the high latency tier. If the caching status is not taken into account, the deduplication process may result in removal of the hot data block from the low latency tier and modification of the metadata associated with the hot data block such that accesses to the data block are directed to the duplicate copy in the high latency tier. This outcome reduces or removes the performance benefit of the hybrid storage aggregate. Therefore, it is beneficial to perform the deduplication in a manner which preserves the hybrid storage aggregate performance benefit. In some cases, the deduplication process may vary further depending on whether the block(s) are being used as read cache or write cache blocks.
Embodiments introduced here also include other methods, systems with various components, and non-transitory machine-readable storage media storing instructions which, when executed by one or more processors, direct the one or more processors to perform the methods, variations of the methods, or other operations described herein. While multiple embodiments are disclosed, still other embodiments will become apparent to those skilled in the art from the following detailed description, which shows and describes illustrative embodiments of the invention. As will be realized, the invention is capable of modifications in various aspects, all without departing from the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not restrictive.
Embodiments of the present invention will be described and explained through the use of the accompanying drawings in which:
The drawings have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be expanded or reduced to help improve the understanding of the embodiments of the present invention. Similarly, some components and/or operations may be separated into different blocks or combined into a single block for the purposes of discussion of some of the embodiments of the present invention. Moreover, while the invention is amenable to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are described in detail below. The intention, however, is not to limit the invention to the particular embodiments described. On the contrary, the invention is intended to cover all modifications, equivalents, and alternatives falling within the scope of the invention as defined by the appended claims.
Some data storage systems include persistent storage space which is made up of different types of storage devices with different latencies. The low latency devices offer better performance but typically have cost and/or other drawbacks. Implementing a portion of the system with low latency devices provides some performance improvement without incurring the cost or other limitations associated with implementing the entire storage system with these types of devices. The system performance improvement may be optimized by selectively caching the most frequently accessed data (i.e., the hot data) in the lower latency devices. This maximizes the number of reads and writes to the system which will occur in the faster, lower latency devices. The storage space available in the lower latency devices may be used to implement a read cache, a write cache, or both.
In order to make the most efficient use of the available storage space, various types of data compression and consolidation are often implemented. Data deduplication is one method of removing duplicate instances of data from the storage system in order to free storage space for additional, non-duplicate data. In the deduplication process, blocks of data are compared to other blocks of data stored in the system. When identical blocks of data are identified, the redundant block is replaced with a pointer or reference that points to the remaining stored block. Two or more applications or files can then share the same stored block of data. The deduplication process saves storage space by coalescing these duplicate data blocks and coordinating the sharing of a single remaining instance of the block. However, performing deduplication on data blocks without taking into account whether those blocks are cache or cached blocks may have detrimental effects on the performance gains associated with the hybrid storage aggregate. As used herein, a “block” of data is a contiguous set of data of a known length starting at a particular address value. In certain embodiments, each level 0 block is 4 kBytes in length. However, the blocks could be other sizes.
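By way of illustration only, the basic coalescing step described above may be sketched as follows. The sketch is not the claimed implementation: it assumes blocks are held in a simple in-memory dictionary, that duplicates are found by comparing SHA-256 fingerprints followed by a byte-for-byte check, and that the names `deduplicate`, `blocks`, and `pointers` are hypothetical. It deliberately ignores caching status.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 kByte level 0 blocks, as in the embodiment above


def deduplicate(blocks, pointers):
    """Coalesce identical blocks and return the released block numbers.

    blocks:   dict of physical block number -> bytes (hypothetical layout)
    pointers: dict of logical reference -> physical block number
    """
    kept_by_fingerprint = {}
    remap = {}
    for pbn in sorted(blocks):
        data = blocks[pbn]
        fingerprint = hashlib.sha256(data).digest()
        kept = kept_by_fingerprint.setdefault(fingerprint, pbn)
        # Compare the bytes too, so that a fingerprint collision can never
        # merge blocks that actually differ.
        if kept != pbn and blocks[kept] == data:
            remap[pbn] = kept
    # Redirect every reference to a redundant block to the surviving copy.
    for ref in list(pointers):
        pointers[ref] = remap.get(pointers[ref], pointers[ref])
    # Release the redundant blocks.
    for pbn in remap:
        del blocks[pbn]
    return set(remap)
```

In this sketch the lowest-numbered copy of each duplicate set survives; that choice is arbitrary and is exactly the decision that a cache-aware deduplication process makes differently.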
The techniques introduced here resolve these and other problems by deduplicating the hybrid storage aggregate based on the caching statuses of the blocks being deduplicated. Deduplication often involves deleting, removing, or otherwise releasing one of the duplicate blocks. In some cases, one of the duplicate blocks is read cached in the lower latency storage and the performance benefits are maintained by deleting the duplicate block which is not read cached. In other cases, one of the duplicate blocks is write cached and the deduplication process improves performance of the system, without deleting one of the duplicate blocks, by extending the performance benefit of the write cached block to the identified duplicate instance of the block.
Storage server system 130 includes storage server 140, HDD 150A, HDD 150B, SSD 160A, and SSD 160B. Storage server system 130 may also include other devices or storage components of different types which are used to manage, contain, or provide access to data or data storage resources. Storage server 140 is a computing device that includes a storage operating system that implements one or more file systems. Storage server 140 may be a server-class computer that provides storage services relating to the organization of information on writable, persistent storage media such as HDD 150A, HDD 150B, SSD 160A, and SSD 160B. HDD 150A and HDD 150B are hard disk drives, while SSD 160A and SSD 160B are solid-state drives (SSDs).
A typical storage server system will include many more HDDs or SSDs than are illustrated in
Storage server 140 performs deduplication on data stored in HDD 150A, HDD 150B, SSD 160A, and SSD 160B according to embodiments of the invention described herein. The teachings of this description can be adapted to a variety of storage server architectures including, but not limited to, a network-attached storage (NAS), storage area network (SAN), or a disk assembly directly-attached to a client or host computer. The term “storage server” should therefore be taken broadly to include such arrangements.
Hybrid storage aggregate 280 is a logical aggregation of the storage in HDD array 250 and SSD array 260. In this example, hybrid storage aggregate 280 is a collection of RAID groups which may include one or more volumes. RAID module 270 organizes the HDDs and SSDs within a particular volume as one or more parity groups (e.g., RAID groups) and manages placement of data on the HDDs and SSDs. RAID module 270 further configures RAID groups according to one or more RAID implementations to provide protection in the event of failure of one or more of the HDDs or SSDs. The RAID implementation enhances the reliability and integrity of data storage through the writing of data “stripes” across a given number of HDDs and/or SSDs in a RAID group including redundant information (e.g., parity). HDD controller 254 and SSD controller 264 perform low level management of the data which is distributed across multiple physical devices in their respective arrays. RAID module 270 uses HDD controller 254 and SSD controller 264 to respond to requests for access to data in HDD array 250 and SSD array 260.
Memory 220 includes storage locations that are addressable by processor 240 for storing software programs and data structures to carry out the techniques described herein. Processor 240 includes circuitry configured to execute the software programs and manipulate the data structures. Storage manager 224 is one example of this type of software program. Storage manager 224 directs processor 240 to, among other things, implement one or more file systems. Processor 240 is also interconnected to network interface 292. Network interface 292 enables other devices or systems to access data in hybrid storage aggregate 280.
In one embodiment, storage manager 224 implements data placement or data layout algorithms that improve read and write performance in hybrid storage aggregate 280. Storage manager 224 may be configured to relocate data between HDD array 250 and SSD array 260 based on access characteristics of the data. For example, storage manager 224 may relocate data from HDD array 250 to SSD array 260 when the data is determined to be hot, meaning that the data is frequently accessed, randomly accessed, or both. This is beneficial because SSD array 260 has lower latency and having the most frequently and/or randomly accessed data in the limited amount of available SSD space will provide the largest overall performance benefit to storage system 200.
In the context of this explanation, the term “randomly” accessed, when referring to a block of data, pertains to whether the block of data is accessed in conjunction with accesses of other blocks of data stored in the same physical vicinity as that block on the storage media. Specifically, a randomly accessed block is a block that is accessed not in conjunction with accesses of other blocks of data stored in the same physical vicinity as that block on the storage media. While the randomness of accesses typically has little or no effect on the performance of solid state storage media, it can have significant impacts on the performance of disk based storage media due to the necessary movement of the mechanical drive components to different physical locations of the disk. A significant performance benefit may be achieved by relocating a data block that is randomly accessed to a lower latency tier, even though the block may not be accessed frequently enough to otherwise qualify it as hot data. Consequently, the frequency of access and nature of the access (i.e., whether the accesses are random) may be jointly considered in determining which data should be relocated to a lower latency tier.
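As a non-limiting illustration, the joint consideration of access frequency and access randomness may be sketched as follows. Every particular of the sketch is an assumption made for illustration: the function name `classify_hot`, the thresholds, and the rule that an access counts as random when the immediately preceding access was not within `vicinity` blocks of it.

```python
from collections import Counter


def classify_hot(access_log, freq_threshold=4, random_threshold=2, vicinity=1):
    """Return the set of block numbers that qualify for the low latency tier.

    access_log is an ordered list of physical block numbers. The thresholds
    and the definition of "vicinity" are hypothetical tuning values.
    """
    total = Counter()
    random_hits = Counter()
    previous = None
    for pbn in access_log:
        total[pbn] += 1
        # An access is treated as random when it does not follow an access
        # to a physically nearby block.
        if previous is None or abs(pbn - previous) > vicinity:
            random_hits[pbn] += 1
        previous = pbn
    return {
        pbn
        for pbn in total
        if total[pbn] >= freq_threshold or random_hits[pbn] >= random_threshold
    }
```

With these assumed thresholds, a block that is read only twice can still qualify if both reads required a seek, which reflects the observation above that a randomly accessed block may merit relocation even when it is not accessed frequently.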
In another example, storage manager 224 may initially store data in the SSDs of SSD array 260. Subsequently, the data may become “cold” in that it is either infrequently accessed or frequently accessed in a sequential manner. As a result, it is preferable to move this cold data from SSD array 260 to HDD array 250 in order to make additional room in SSD array 260 for hot data. Storage manager 224 cooperates with RAID module 270 to determine initial storage locations, monitor data usage, and relocate data between the arrays as appropriate. The criteria for the threshold between hot and cold data may vary depending on the amount of space available in the low latency tier.
In at least one embodiment, data is stored by hybrid storage aggregate 280 in the form of logical containers such as volumes, directories, and files. A “volume” is a set of stored data associated with a collection of mass storage devices, such as disks, which obtains its storage from (i.e., is contained within) an aggregate, and which is managed as an independent administrative unit, such as a complete file system. Each volume can contain data in the form of one or more files, directories, subdirectories, logical units (LUNs), or other types of logical containers.
Files in hybrid storage aggregate 280 can be represented in the form of a buffer tree, such as buffer tree 300 in
A buffer tree includes one or more levels of indirect blocks that contain one or more pointers to lower-level indirect blocks and/or to the direct blocks. Determining the actual physical location of a block may require working through several levels of indirect blocks. In the example of buffer tree 300, the blocks designated as “Level 1” blocks are indirect blocks. These blocks point to the “Level 0” blocks which are the direct blocks of the file. Additional levels of indirect blocks are possible. For example, buffer tree 300 may include level 2 blocks which point to level 1 blocks. In some cases, some level 2 blocks of a group may point to level 1 blocks, while other level 2 blocks of the group point to level 0 blocks.
The root of buffer tree 300 is inode 322. An inode is a metadata container used to store metadata about the file, such as ownership of the file, access permissions for the file, file size, file type, and pointers to the highest-level of indirect blocks for the file. The inode is typically stored in a separate inode file. The inode is the starting point for finding the location of all of the associated data blocks. In the example illustrated, inode 322 references level 1 indirect blocks 324 and 325. Each of these indirect blocks stores at least one physical volume block number (PVBN) and a corresponding virtual volume block number (VVBN). For purposes of illustration, only one PVBN-VVBN pair is shown in each of indirect blocks 324 and 325. However, many PVBN-VVBN pairs may be included in each indirect block. Each PVBN references a physical block in hybrid storage aggregate 280 and the corresponding VVBN references the associated logical block number in the volume. In the illustrated embodiment, the PVBN in indirect block 324 references physical block 326 and the PVBN in indirect block 325 references physical block 328. Likewise, the VVBN in indirect block 324 references logical block 327 and the VVBN in indirect block 325 references logical block 329. Logical blocks 327 and 329 point to physical blocks 326 and 328, respectively.
A file block number (FBN) is the logical position of a block of data within a particular file. Each FBN maps to a VVBN-PVBN pair within a volume. Storage manager 224 implements a FBN to PVBN mapping. Storage manager 224 further cooperates with RAID module 270 to control storage operations of HDD array 250 and SSD array 260. Storage manager 224 translates each FBN into a PVBN location within hybrid storage aggregate 280. A block can then be retrieved from a storage device using topology information provided by RAID module 270.
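The FBN to PVBN translation described above may be pictured, purely as an illustrative sketch, with a per-file table of VVBN-PVBN pairs. The class and method names (`BlockPointer`, `FileMap`, `pvbn_for`, `relocate`) are hypothetical, and the sketch assumes that a relocation changes only the PVBN while the VVBN is left unchanged.

```python
from dataclasses import dataclass


@dataclass
class BlockPointer:
    """One VVBN-PVBN pair, as would be stored in an indirect block."""
    vvbn: int  # logical block number within the volume
    pvbn: int  # physical block number within the aggregate


class FileMap:
    """Hypothetical stand-in for the level 1 indirect blocks of one file."""

    def __init__(self):
        self._by_fbn = {}

    def set(self, fbn, vvbn, pvbn):
        self._by_fbn[fbn] = BlockPointer(vvbn=vvbn, pvbn=pvbn)

    def pvbn_for(self, fbn):
        """Translate a file block number into a physical location."""
        return self._by_fbn[fbn].pvbn

    def relocate(self, fbn, new_pvbn):
        """Record that the block now lives at a new physical location.

        Assumption: only the PVBN changes; the VVBN stays the same.
        """
        self._by_fbn[fbn].pvbn = new_pvbn
```

Keeping the physical location in a single pointer per file block is what allows a block to be moved, or two files to be pointed at one shared block, by updating metadata alone.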
When a block of data in HDD array 250 is moved to another location within HDD array 250, the indirect block associated with the block is updated to reflect the new location. However, inode 322 and the other indirect blocks may not need to be changed. Similarly, a block of data can be moved between HDD array 250 and SSD array 260 by copying the block to the new physical location and updating the associated indirect block with the new location. The various blocks that make up a file may be scattered among many non-contiguous physical locations and may even be split across different types of storage media such as those which make up HDD array 250 and SSD array 260. Throughout the remainder of this description, the changes to a buffer tree associated with movement of a data block will be described as changes to the metadata of the block to point to a new location. Changes to the metadata of a block may include changes to any one or any combination of the elements of the associated buffer tree.
In some cases, the block(s) which are deleted from the buffer tree through the deduplication process are referred to as recipient blocks. In the examples of
In one example, deduplication is performed by generating a unique fingerprint for each data block when it is stored. This can be accomplished by applying a hash function, such as SHA-256 or SHA-512, to the data block. Two or more identical data blocks will always have the same fingerprint. By comparing the fingerprints during the deduplication process, duplicate data blocks can be identified and coalesced as illustrated in
It should be understood that storage arrays including other types of storage devices may be substituted for one or both of HDD array 650 and SSD array 670. Furthermore, additional storage arrays may be added to provide a system which contains three or more tiers of storage each having latencies which differ from the other tiers. As in
A read cache block is a copy of a data block created in a lower latency storage tier for a data block which is currently being read frequently (i.e., the data block is hot). Because the block is being read frequently, incremental performance improvement can be achieved by placing a copy of the block in a lower latency storage tier and directing requests for the block to the lower latency storage tier. In
For example, when a request is received to read data block 663, cachemap 610 is first checked to see if a copy of data block 663 is available in SSD array 670. Cachemap 610 includes information indicating that data block 683 is available as a copy of data block 663 and provides its location, along with information about all of the other blocks which are stored in SSD array 670. In this case, because a copy of data block 663 is available, the read request is satisfied by reading data block 683. In other words, HDD array 650 is not accessed in the reading of data associated with data block 663. Data block 683 can be read more quickly than data block 663 due to the characteristics of SSD array 670. When data block 663 is no longer hot, the references to data block 663 and data block 683 are removed from cachemap 610. The physical storage space occupied by data block 683 can then be used for other hot data blocks or for other purposes.
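The read path described above may be sketched as follows, for illustration only. The sketch assumes that the cachemap can be modeled as a dictionary from the HDD block number to the SSD block number holding its copy; the class `HybridReader` and its methods and counters are hypothetical names introduced for this example.

```python
class HybridReader:
    """Toy model of a read cache in front of a higher latency tier."""

    def __init__(self, hdd, ssd):
        self.hdd = hdd        # dict: HDD block number -> bytes
        self.ssd = ssd        # dict: SSD block number -> bytes
        self.cachemap = {}    # dict: HDD block number -> SSD block number
        self.hdd_reads = 0
        self.ssd_reads = 0

    def cache(self, hdd_block, ssd_block):
        """Create a read cache block for a hot HDD block."""
        self.ssd[ssd_block] = self.hdd[hdd_block]
        self.cachemap[hdd_block] = ssd_block

    def evict(self, hdd_block):
        """Drop the cachemap entry and free the SSD space once data cools."""
        ssd_block = self.cachemap.pop(hdd_block)
        del self.ssd[ssd_block]

    def read(self, hdd_block):
        """Check the cachemap first; fall back to the HDD tier on a miss."""
        ssd_block = self.cachemap.get(hdd_block)
        if ssd_block is not None:
            self.ssd_reads += 1
            return self.ssd[ssd_block]
        self.hdd_reads += 1
        return self.hdd[hdd_block]
```

For example, after `cache(663, 683)` a call to `read(663)` is served from the SSD dictionary without touching the HDD dictionary, and after `evict(663)` the same call falls through to the HDD tier.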
In contrast, the deduplication process illustrated in
In addition, deleting or releasing data block 663 would disrupt the read cache arrangement which already exists because information stored in cachemap 610 already links data block 663 with data block 683. Consequently, it is most efficient to release or delete data block 664, rather than data blocks 663 or 683, in order to accomplish the deduplication. The metadata in indirect block 625A associated with data block 664 is updated to point to data block 663.
By selectively performing the deduplication based on the caching statuses of the data blocks, the caching benefit associated with data block 663 which was already in place has not only been preserved, but a duplicate benefit has been realized. Storage space is freed in HDD array 650 and the performance benefit of data block 683 is realized through reads associated with both inode 622A and inode 622B.
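The selection just described, in which the duplicate that is not read cached is the one released, may be sketched as follows. This is an illustrative sketch under stated assumptions: the cachemap is modeled as a dictionary keyed by the read cached block number, `pointers` is a hypothetical flat stand-in for the buffer tree metadata, and the handling of the case in which both blocks are read cached (dropping the redundant cachemap entry) is an assumption not taken from the description.

```python
def dedupe_read_cached_pair(a, b, cachemap, pointers):
    """Deduplicate two identical blocks while preserving a read cache link.

    Returns (kept, released). The block that already has a cachemap entry
    is kept, so the existing read cache block keeps serving its reads.
    """
    if b in cachemap and a not in cachemap:
        kept, released = b, a
    else:
        kept, released = a, b
    # Point every reference to the released block at the kept block.
    for ref in list(pointers):
        if pointers[ref] == released:
            pointers[ref] = kept
    # Assumption: if both blocks were read cached, the released block's
    # cachemap entry is now redundant and is dropped.
    cachemap.pop(released, None)
    return kept, released
```

Using the block numbers of the example above, a cachemap of `{663: 683}` and a call with blocks 664 and 663 keeps block 663, so reads through either file then resolve to 663 and are served from read cache block 683.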
In
While the deduplication process of
In the example of
For example, if data block 783 continues to be hot or is expected to continue to be hot, there is potentially little benefit in deduplicating it with data block 764. This is true because there is a high likelihood that the data will change the next time it is written. In other words, data block 783 and data block 764 may be the same at the moment and data block 764 could be deduplicated to data block 783 but data block 783 will likely change in a relatively short period of time. Once a change to the data block has occurred in conjunction with either inode 722A or inode 722B, the deduplication process would have to be reversed because the data blocks needed by the two inodes would no longer be the same. While this is true in any deduplication situation, the probability of it occurring is much higher in write cache situations because the block is already known to be one which is being frequently written. The overhead of performing the deduplication process on data blocks 764 and 783 may provide little or no benefit. In other words, it may be most beneficial to avoid deduplicating a write cache block as part of a deduplication process even though it is a duplicate of another data block in the file system.
Returning to step 810, if neither block is a write cache block, a next determination is made at step 820 to identify whether either block is read cached. If neither block is read cached, the two blocks are deduplicated at step 860. This is accomplished by modifying the metadata for a first one of the blocks to point to the other block; the first block is then deleted or otherwise released. Step 860 is performed in a manner similar to that discussed with respect to
Returning to step 820, if one of the blocks is read cached, the two blocks are deduplicated by modifying the metadata of one block to point to the other block at step 870. Metadata associated with the other block is also modified to point to the existing read cache block (i.e., a third data block in the SSD array which contains identical data to the two identified blocks). Step 870 is performed in a manner similar to that discussed with respect to
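The determinations at steps 810 and 820 may be summarized, as a sketch only, by the following selection function. The names `Action` and `choose_action` are hypothetical. The outcome for the write cache branch is an assumption: it follows the earlier observation that it may be most beneficial to avoid deduplicating a write cache block, and other embodiments may handle that branch differently.

```python
from enum import Enum


class Action(Enum):
    SKIP_WRITE_CACHE = "skip"                 # assumed outcome at step 810
    DEDUP_PLAIN = "step 860"                  # neither block is read cached
    DEDUP_PRESERVE_READ_CACHE = "step 870"    # one block is read cached


def choose_action(a, b, write_cache_blocks, cachemap):
    """Pick how to handle two blocks already known to hold identical data.

    write_cache_blocks: set of block numbers acting as write cache blocks
    cachemap:           dict keyed by read cached block number
    """
    # Step 810: is either block a write cache block?
    if a in write_cache_blocks or b in write_cache_blocks:
        return Action.SKIP_WRITE_CACHE
    # Step 820: is either block read cached?
    if a in cachemap or b in cachemap:
        return Action.DEDUP_PRESERVE_READ_CACHE
    return Action.DEDUP_PLAIN
```

Checking for write cache status first means that a frequently rewritten block is never coalesced merely because it happens to match another block at the moment of the scan.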
Embodiments of the present invention include various steps and operations, which have been described above. A variety of these steps and operations may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause one or more general-purpose or special-purpose processors programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware, software, and/or firmware.
Embodiments of the techniques introduced here may be provided as a computer program product, which may include a machine-readable medium having stored thereon non-transitory instructions which may be used to program a computer or other electronic device to perform some or all of the operations described herein. The machine-readable medium may include, but is not limited to, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, floppy disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or other types of machine-readable media suitable for storing electronic instructions. Moreover, embodiments of the present invention may also be downloaded as a computer program product, wherein the program may be transferred from a remote computer to a requesting computer by way of data signals embodied in a carrier wave or other propagation medium via a communication link.
The phrases “in some embodiments,” “according to some embodiments,” “in the embodiments shown,” “in other embodiments,” “in some examples,” and the like generally mean the particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present invention, and may be included in more than one embodiment of the present invention. In addition, such phrases do not necessarily refer to the same embodiments or different embodiments.
While detailed descriptions of one or more embodiments of the invention have been given above, various alternatives, modifications, and equivalents will be apparent to those skilled in the art without varying from the spirit of the invention. For example, while the embodiments described above refer to particular features, the scope of this invention also includes embodiments having different combinations of features and embodiments that do not include all of the described features. Accordingly, the scope of the present invention is intended to embrace all such alternatives, modifications, and variations as fall within the scope of the claims, together with all equivalents thereof. Therefore, the above description should not be taken as limiting the scope of the invention, which is defined by the claims.