The disclosure generally relates to the field of logical replication of stored data, and more particularly to a data storage reference architecture that enables efficient replication between storage platforms that may utilize storage efficiency mechanisms such as data compression and/or deduplication.
File systems are used in data processing and storage systems to establish naming conventions, protocols, and addressing that determine how data is stored and retrieved. A key function of most file systems is separating data into individually addressable portions and naming each portion to enable access to each individual portion. A file system may be implemented in a dedicated storage configuration in which the file system represents a single namespace tree and retains exclusive management of one or more physical storage resources (e.g., disks, SSDs, and/or partitions thereof) which provide the underlying persistent storage for the file system. The controlling file system determines the allocation of individual storage blocks on such dedicated storage configurations.
Continual growth in storage device capacities and increasing prevalence of multi-client access to large data stores have rendered dedicated file system storage an increasingly inefficient storage management system. Growing storage capacities tend to create a need for larger file systems on larger storage allocation groups to optimize performance and storage capacity utilization. However, larger scale storage capacity and correspondingly larger centralized storage management pose issues for end user clients, which may rely on, or otherwise benefit in terms of performance or security from, managing particular application data sets as logical units determined by the size and characteristics of the respective data sets.
Virtualization is utilized to abstract physical resources and to control allocation of logical resources independently of their underlying implementation. For storage systems, file volumes are virtualized to add a level of indirection between client-accessible volumes and the underlying physical storage resources. The resulting virtual file volumes may be managed independent of lower storage layers, and multiple volumes can be generated, deleted, and reconfigured within a same physical storage volume. Storage volume virtualization is achieved, at least in part, by using physical aggregate and file system layer referencing that are mutually mapped via a logical/virtual volume layer. While improving many aspects of application data storage and management, the data referencing and mapping incident to virtualization may create inefficiencies relating to the manner in which stored data is logically replicated, such as from one or more source storage volumes to one or more destination storage volumes.
Aspects of the disclosure may be better understood by referencing the accompanying drawings.
The description that follows includes example systems, methods, techniques, and program flows that embody aspects of the disclosure. However, it is understood that this disclosure may be practiced without one or more of these specific details. In other instances, well-known instruction instances, protocols, structures and techniques have not been shown in detail in order not to obfuscate the description.
Overview
Aspects of the disclosure include implementation of a file system naming schema that enables inline and/or background compression of data during or following logical replication of the data from a source volume to a destination volume. The naming schema may include an intermediate block containing multiple entries that each correspond to a logical block, referred to herein as an extent block. The intermediate block entries contain extent block numbers associated with a logical extent that is resolved by a mapper to a physical extent comprising one or more data blocks addressed by corresponding physical volume block numbers. In an aspect, the intermediate blocks disclosed herein may comprise indirect blocks in an inode structured file system.
Each of the logical extents may be processed by and within a logical volume that may form a file system instantiation within a storage server system. The logical extents may reference multiple fixed-size data blocks that may or may not be stored on contiguous physical storage blocks. The indirect blocks may comprise fixed or variable length data structures each having an extent ID that is unique within a given volume. In one aspect, an intermediate block further includes address pointers, such as in the form of extent block numbers, which collectively reference ranges and subranges of contiguous data blocks to which the logical extents are referenced.
Storage server 102 is communicatively coupled with a storage subsystem 120 comprising, in part, multiple storage devices 122a-122m and storage controller functionality (not depicted). Storage devices 122a-122m may be, for example, magnetic or optical disks or tape drives, non-volatile solid-state memory, such as flash memory, or any combination of such mass storage devices. Data stored within storage subsystem 120 is typically organized as one or more physical storage volumes comprising respective storage space allocated from storage devices 122a-122m that defines a logical arrangement of physical storage space within a storage aggregate. The storage devices, or portions thereof, within a given physical volume may be configured into one or more groups, such as Redundant Array of Independent Disks (RAID) groups that can be accessed by storage server 102 using, for instance, a RAID algorithm.
Storage server 102 includes a storage operating system (OS) 108 that implements an extent-based file system architecture to manage storage of data within storage subsystem 120, service client requests, and perform various other types of storage related operations. Storage OS 108 comprises a series of software layers executed by processor 104 to provide data paths for clients to access stored data using block and/or file access protocols. The layers include a file system 110, a RAID system layer 116, and a device driver layer 118. File system 110 is essentially a volume that may be combined with other volumes (file system instantiations) onto a common set of storage within a RAID level storage aggregate. RAID system layer 116 builds a RAID topology structure for the aggregate that guides each volume when performing write allocation. The RAID layer also presents a PVBN-to-disk block number (DBN) mapping for accessing blocks on physical storage media.
To provide for stored data backup, storage server 102 also includes a logical replication application 115 that replicates data at the file and file block level. For instance, if storage server 102 is configured as a primary server that actively handles client requests, logical replication application 115 may be programmed to send portions of modified data to a corresponding backup-side replication application executing from a backup storage server. Such replication may be performed based on periodic or asynchronous file system consistency points. If storage server 102 is configured as a backup server, logical replication application 115 may be programmed to receive and process replication requests which typically include write requests to store modified or new data as an archive version.
As depicted, storage OS 108 implements file system 110 to logically organize the data stored on storage subsystem 120 as a hierarchical structure of file system objects such as directories and files. In this manner, each file system object may be managed and accessed as a set of data structures such as on-disk data blocks that store user data. The data blocks may be organized within logical volumes within a logical volume layer wherein each logical volume may constitute an instantiation of a user file system including the file system management code and structures, as well as directories and files. Within a logical volume layer, each volume constitutes a respective volume block number space that is maintained by file system 110. File system 110 assigns a file block number to each data block in the file as an offset so that the file block numbers are arranged in the correct sequence. File system 110 allocates sequences of file block numbers for each file and assigns volume block numbers across each volume address space. In this manner, file system 110 organizes the “on-disk” data blocks within the volume block number space as a logical volume.
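As a minimal illustration of file block numbers acting as offsets within a file, the following Python sketch converts a byte range into the file block numbers it touches. The 4 KB block size and the function name `fbns_for_range` are assumptions made for this example and are not specified in the disclosure.

```python
BLOCK_SIZE = 4096  # assumed block size; the disclosure does not fix a value


def fbns_for_range(offset: int, length: int) -> list:
    """Return the file block numbers covered by a byte range of a file."""
    if length <= 0:
        return []
    first = offset // BLOCK_SIZE                    # block holding the first byte
    last = (offset + length - 1) // BLOCK_SIZE      # block holding the last byte
    return list(range(first, last + 1))
```

Under these assumptions, a 200-byte write starting at byte 4000 straddles the first two file blocks, so `fbns_for_range(4000, 200)` yields file block numbers 0 and 1.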
Storage servers often include storage efficiency components that reduce a physical data storage footprint and thus conserve physical storage space and reduce network traffic incident to logical replication. To enable such storage efficiency features, storage server 102 further includes a compression module 117 and a deduplication module 119. While shown as distinct blocks in the depicted aspect, compression module 117 and/or deduplication module 119 may be incorporated within or otherwise logically associated with storage OS 108 and/or logical replication application 115.
The primary function of compression module 117 is to compress data within a file across two or more data blocks. To accomplish this, compression module 117 evaluates data within a specified number of data blocks (compression group) and, if sufficient bit-level patterns are repeated, the compression group is compressed into a number of physical data blocks that is less than the number of corresponding logical blocks. In one aspect, compression module 117 performs inline compression in which compression groups are compressed prior to being written to storage subsystem 120. In another aspect, compression module 117 performs background compression in which compression groups are compressed following initially being written to storage subsystem 120.
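The compression-group decision described above can be sketched in Python as follows. This is an illustrative sketch only: the 4 KB block size, the function names, and the use of `zlib` as the compressor are assumptions, and "sufficient" repetition is approximated here by requiring that compression save at least one whole block.

```python
import os
import zlib

BLOCK_SIZE = 4096  # assumed block size; not specified in the disclosure


def blocks_needed(n_bytes: int) -> int:
    """Number of fixed-size blocks required to hold n_bytes (ceiling division)."""
    return -(-n_bytes // BLOCK_SIZE)


def compress_group(blocks):
    """Return (stored_blocks, compressed) for one compression group.

    The group is kept in compressed form only if that saves at least one
    whole physical block; otherwise the original blocks are returned.
    """
    packed = zlib.compress(b"".join(blocks))
    n_packed = blocks_needed(len(packed))
    if n_packed < len(blocks):
        padded = packed.ljust(n_packed * BLOCK_SIZE, b"\0")  # pad to a block boundary
        stored = [padded[i:i + BLOCK_SIZE] for i in range(0, len(padded), BLOCK_SIZE)]
        return stored, True
    return list(blocks), False


# Usage: a highly repetitive group compresses, a random group does not.
repetitive = [bytes(BLOCK_SIZE)] * 8
random_like = [os.urandom(BLOCK_SIZE) for _ in range(8)]
stored_a, flag_a = compress_group(repetitive)
stored_b, flag_b = compress_group(random_like)
```

The same function could serve either the inline or the background case; the two differ only in whether it is invoked before or after the group is first written.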
Deduplication module 119 provides an alternative or complementary storage efficiency mechanism that eliminates copies of the same data unit and allocates pointers to the retained copy. For example, block-level deduplication entails identifying data blocks containing identical data, removing all but one copy across a volume, and allocating pointers to the retained block.
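Block-level deduplication of this kind can be sketched as shown below. The disclosure does not state how identical blocks are identified; comparing SHA-256 fingerprints is an assumption made for the example, as are the function and variable names. A production system would likely also verify candidate matches byte for byte.

```python
import hashlib


def deduplicate(blocks):
    """Return (unique_blocks, pointers), where pointers[i] indexes unique_blocks.

    Only the first copy of each distinct block is retained; every later
    duplicate is represented by a pointer to the retained copy.
    """
    index_by_digest = {}
    unique_blocks = []
    pointers = []
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest not in index_by_digest:
            index_by_digest[digest] = len(unique_blocks)
            unique_blocks.append(block)
        pointers.append(index_by_digest[digest])
    return unique_blocks, pointers


# Usage: three logical blocks, two of which hold identical data.
unique_blocks, pointers = deduplicate([b"aaaa", b"bbbb", b"aaaa"])
```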
Logical replication, such as performed by logical replication application 115, differs from physical replication, in which the entire data set comprising the file system, including all data and all logical-to-physical mapping, is preserved from the source to the destination server. Logical replication entails a file system level transfer of file system objects such as files, directories, and file block numbers from the source logical volume to the destination logical volume. Storage efficiency mechanisms such as compression and/or deduplication may be performed with logical replication to reduce storage space consumption on the destination and to reduce network traffic. However, for conventional file systems that implement logical volumes, the logical-to-physical volume mapping places a substantial performance penalty on logical replication due to the loss of logical-to-physical address mapping that occurs during compression. Such loss of mapping results in the need, for example, to uncompress data blocks belonging to a compression group prior to overwriting a portion of that compression group.
As further depicted in
In one aspect, file system 110 implements a fixed block size, inode pointer structure for organizing access to logical and physical blocks. The inode pointer structure employs intermediate blocks in the form of indirect blocks that map file block numbers to respective logical extents comprising extent blocks addressed by extent block numbers. The architecture further includes a layer of extent blocks comprising extent block entries that map extent block numbers to physical block extents that are uniquely identified by an extent ID.
In one aspect of the present disclosure, data is stored in the form of volumes, where each volume contains one or more files and directories. As utilized herein, an aggregate refers to a pool of storage, which combines one or more physical storage devices (e.g., disks, SSDs) or parts thereof into a single logical storage object. An aggregate contains or provides storage for one or more other logical data sets at a higher level of abstraction, such as volumes. An aggregate uses a physical volume block number (PVBN) space that defines the storage space of blocks provided by the storage devices of the physical volume. Each volume uses a logical volume block space to organize those blocks into one or more higher level objects, such as files and directories. A PVBN, therefore, is an address of a physical block in the aggregate. The present disclosure describes a logical block type that is extent-based and mapped to PVBNs in a manner that enables transactional decoupling of the physical block mapping corresponding to logical extents processed at a logical volume level.
Each of the EBNs is further associated within each of the indirect block entries 204, 206, 208, and 210 with an extent ID that is unique within a given aggregate. As shown in
In one aspect, source and destination storage servers 302 and 304 are cooperatively configured to perform logical replication wherein data at a logical, file system level is replicated from source to destination. Such logical replication may be implemented, for instance, in a vaulting relationship in which destination storage server 304 is used to archive data generated, stored, and modified by source storage server 302. In addition to performing logical replication, storage servers 302 and/or 304 may implement a storage efficiency mechanism such as data compression and/or deduplication in order to maximize physical storage space efficiency and minimize replication-related network traffic over transport channel 305. However, due to the I/O performance tradeoffs inherent in using deduplication, and particularly in using data compression, source storage server 302 and destination storage server 304 may differ in their respective use of such storage efficiency mechanisms. For example, source storage server 302 may utilize only deduplication while inline compression is enabled for destination server 304 during logical replication operations.
As shown in
Memory content 310 shows how data is mapped from a file to physical storage media by source storage server 302. Namely, an ordered sequence of FBNs 312 is depicted such as may comprise a file, all or a portion of which may be modified and replicated to destination storage server 304. Indirect block entries, such as entry 314, may map each of FBNs 312 to a corresponding virtual volume block number (VVBN) within a VVBN container file 316. A single such mapping is expressly depicted for purposes of clarity. As further depicted, the same indirect block entry 314 that maps FBN1 to a VVBN1 also maps FBN1 to a PVBN1 in an aggregate 318 within the physical block layer.
In contrast to the virtualization and block mapping used by the source, in which there is a fixed, one-to-one mapping between an FBN and a VVBN/PVBN pair, destination storage server 304 employs an extent-based architecture as depicted within memory content 320. The same sequence of file block numbers, FBN0, FBN1, FBN2, and FBN3, is utilized to represent a same file 322 as stored in FBN0, FBN1, FBN2, and FBN3 within FBN sequence 312. However, on the destination side, each of the FBNs is mapped into logical blocks via an indirect block 324 containing entries 1E2, 1E3, 1E4, and 1E5 in which the VVBNs used by the source side are replaced with extent block numbers E1.0, E1.1, E1.2, and E1.3.
The destination side volume virtualization mechanism further includes an extent-to-PVBN map 330 that maps the extent block numbers to corresponding extents such as within extent map entries 332 and 334. As illustrated, each of the extent map entries maps one or more extent block numbers to a PVBN within a destination side aggregate 336 independently of the FBN-to-EBN mapping provided by indirect block 324.
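The contrast between the two mapping schemes can be sketched with plain Python dictionaries. This is a simplified model, not the on-disk layout: the extent ID `"E1"`, all VVBN and PVBN values, and the helper `remap_extent` are hypothetical and chosen only for illustration.

```python
# Source side: each indirect entry pins an FBN to a fixed (VVBN, PVBN) pair.
source_indirect = {0: (0, 100), 1: (1, 101), 2: (2, 102), 3: (3, 103)}

# Destination side: the indirect block only records FBN -> extent block number.
dest_indirect = {0: "E1.0", 1: "E1.1", 2: "E1.2", 3: "E1.3"}

# A separate extent-to-PVBN map resolves the extent's blocks to physical blocks.
extent_to_pvbn = {
    "E1": {"ebns": ["E1.0", "E1.1", "E1.2", "E1.3"], "pvbns": [200, 201, 202, 203]},
}


def remap_extent(extent_id, new_pvbns):
    """Point an extent at different physical blocks without touching the indirect block."""
    extent_to_pvbn[extent_id]["pvbns"] = list(new_pvbns)


before = dict(dest_indirect)
remap_extent("E1", [300, 301])  # e.g. the extent now occupies two compressed blocks
```

In this model the physical placement of the extent changes while `dest_indirect` is left untouched, which is the independence between the FBN-to-EBN mapping and the extent-to-PVBN mapping described above. On the source side, by contrast, any change of PVBN would require rewriting the corresponding indirect entry.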
The foregoing destination side indirect block mapping and extent-to-PVBN mapping enables more efficient processing of logical replication when compression is enabled on destination storage server 304. For instance, consider an eight block file comprising FBN0-FBN7 stored on source storage server 302 as eight physical blocks addressed at PVBN0-PVBN7 with the same file stored on destination storage server 304 in compressed form as four physical blocks addressed at PVBN10-PVBN13. Suppose source blocks FBN2 and FBN3 corresponding to PVBN2 and PVBN3 are modified and sent in a write request for replication to destination storage server 304. At the logical volume level the corresponding file blocks FBN2 and FBN3 remain mapped to logical extent blocks EBN2 and EBN3. However, mappings to the PVBNs are not directly maintained due to the compression to four PVBNs. Instead of requiring PVBN10-PVBN13 to be uncompressed to commit the write to storage, the extent-to-PVBN map 330 allocates an additional extent entry in which, assuming no compression, two new extent block numbers are assigned to the modified FBN2 and FBN3 and the extent block numbers are associated with the new extent ID.
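The eight-block example above can be traced with a small Python model of the destination state. The extent IDs 1 and 2, the new extent block numbers 8 and 9, and the new physical blocks PVBN14 and PVBN15 are hypothetical values chosen for illustration; only FBN0-FBN7 and PVBN10-PVBN13 come from the example itself.

```python
# Destination state before the overwrite: FBN0-FBN7 belong to one compressed
# extent whose data occupies PVBN10-PVBN13.
indirect = {fbn: (1, fbn) for fbn in range(8)}  # FBN -> (extent ID, EBN)
extent_map = {
    1: {"ebns": list(range(8)), "pvbns": [10, 11, 12, 13], "compressed": True},
}

# Replicated overwrite of FBN2 and FBN3: allocate an additional, uncompressed
# extent entry rather than uncompressing PVBN10-PVBN13.
extent_map[2] = {"ebns": [8, 9], "pvbns": [14, 15], "compressed": False}
indirect[2] = (2, 8)
indirect[3] = (2, 9)
```

After the update, FBN2 and FBN3 resolve through the new extent, the other six file blocks still resolve through the original compressed extent, and the entry for that original extent is unchanged.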
In one aspect the storage server processes the write request with an inline compression engine enabled. At block 506, the file system, in cooperation with the compression engine, identifies segments of submultiples of the data blocks that form one or more corresponding compression groups. At block 508 each of the segments of two or more data blocks is evaluated to determine compressibility based on whether sufficient bit-level repeat patterns exist among the blocks in a given compression group. For each compression group that is determined not to be compressible, the file system stores the data blocks uncompressed (block 510) at physical storage locations addressed by physical block addresses. For each compression group that is determined to be compressible, the compression engine compresses the data blocks into a smaller set of data blocks which the file system stores at physical locations addressed at physical block addresses (block 512).
In addition to storing the compressed or uncompressed blocks, the file system allocates an extent-to-PVBN map entry for each of the compression groups (block 514). An extent-to-PVBN map entry includes a field containing an extent ID that is unique within an aggregate of PVBNs. The entry further associates the extent ID with the extent block numbers assigned at block 504. At some point subsequent to writing a new file or new blocks for a file, a request to write data to modified file blocks may be received (block 516). In response to the overwrite request, the file system may read a compression flag stored with an extent map entry to determine whether the target file blocks have been compressed (block 518). In response to determining that the target blocks are not compressed within physical storage, the file system updates a corresponding extent map entry with replacement extent block numbers that will now be associated within the entry with the unchanged extent ID (blocks 520 and 522). In response to determining that the target blocks are compressed on disk, the file system allocates a new extent map entry that associates replacement extent block numbers with a new extent ID (block 524). In either case (a new extent map entry with a new extent ID, or replacement of extent block numbers only), the file system accesses the intermediate block to replace the previous extent block numbers with replacement block numbers such that each of the file block numbers from the overwrite request is respectively associated with one of the replacement extent block numbers (block 526). In the case of a newly allocated extent map entry, the intermediate block is also modified to replace the previous extent ID with the new extent ID.
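The overwrite branch on the compression flag (blocks 516 through 526) can be sketched as follows. This is a simplified, in-memory sketch under stated assumptions: the class and method names are hypothetical, every overwrite is assumed to target blocks within a single extent, physical block allocation is omitted, and a newly allocated extent is assumed to be stored uncompressed.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Tuple


@dataclass
class ExtentMapEntry:
    extent_id: int
    ebns: List[int]    # extent block numbers associated with the extent ID
    compressed: bool   # compression flag read at block 518


class Volume:
    def __init__(self) -> None:
        self.extent_map: Dict[int, ExtentMapEntry] = {}
        self.indirect: Dict[int, Tuple[int, int]] = {}  # FBN -> (extent ID, EBN)
        self._extent_ids = count(1)
        self._ebns = count(0)

    def write_new(self, fbns: List[int], compressed: bool) -> int:
        """Allocate an extent map entry for newly written blocks (block 514)."""
        extent_id = next(self._extent_ids)
        ebns = [next(self._ebns) for _ in fbns]
        self.extent_map[extent_id] = ExtentMapEntry(extent_id, ebns, compressed)
        for fbn, ebn in zip(fbns, ebns):
            self.indirect[fbn] = (extent_id, ebn)
        return extent_id

    def overwrite(self, fbns: List[int]) -> int:
        """Handle an overwrite request (blocks 516-526); returns the extent ID used."""
        old_id = self.indirect[fbns[0]][0]
        entry = self.extent_map[old_id]
        new_ebns = [next(self._ebns) for _ in fbns]
        if entry.compressed:
            # Block 524: leave the compressed extent untouched, allocate a new entry.
            extent_id = next(self._extent_ids)
            self.extent_map[extent_id] = ExtentMapEntry(extent_id, new_ebns, False)
        else:
            # Blocks 520/522: keep the extent ID, swap in replacement EBNs.
            extent_id = old_id
            replaced = {self.indirect[f][1] for f in fbns}
            entry.ebns = [e for e in entry.ebns if e not in replaced] + new_ebns
        # Block 526: update the intermediate block entries for the overwritten FBNs.
        for fbn, ebn in zip(fbns, new_ebns):
            self.indirect[fbn] = (extent_id, ebn)
        return extent_id


# Usage: the same overwrite takes a different path depending on the flag.
plain = Volume()
plain_id = plain.write_new([0, 1, 2, 3], compressed=False)
plain_after = plain.overwrite([1, 2])

packed = Volume()
packed_id = packed.write_new(list(range(8)), compressed=True)
packed_after = packed.overwrite([2, 3])
```

In the uncompressed case the extent ID is reused and only the extent block numbers change; in the compressed case the original entry is never modified, so the compressed data on disk does not have to be uncompressed to commit the write.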
At block 608, the file system uses the identified extent ID to access a corresponding extent map entry and reads a compression flag within the entry to determine whether the received data blocks have been compressed on the destination physical storage. In response to determining that the data blocks received in the write request are not compressed on destination server storage, the file system updates a corresponding extent map entry with replacement extent block numbers to be associated within the map entry with the unchanged extent ID (blocks 610 and 612). In response to determining that the data blocks are compressed on destination storage, the file system allocates a new extent map entry that associates replacement extent block numbers with a new extent ID (block 614). In either case (a new extent map entry with a new extent ID, or replacement of extent block numbers only), the file system accesses the intermediate block to replace the previous extent block numbers with replacement block numbers such that each of the file block numbers from the overwrite request is respectively associated with one of the replacement extent block numbers (block 616). In the case of a newly allocated extent map entry, the intermediate block is also modified to replace the previous extent ID with the new extent ID.
The file to which the overwritten data blocks belong may include other data blocks contained within another physical and logical extent. In such a case, and as shown at block 618, the file system associates the extent map entry for a newly created extent with the other extent map entries that map extent block numbers of other blocks of the file to the previously existing extent(s). The file system may then receive a read request for blocks within the file that span between physical extents that are each identified with respective extent IDs (block 620). In response to such a read request, the file system, in cooperation with a local compression engine, uncompresses data within the requested data blocks that are pointed to by different extent IDs (block 622). The uncompressed data is then written to physical block locations having corresponding physical block addresses (block 624). As shown at block 626, the physical block addresses are mapped to corresponding extent block numbers within the extent map entries corresponding to the different extent IDs.
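One possible reading of the read path in blocks 620 through 626 is sketched below. It is an interpretation rather than the disclosed implementation: the dictionary-based storage model, the 4 KB block size, the `free_pvbns` allocator list, and the use of `zlib` are all assumptions, a compressed extent is modeled as unpadded compressed bytes, and release of the old compressed blocks is omitted.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size


def read_span(fbns, indirect, extent_map, storage, free_pvbns):
    """Read file blocks that may span several extents.

    Each compressed extent touched by the read is uncompressed once, written
    back to freshly allocated physical blocks, and its extent map entry is
    updated so that each EBN maps to one PVBN.
    """
    touched = []
    for fbn in fbns:
        extent_id = indirect[fbn][0]
        if extent_id not in touched:
            touched.append(extent_id)

    for extent_id in touched:
        entry = extent_map[extent_id]
        if entry["compressed"]:
            raw = zlib.decompress(b"".join(storage[p] for p in entry["pvbns"]))
            new_pvbns = [free_pvbns.pop(0) for _ in entry["ebns"]]
            for i, pvbn in enumerate(new_pvbns):
                storage[pvbn] = raw[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            entry["pvbns"] = new_pvbns     # map the new PVBNs to the entry's EBNs
            entry["compressed"] = False

    result = []
    for fbn in fbns:
        extent_id, ebn = indirect[fbn]
        entry = extent_map[extent_id]
        result.append(storage[entry["pvbns"][entry["ebns"].index(ebn)]])
    return result


# Usage: FBN0-FBN1 sit in a compressed extent, FBN2 in an uncompressed one.
A, B, C = b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"C" * BLOCK_SIZE
storage = {10: zlib.compress(A + B), 20: C}
extent_map = {
    1: {"ebns": [0, 1], "pvbns": [10], "compressed": True},
    2: {"ebns": [2], "pvbns": [20], "compressed": False},
}
indirect = {0: (1, 0), 1: (1, 1), 2: (2, 2)}
free_pvbns = [30, 31, 32]
data = read_span([1, 2], indirect, extent_map, storage, free_pvbns)
```

Note that the intermediate block (`indirect`) is not modified by the read; only the extent map entries and the physical blocks change.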
Variations
The flowcharts are provided to aid in understanding the illustrations and are not to be used to limit scope of the claims. The flowcharts depict example operations that can vary within the scope of the claims. Additional operations may be performed; fewer operations may be performed; the operations may be performed in parallel; and the operations may be performed in a different order. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by program code. The program code may be provided to a processor of a general purpose computer, special purpose computer, or other programmable machine or apparatus.
As will be appreciated, aspects of the disclosure may be embodied as a system, method or program code/instructions stored in one or more machine-readable media. Accordingly, aspects may take the form of hardware, software (including firmware, resident software, micro-code, etc.), or a combination of software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” The functionality provided as individual modules/units in the example illustrations can be organized differently in accordance with any one of platform (operating system and/or hardware), application ecosystem, interfaces, programmer preferences, programming language, administrator preferences, etc.
Any combination of one or more machine readable medium(s) may be utilized. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable storage medium may be, for example, but not limited to, a system, apparatus, or device, that employs any one of or combination of electronic, magnetic, optical, electromagnetic, infrared, or semiconductor technology to store program code. More specific examples (a non-exhaustive list) of the machine readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a machine readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. A machine readable storage medium is not a machine readable signal medium.
A machine readable signal medium may include a propagated data signal with machine readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A machine readable signal medium may be any machine readable medium that is not a machine readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a machine readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as the Java® programming language, C++ or the like; a dynamic programming language such as Python; a scripting language such as Perl programming language or PowerShell script language; and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on a stand-alone machine, may execute in a distributed manner across multiple machines, and may execute on one machine while providing results and/or accepting input on another machine.
The program code/instructions may also be stored in a machine readable medium that can direct a machine to function in a particular manner, such that the instructions stored in the machine readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
While the aspects of the disclosure are described with reference to various implementations and exploitations, it will be understood that these aspects are illustrative and that the scope of the claims is not limited to them. In general, techniques for an object storage backed file system that efficiently manipulates namespace as described herein may be implemented with facilities consistent with any hardware system or hardware systems. Many variations, modifications, additions, and improvements are possible.
Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the disclosure. In general, structures and functionality shown as separate components in the example configurations may be implemented as a combined structure or component. Similarly, structures and functionality shown as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the disclosure.
This application is a continuation-in-part of U.S. patent application Ser. No. 14/286,900, filed on May 23, 2014, titled “OVERWRITING PART OF COMPRESSED DATA WITHOUT DECOMPRESSING ON-DISK COMPRESSED DATA,” which is a continuation of U.S. patent application Ser. No. 13/099,283, filed on May 2, 2011, titled “OVERWRITING PART OF COMPRESSED DATA WITHOUT DECOMPRESSING ON-DISK COMPRESSED DATA,” the content of both of which is incorporated by reference herein.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 13099283 | May 2011 | US |
| Child | 14286900 | | US |

| | Number | Date | Country |
|---|---|---|---|
| Parent | 14286900 | May 2014 | US |
| Child | 14929018 | | US |