Typical computer file systems store file data in small, fixed-size blocks, referred to by pointers maintained in metadata associated with each file. In the event two pointers refer to identical blocks, some storage capacity can be reclaimed by changing one or both pointers so that they refer to the same block. The process of finding pointers that refer to identical blocks and then changing one or both pointers so that they point to the same block is referred to herein as “deduplication”. Such deduplication is typically performed by a gateway that controls access by host computers to the storage medium.
In one of its aspects, the present invention provides a data center comprising plural computer hosts and a storage system external to said hosts, said storage system including storage blocks for storing tangibly encoded data blocks, each of said hosts including a host operating system with a deduplicating file system driver installed. The file system driver, referred to throughout the specification and drawings simply as “file system”, identifies identical data blocks stored in respective storage blocks. The file system merges such identical data blocks into a single storage block so that a first file exclusively accessed by a first host and a second file accessed exclusively by a second host concurrently refer to the same storage block.
In another of its aspects, the present invention provides a manufacture comprising computer-readable storage media encoded with a file system of computer-executable instructions. The file system, when executed on a host computer system, connects to a storage system managing files, including a shared-block file, encoded in said storage system. The files contain tangibly encoded metadata pointers referring to storage blocks containing tangibly encoded data blocks, said shared-block file having metadata pointers referring to blocks referred to by plural of said metadata pointers, said file system including a write-log handler for updating a hash index having a shared set of entries referring to shared storage blocks indirectly through said shared-block file, and having an unshared set of entries referring to unshared storage blocks indirectly through said files other than said shared-block file, said hash index being tangibly encoded in said storage system.
In another aspect, the invention provides a method comprising a first file system executing on a first host computer system, said first file system managing a first write operation to a first file on a storage system by writing first contents to a first storage block of said storage system and causing a first metadata pointer of said first file to refer to said first storage block; a second file system executing on a second host computer system managing a second write operation to a second file on said storage system by writing second contents to a second storage block of said storage system and causing a second metadata pointer of said second file to refer to said second storage block; said second file system determining whether or not said second contents are identical to said first contents; and, if said second contents are identical to said first contents, said second file system causing said second metadata pointer to refer to said first storage block.
In still another aspect, the invention provides a method of performing deduplication operations in a computer system having multiple host systems connected to a common storage system, the method comprising the steps of maintaining a hierarchical data structure including a low-level data structure and one or more higher level data structures, and at each host system, tracking write operations to the common storage system during a period of time and asynchronously performing deduplication operations on storage blocks that are written in connection with the write operations using the hierarchical data structure.
Further embodiments of the invention include a computer system and a computer-readable medium in which instructions for the computer system are stored. The computer system includes a plurality of host systems connected to a common storage system, wherein each host system is programmed to track write operations to the common storage system during a period of time and to asynchronously perform deduplication operations on storage blocks that are written in connection with the write operations using a hierarchical data structure that is stored in the common storage system and includes a low-level data structure and one or more higher level data structures. The instructions stored in the computer-readable medium cause the computer system to maintain a hierarchical data structure including a low-level data structure and one or more higher level data structures, and to track write operations to a storage system during a period of time and asynchronously perform deduplication operations on storage blocks that are written in connection with the write operations using the hierarchical data structure.
A data center AP1 embodying the present invention is depicted in
As those skilled in the art will surmise, the invention provides for a great variety of data-center and other computer-system topologies. The invention provides for data centers with any number of hosts and the hosts can vary from each other, e.g., in the power and type of hardware involved, the number and types of applications and operating systems run, and schemes for networking the hosts. For example, using virtual-machine technology, one host can run several applications on respective operating systems, all sharing the same file system.
Applications 17A and 17B and operating systems 19A and 19B store data in files such as files FA, FB, and FC. File systems 20A and 20B divide the data into fixed-size blocks, 4 kB in this embodiment, and store it as data blocks D1-DN in respective storage blocks B1-BN. A file is associated with its contents by metadata block pointers. For example, file FA includes a block pointer PA1 that is associated with an offset location within file FA. Block pointer PA1 refers to storage block B1, which contains data block D1. (Note: the dashed arrows represent prior associations between pointers and blocks, while the solid arrows represent current associations between pointers and blocks.) A file typically has many pointers, and more than one of those can refer to a given storage block; for example, file FA includes pointers PA2 and PA3, both of which refer to storage block B2. It is also possible for two pointers from different files to point to the same block; for example, pointer PA4 of file FA and pointer PB1 of file FB both refer to storage block B4.
As indicated by two-way arrows 21 and 23, communications with SAN 11 by hosts HA and HB are largely independent. To prevent conflicting file accesses, hosts HA and HB are prevented from concurrently accessing the same file. To this end, each file includes a lock that can be owned by a host. Although a file can be accessed by at most one host at any given time, hosts HA and HB can time-share (access at different times) a file, e.g., file FC, by releasing and acquiring locks. For mnemonic and expository purposes, two files are treated herein as “permanently” owned by respective hosts: host HA permanently owns lock LA, so host HB can never access file FA; likewise, host HB permanently owns lock LB, so host HA can never access file FB. “Permanently” here means “for the entire duration discussed herein”.
In data center AP1, deduplication is decentralized. Each host HA, HB has its own deduplicating file system 20A, 20B. There are several advantages over a centralized approach. No specialized hardware is required to handle deduplication. There is no central host that might become a bottleneck or a single point of failure for data center AP1. Furthermore, the present invention scales conveniently, as adding more hosts inherently contributes more resources to the deduplication function.
Deduplication can be effected according to the following example. Prior to deduplication, pointer PA2 referred to storage block B2, and thus to data block D2, while pointer PA3 referred to storage block B3 and thus to data block D3. During a deduplication operation 25, it is determined that data block D3 is equivalent to data block D2. Data block D3 is then effectively merged with data block D2 in storage block B2 by changing block pointer PA3 so that it refers to storage block B2. Storage block B3 is thus freed for another use. Deduplication operation 25 was executed by host HA, while it had exclusive access to file FA, which includes as metadata all block pointers involved in operation 25.
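Deduplication operation 25 can be sketched in outline as follows (a minimal Python illustration using hypothetical dictionaries for storage blocks and pointers; the block and pointer names follow the example above, but this is not the claimed implementation):

```python
# Illustrative sketch of deduplication operation 25: two block pointers of
# file FA are found to reference storage blocks with identical contents,
# so one pointer is redirected and the duplicate storage block is freed.

storage = {                      # storage block -> data block contents
    "B2": b"\x02" * 4096,
    "B3": b"\x02" * 4096,        # duplicate of B2's contents
}
pointers = {"PA2": "B2", "PA3": "B3"}   # metadata block pointers of file FA

def merge_duplicates(pointers, storage):
    seen = {}                            # contents -> first block holding them
    for name, block in pointers.items():
        data = storage[block]
        if data in seen and seen[data] != block:
            pointers[name] = seen[data]  # redirect pointer to the shared block
            storage.pop(block, None)     # free the now-unreferenced block
        else:
            seen[data] = block
    return pointers

merge_duplicates(pointers, storage)
print(pointers["PA3"])   # now refers to B2; storage block B3 is freed
```

After the merge, both PA2 and PA3 refer to storage block B2, mirroring the state described above.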
However, the present invention does not require one host to have access to both files involved in a deduplication operation. For example, host HA can discover that storage block B5 and storage block B4 are likely to contain equivalent data blocks even though no file that host HA has access to refers to storage block B5. This discovery of likely equivalence can be made through deduplication-specific files 27. Host HA can record this likely equivalence by issuing a merge request and storing it in one of deduplication-specific files 27. Once host HB can obtain access to the merge request, host HB can determine whether the proposed equivalence is valid and, if so, change block pointer PB1 (which host HB has access to) to point to storage block B4 to effect deduplication operation 29. Thus, although acting independently, hosts HA and HB can cooperatively implement deduplication by time-sharing deduplication-specific files 27.
Due to the large numbers of storage blocks typically handled by a storage system, it is not practicable to compare every possible pair of blocks for possible duplicates. However, since new duplicates only (or at least primarily) arise in the context of write operations, deduplication candidates can be identified by tracking write operations. In an embodiment of the invention, each block is checked for possible matches as part of the write operation. However, the illustrated embodiment monitors write operations but defers deduplication to a time when demand on computing resources is relatively low to minimize any performance penalty to applications 17A and 17B.
Write Operations
Thus, in a method ME1, as flow-charted in
At step S11A, application 17A initiates a write operation, e.g., of data to file FA. The write operation involves writing data to a location within a file stored on SAN 11. Write operations initiated by application 17A may be: 1) confined to a single block; or 2) encompass multiple blocks or at least cross a block boundary. In the latter case, file system 20A breaks the write operation into single-block suboperations, each of which is treated as described below for a single-block write operation. Similarly, the range of write addresses asserted by application 17A is converted to file pointers. Each file pointer specifies a file identifier (file ID) and an offset value (indicating a location within the specified file). Associated with each such file location is metadata defining a block pointer that refers to a 4 kB storage block (B1-BN).
At step S12A, file system 20A detects the write operation and generates a write record. In the process, file system 20A generates a hash of the data block and associates it with the file pointer derived from the write request. In the illustrated embodiment, a write record is only generated for write operations in which an entire block is overwritten. No write record and no ensuing deduplication occurs in response to a write of a partial block. In an alternative embodiment, in the case where a write operation involves only a portion of a block, the remainder of the block must be read in to generate the hash. File system 20A uses an SHA-1 algorithm that generates 160-bit hashes, also known as “fingerprints”, “signatures”, and “digests”, so comparisons are between 20-byte values as opposed to 4 kB values. Two blocks with different hashes are necessarily different. SHA-1 hashes are collision resistant, so it is very unlikely that two blocks with the same hash will be different. To avoid any possibility of a mismatch, bit-wise comparisons of the full blocks can optionally be used to confirm a match indicated by a comparison of hashes. SHA-1 hashes also have security-related cryptographic properties that make it hard to determine a block from its hash. Alternative embodiments use other hash algorithms, e.g., SHA-2 and MD5.
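The fingerprint comparison described above can be sketched as follows (Python's standard hashlib is used here for illustration; the function name is an assumption, not part of the described file system):

```python
# SHA-1 reduces each 4 kB (4096-byte) data block to a 160-bit (20-byte)
# digest, so candidate duplicates are compared via 20-byte fingerprints
# rather than full 4 kB blocks.
import hashlib

BLOCK_SIZE = 4096

def fingerprint(block: bytes) -> bytes:
    assert len(block) == BLOCK_SIZE
    return hashlib.sha1(block).digest()   # 20 bytes

d2 = b"a" * BLOCK_SIZE
d3 = b"a" * BLOCK_SIZE
d4 = b"b" * BLOCK_SIZE

assert fingerprint(d2) == fingerprint(d3)   # identical blocks match
assert fingerprint(d2) != fingerprint(d4)   # different blocks differ
assert len(fingerprint(d2)) == 20           # 160-bit digest
```

A matching fingerprint only indicates a probable duplicate; as noted above, a bit-wise comparison of the full blocks can optionally confirm the match.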
At step S13A, file system 20A accesses the block pointer referred to by the file pointer that file system 20A derived from the write address range specified by application 17A. Thus, for example, for a write of data block D4 to file FA at an offset associated with block pointer PA4, host HA would access block pointer PA4.
File systems 20A and 20B distinguish between copy-on-write (COW) block pointers and “mutable” block pointers. A mutable-type pointer indicates that the target storage block can be overwritten. A COW-type pointer indicates that the target storage block must not be overwritten. For example, a storage block such as B2 in
At step S14A, file system 20A determines whether: 1) the write operation can be performed in place, i.e., the target block can be overwritten; or 2) the write operation must be performed on a copy of the target block, e.g., because other files referring to the block expect it to remain unchanged. In the illustrated embodiment, this determination is made by examining the COW vs. mutable type of the block pointer accessed in step S13A. If the pointer is mutable, the data block specified in the write operation overwrites the contents of the storage block referred to at step S15A. If the block pointer type is COW, a copy-on-write operation is performed and the data block is written to a free storage block at step S16A. The block pointer accessed in S13A is changed to refer to the new storage block at step S17A; its type is set to “mutable”. A storage-block reference count associated with the newly used storage block is incremented from “0” (“free”) to “1” (“unique”), at step S18A. Also, at step S18A, a storage-block reference count associated with the copy-source block is decremented, as one fewer block pointer refers to it.
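Steps S14A-S18A can be sketched as follows (an illustrative Python model with hypothetical dictionaries for pointers, storage blocks, and reference counts; not the actual file-system code):

```python
# A write through a mutable pointer overwrites in place; a write through a
# COW pointer goes to a fresh storage block, the pointer is retargeted and
# made mutable, and reference counts are adjusted.

def write_block(ptr, data, storage, refcount, free_blocks):
    if ptr["type"] == "mutable":             # step S15A: overwrite in place
        storage[ptr["block"]] = data
    else:                                    # COW path
        new_block = free_blocks.pop()        # step S16A: write to a free block
        storage[new_block] = data
        refcount[ptr["block"]] -= 1          # step S18A: source block loses a ref
        ptr["block"] = new_block             # step S17A: retarget pointer...
        ptr["type"] = "mutable"              # ...which is now mutable
        refcount[new_block] = 1              # step S18A: new block is "unique"
    return ptr

storage = {"B2": b"old"}
refcount = {"B2": 2}                         # B2 is shared, hence COW
ptr = {"type": "COW", "block": "B2"}
write_block(ptr, b"new", storage, refcount, ["B9"])
print(ptr["block"], refcount["B2"], refcount["B9"])   # B9 1 1
```

The sketch shows why the copy-source block's count drops: after the copy-on-write, one fewer block pointer refers to it.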
At step S19A, the write record generated in step S12A is transferred from host HA to SAN 11. Typically, write records accumulate at the host where they are organized by destination file. The write records are then transferred to write logs on SAN 11 for their respective files. The write records are subsequently used during deduplication operation S2A, typically scheduled for low utilization times, e.g., 2 am. Method ME1 analogously provides for steps S1B, S2B, and S11B-S19B for implementation by host HB.
Data Center Detail
As shown in
SAN 11 includes storage blocks including blocks B1 and B2, file sets including file sets FSA and FSB, a hash index 45, and a shared-block or “pool” file FP. Hash index 45, pool file FP, write logs WLA and WLB, and merge logs MLA and MLB are examples of deduplication-specific files FS (
Write logs, e.g., write logs WLA and WLB, and merge logs, e.g., merge logs MLA and MLB, are files with structures analogous to characteristic files. In other words, their contents, including write records and merge requests, are arranged in data blocks that are, in turn, stored in storage blocks B1-BN. The write logs and merge logs include metadata block pointers that refer to the storage blocks that store the write records and merge requests. For expository purposes, the characteristic files (e.g., FA and FB) are considered herein in their physical aspect (e.g., with metadata block pointers), while ancillary files, e.g., write logs and merge logs, are considered herein in their logical aspect, i.e., with direct reference to contents.
Write logs WLA and WLB are written to when storing write records and read from when processing those records during deduplication. They are also read from to discover hash-index entries that can be purged. The ownership of write log files follows ownership of the associated main files. Thus, host HA, for example, has exclusive access to write log WLA as long as it has exclusive access to file FA.
All other deduplication-specific files are accessible from both hosts HA and HB on a time-share basis (i.e., at different times, both host HA and host HB have exclusive access to these deduplication-specific files), whether or not the associated main files are. For example, host HA can access merge-request log MLB on a time-share basis even though it cannot access file FB at all. This allows host HA to store a merge request for handling by host HB.
File sets FSA and FSB are shown in more detail in
Merge log MLA includes merge requests MA1 and MA2, while merge log MLB includes merge requests MB1 and MB2. Each merge request MA1, MA2, MB1 specifies two file pointers: a “local” file pointer ML1, ML2, MBL, and a “pool” file pointer MP1, MP2, MBP. The local file pointer refers to a location in the associated characteristic file. For example, local file pointer ML1 points to an offset within characteristic file FA. (Note that since each ancillary file (write log or merge log) is associated with only one characteristic file, the local file pointer need only specify explicitly an offset.) The pool file pointer refers to a location within pool file FP.
The local file pointers and pool file pointers refer directly to file locations with associated block pointers. Thus, the local file pointers and pool file pointers refer indirectly to storage blocks. In an alternative embodiment, a merge request includes the block pointer from the pool file instead of a pool-file file pointer. In other words, in the alternative embodiment, merge requests refer to storage blocks directly rather than indirectly through an intermediate file.
Hash index 45 serves, albeit on a delayed basis, as a master list of all used storage blocks. Hash index 45 includes entries 47, 49, etc., assigning hash values to file pointers. The file pointers refer to file locations associated with block pointers associated with storage blocks associated with data blocks that are represented by the hashes. In other words, hash index 45 indirectly indexes storage blocks by their contents.
Hash index 45 is divided into horizontal shards 51 and 53. Each shard covers a pre-determined range of hash values, e.g., shard 51 includes hash values beginning with “0” while shard 53 includes hash values beginning with “1”. Dividing the hash index allows both hosts HA and HB to access respective shards concurrently and then switch so that each host has access to all entries. The number of shards into which a hash index is divided can be larger for greater numbers of hosts so that all or most hosts can access respective parts of the hash index concurrently.
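The shard-selection rule can be sketched as follows (a simplified Python illustration assuming a power-of-two shard count keyed on the leading bits of the digest; the actual partitioning scheme may differ):

```python
# Each shard covers a fixed range of hash values. With two shards, hashes
# whose leading bit is 0 go to shard 0 and hashes whose leading bit is 1
# go to shard 1, matching the "0"/"1" example above. More shards simply
# consume more leading bits, so the shard count can grow with the number
# of hosts.

def shard_of(digest: bytes, num_shards: int) -> int:
    assert num_shards & (num_shards - 1) == 0   # power of two, for simplicity
    bits = num_shards.bit_length() - 1          # bits needed to index shards
    return digest[0] >> (8 - bits) if bits else 0

assert shard_of(b"\x00" * 20, 2) == 0   # hash beginning with "0"
assert shard_of(b"\xff" * 20, 2) == 1   # hash beginning with "1"
assert shard_of(b"\xff" * 20, 4) == 3   # four shards use the top two bits
```

Because each shard is a disjoint hash range, two hosts can each lock and update a different shard concurrently, then swap.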
In an alternative embodiment, each hash index entry is explicitly associated with a list of all file pointers that refer to respective block pointers to the same block. In the illustrated embodiment, only one file pointer is listed per hash value. For hash values that are associated with more than one block pointer, the associated file pointer points to a pool file location. A block pointer associated with that pool file location refers to the common block referenced by those block pointers.
Pool file FP, like files FA and FB, includes a lock LP and block pointers PS1 and PS2. Basically, hash index entries, e.g., 47, 49, refer either to pool file FP or to other files. Hash index entries that refer to pool file FP refer to COW-type block pointers, while hash index entries that refer to other files refer to mutable-type block pointers. The COW-type pointers refer to blocks that are or at least were shared; the mutable-type block pointers refer to blocks that are not shared. In an alternative embodiment, there is no pool file and a hash index entry lists all file pointers associated with shared blocks.
Finding a Match
Before storage block contents can be merged, they must be determined to be identical. To this end, content hashes are compared; more specifically, the hash in a write record is compared to possibly matching hashes in hash index 45. Thus, as shown in
At step S21A, file system 20A identifies files to which host HA has exclusive access, e.g., by checking locks. At step S22A, write-log handler 39A accesses write records in write logs of accessible files; only those write records having hashes in the range of the accessed shard are processed until a different shard is accessed. In embodiments in which the hash index is not divided into shards, all accessible write records can be accessed. Even in such embodiments, the hashes can be ordered so that only a fraction of the hash entries need to be checked to establish a “miss” (no matching index entry).
At step S23A, for each write record, a determination is made whether or not the hash in the record matches a hash value in hash index 45. If there is no match, then the data block corresponding to the write record is unique. No deduplication is possible; however, the hash index is updated at step S24A to include a new entry corresponding to the write record. The entry includes the hash value and the file pointer of the write record. This completes processing of the subject write record. The next steps are handling merge requests at step S25A and purging deduplication-specific files FD at step S26A. These two steps are discussed further below.
If, at step S23A, a match is found, then the file pointer associated with that hash in the hash index is accessed at step S27A. Referring to
Write-record file pointer FA1 specifies a file (file FA) and an offset in that file at which block pointer PA4 is located. Block pointer PA4 refers to storage block B4 that contains data block D4. Herein, “WR file”, “WR offset”, “WR block pointer”, “WR storage block” and “WR data block” all refer to entities specified by or directly or indirectly referred to by a write record. Likewise, a prefix “IE” refers to entities specified by or referred to by an index entry file pointer in its original form. If an index entry file pointer has been revised, the prefix “RE” is used.
Match Points to Unique Storage Block
In effect, a write record that does not match any pre-existing index entries is itself entered into hash index 45. Initially, the new entry specifies the same file pointer (file and offset) that the write record specifies. This entry remains unchanged until it is matched by another write record. In the meantime, the IE file pointer refers to the original mutable-type WR block pointer that, in turn, refers to a WR storage block. However, since the WR block pointer is mutable, the WR data block may have been overwritten between the time the write record was generated and the time the match was recognized. In this case, the match between the WR hash and the IE hash is obsolete.
If the host processing the write record does not have access to the IE file, the host will not be able to determine whether or not the hash-index entry is obsolete. For example, if host HA is processing a write record for file FA and if that write record matches a hash-index entry that refers to file FB, host HA will, in effect, need the help of host HB if the validity of the index entry is to be determined. However, since hosts HA and HB access SAN 11 independently, this cooperation cannot rely on direct coordination between the hosts. Instead, host HA makes its information available by copying its block pointer to pool file FP and transferring the rest of the deduplication task to host HB in the form of a merge request.
When, at step S27A, write-log handler 39A determines that the IE file is not pool file FP, method ME1 continues at step S28A, as shown in
At step S29A, the type of the WR block pointer is changed from “mutable” to “COW”. At step S30A, this newly COW-type WR block pointer is added to pool file FP so that it is accessible by all hosts. Since the WR file and the pool file now share the WR storage block, its count is incremented to “2”.
At step S31A, the IE file pointer is changed to refer to the pool file. (Note: it is this step that leads to the inference that an index entry that refers to a file other than pool file FP has not been matched previously.) This resulting revised-entry RE file pointer now points to the WR storage block. For example, if host HA is processing a write record referring through WR block pointer PA4 to WR storage block B4 (as shown in
Since access to files is exclusive, the host processing a write record will not generally have access to the IE block pointer. If the host cannot access the IE file, it cannot identify the IE storage block and cannot change the IE block pointer to match the one in the pool file (from step S30A). Accordingly, the host transfers responsibility for these tasks to a host with access to the IE file by issuing a merge request and storing it in the merge log for the target file. For example, merge-request generator 41A of host HA can store a merge request in merge log MLB for handling by merge-request handler 43B of host HB at step S25B. Likewise, merge-request generator 41B of host HB can store merge requests in merge log MLA for handling by merge-request handler 43A of host HA at step S25A.
In an alternative embodiment, a host completes match determinations when it has access to the file referred to by the index entry. Thus, merge requests are only issued when the non-pool file referenced by an index entry is inaccessible to the host processing the write record.
Handling Merge Requests
Steps S25A and S25B include several substeps, herein referred to as “steps”, as shown in
If the comparison disconfirms the equality of the IE data block and the RE data block, host HB discards the merge request without performing any deduplication at step S36B. The IE block pointer and the IE storage block remain unchanged in response to the merge request. The WR storage block remains “shared” by pool file FP and the WR file. In an alternative embodiment, the hash index and the pool file revert to their respective states before processing of the write record that resulted in the merge request.
If the comparison at step S35B confirms the match, the IE block pointer in the exclusive file is conformed to the COW-type block pointer in the pool file at step S37B. At step S38B, block counts are adjusted. The IE storage block that had been referred to by one pointer is now referred to by none, so its count is decremented from “1” to “0”. The storage block referred to in the pool file has its count incremented from “2” to “3”.
For example, if, in the course of processing a merge request, host HB determines that the contents of storage block B5 still correspond to the index-entry hash, pointer PB1 will be changed from pointing to storage block B5 to storage block B4, as in deduplication operation 29 of
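Merge-request handling (steps S35B-S38B) can be sketched as follows (illustrative Python with hypothetical structures; SHA-1 re-hashing stands in for the match confirmation, and the block names follow the B4/B5 example above):

```python
# Host HB re-hashes the block its local pointer references; only if the
# hash still matches the index entry does it conform its pointer to the
# pool file's COW pointer and adjust reference counts.
import hashlib

def handle_merge_request(local_ptr, pool_block, expected_hash,
                         storage, refcount):
    current = hashlib.sha1(storage[local_ptr["block"]]).digest()
    if current != expected_hash:          # step S36B: match is obsolete
        return False                      # discard request; nothing changes
    old = local_ptr["block"]
    local_ptr["block"] = pool_block       # step S37B: conform to pool pointer
    local_ptr["type"] = "COW"
    refcount[old] -= 1                    # step S38B: old block 1 -> 0, freed
    refcount[pool_block] += 1             # shared block 2 -> 3
    return True

storage = {"B4": b"x" * 4096, "B5": b"x" * 4096}
refcount = {"B4": 2, "B5": 1}
pb1 = {"block": "B5", "type": "mutable"}
ok = handle_merge_request(pb1, "B4", hashlib.sha1(b"x" * 4096).digest(),
                          storage, refcount)
print(ok, pb1["block"], refcount)   # True B4 {'B4': 3, 'B5': 0}
```

The re-hash is the key safeguard: because PB1 was mutable, block B5 could have been overwritten between write-record generation and merge-request processing.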
Handling a Match that Refers to the Pool File
When a write record matches a unique index entry, the index entry is changed so that it refers to the pool file instead of its original file. In the illustrated embodiment, index entries do not change in the other direction. In an alternative embodiment, storage blocks referred to by the pool file and only one other block pointer revert back to having unique index entries and are treated in the same manner as an original unique entry.
If, at step S27A, the matching index entry refers initially to pool file FP, a new hash is generated from the contents of the WR storage block at step S39A, shown in
If at step S39A, the match is confirmed (not obsolete), host HA accesses the IE block pointer in the pool file entry at step S41A. The WR block pointer is updated to match the IE block pointer in pool file FP at step S42A. At step S43A, the IE storage block count is incremented. At step S44A, the count for the WR storage block is decremented to zero, and that block is freed. Host HB can implement analogous steps S39B-S44B for handling matches to shared entries.
Purging
Purge steps S26A and S26B are flow charted in
Each attempt to overwrite the shared block yields a COW operation so that one less pointer refers to the original storage block; in this case, the count is decremented by one. Thus, a COW operation can drop a count from “3” to “2”; the next COW operation on that block can drop the count from “2” to “1”, corresponding to the fact that only the pool file now points to the storage block. Since no other file points to that block, it can be freed by decrementing its counter to “0” and purging the corresponding entries in the pool file and the index.
In addition, an unprocessed write log may indicate that a unique storage block has been overwritten. If, before that write log is processed, another write log matches the hash for the unique storage block, method ME1 will determine that there is no match. This effort can be avoided by simply purging unique index entries for which the file pointer matches the file pointer of an unprocessed write record.
Accordingly, purge method S26A involves host HA scanning hash index 45 at step S45A. At step S46A, hash index entries with file pointers that match those of unprocessed write records are purged. At step S47A, hash index entries corresponding to shared blocks with a count of “1” are purged along with the referenced pool file entries. Also at this step, the referenced storage block is freed by setting its count to “0”. Steps S45B-S47B are performed analogously by host HB.
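The purge pass (steps S45A-S47A) can be sketched as follows (illustrative Python structures; the entry, pointer, and block names are hypothetical):

```python
# Entries whose file pointer matches an unprocessed write record are
# dropped, and shared entries whose block count has fallen to "1" (only
# the pool file still refers to the block) are purged along with the
# pool-file entry, freeing the block.

def purge(hash_index, unprocessed_ptrs, refcount, pool_entries):
    survivors = {}
    for h, fptr in hash_index.items():            # step S45A: scan the index
        if fptr in unprocessed_ptrs:              # step S46A: stale entry
            continue
        if fptr in pool_entries and refcount[pool_entries[fptr]] == 1:
            refcount[pool_entries[fptr]] = 0      # step S47A: free the block
            del pool_entries[fptr]
            continue
        survivors[h] = fptr
    return survivors

hash_index = {"h1": ("FA", 0), "h2": ("FP", 8), "h3": ("FA", 16)}
pool_entries = {("FP", 8): "B7"}
refcount = {"B7": 1}
hash_index = purge(hash_index, {("FA", 16)}, refcount, pool_entries)
print(sorted(hash_index), refcount["B7"])   # ['h1'] 0
```

Purging stale unique entries up front avoids the wasted match attempts described above.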
Mixed File Block Size Support
In file systems, data is managed in blocks of some fixed size. For example, some commonly used file systems use 4 kilobyte blocks and some other file systems (e.g., VMware™ VMFS) use bigger blocks such as 1 megabyte blocks. Managing data in larger size blocks simplifies many read and write operations and reduces the amount of metadata needed to keep track of stored data. However, deduplication tends to be more effective when smaller blocks are used as the probability of finding two matching data blocks is higher.
In one embodiment, to make the file system aware of this block fragmentation, a flag is stored in inode 150 to indicate that a pointer in inode 150 now points to a fragment pointer block. In one embodiment, this flag is stored in the pointer that points to the fragment pointer block. In this embodiment, if the flag is set to a particular state (e.g., yes or no, 0 or 1, etc.), the file system adjusts itself to manage multiple smaller blocks. In one example, consider a direct file whose inode consists of pointers to 1 megabyte file blocks. To individually address a 4 kilobyte block at an offset of 1032 kilobytes into the file, the second 1 megabyte block of the file is divided into 256 four kilobyte blocks. A fragment pointer block is allocated to store the pointers to the 256 small blocks, and the pointer to the original 1 megabyte block is replaced with a pointer to the fragment pointer block.
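The addressing arithmetic in the example above can be checked as follows (block sizes taken from the example; the function name is an assumption):

```python
# A 1032 kB offset falls in the second 1 MB file block (index 1), at the
# third 4 kB sub-block within it (index 2), and each 1 MB block divides
# into exactly 256 four-kilobyte fragments.

LARGE = 1024 * 1024            # 1 megabyte file block
SMALL = 4 * 1024               # 4 kilobyte sub-block

def locate(offset_bytes):
    large_index = offset_bytes // LARGE            # which 1 MB block
    small_index = (offset_bytes % LARGE) // SMALL  # which 4 kB fragment in it
    return large_index, small_index

assert LARGE // SMALL == 256                 # 256 fragments per block
assert locate(1032 * 1024) == (1, 2)         # 1032 kB -> 2nd block, 3rd fragment
```

This is why the fragment pointer block in the example holds exactly 256 pointers.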
Random Updates to Hash Index
The deduplication method according to embodiments of the invention operates out-of-band. Thus, there is a delay between when duplicate data blocks are written to storage blocks and when these duplicate data blocks are detected and reclaimed. Typically, the deduplication is carried out in batch once a day, during times of low hardware utilization, e.g., 2 am. Also, in order to improve batch performance, hash index 45 is maintained as a sorted, sequential file.
Another embodiment of the invention employs a hash index that is a variant of a B+ tree to support both efficient batch updates as well as efficient random updates. When performing deduplication using this hash index, the system can be configured to automatically select the faster way to update the index, either sequentially (which is likely when deduplication is infrequently performed, such as on the order of days) or randomly (which is likely when deduplication is performed more frequently, such as on the order of seconds).
More frequent deduplication may be beneficial during periods when large amounts of temporary duplicate data are created (a situation known as “temporary file system bloat”) and the system may not have enough storage space to accommodate the temporary spike in demand. This may occur, for example, when hosts HA, HB are configured to run virtual machines and one or more new virtual machines are being instantiated on them.
A conceptual diagram of the hash index that is a variant of a B+ tree is illustrated in
Referring to
Another index, referred to herein as a jump index 120, is maintained at a higher level from hash index 110. One or more jump indices 120 (two are shown in
When performing a sequential update, the host first reads the entries of hash index 110 and its write log, and then updates hash index 110 in the same manner as hash index 45, except hash index 110 is divided into a plurality of large pages and free space 112 is provided at the end of each large page. When performing a random update, the host reads the entries in its write log and uses jump indices 120, 130, etc. to locate the large pages that need to be updated. If an update to a large page does not cause the large page to overflow, i.e., free space 112 is sufficient to absorb the new entries into hash index 110 resulting from the update, the large page is updated. If the update causes the large page to overflow, i.e., the size of the large page with the new entries is greater than the allocated large page size, the large page is split into equal halves, the first of which is written over the original large page and the second of which is appended to the end of hash index 110, and the jump indices are updated. A similar splitting of the jump indices may occur if the number of entries added to hash index 110 is large. To prevent this from occurring, in some embodiments, free space is also provided at the end of the data structure of the jump indices.
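The random-update path (jump-index lookup, in-place update when free space suffices, split-and-append on overflow) can be sketched as follows. This is an illustrative toy, assuming integer hashes, a one-level jump index, and a deliberately tiny page capacity so a split is easy to observe; the actual on-disk layout and page sizes are not taken from the text.

```python
import bisect

PAGE_CAPACITY = 4  # entries per large page (tiny, for illustration only)

class HashIndex:
    """Large pages hold sorted (hash, location) entries with free space;
    a jump index maps the smallest hash of each page to that page's slot."""

    def __init__(self):
        self.pages = [[]]      # large pages; split halves are appended at the end
        self.jump_keys = [0]   # smallest hash routed to each page
        self.jump_pages = [0]  # page number corresponding to each jump key

    def _slot(self, h):
        return max(bisect.bisect_right(self.jump_keys, h) - 1, 0)

    def insert(self, h, loc):
        """Random update of a single write-log entry."""
        slot = self._slot(h)
        page = self.pages[self.jump_pages[slot]]
        bisect.insort(page, (h, loc))
        if len(page) > PAGE_CAPACITY:              # free space exhausted: overflow
            mid = len(page) // 2
            first, second = page[:mid], page[mid:]
            self.pages[self.jump_pages[slot]] = first  # overwrite the original page
            self.pages.append(second)                  # append the second half
            k = second[0][0]                           # register it in the jump index
            i = bisect.bisect_left(self.jump_keys, k)
            self.jump_keys.insert(i, k)
            self.jump_pages.insert(i, len(self.pages) - 1)

    def lookup(self, h):
        for key, loc in self.pages[self.jump_pages[self._slot(h)]]:
            if key == h:
                return loc
        return None
```

Appending split pages to the end of the index keeps the split cheap at the cost of physical ordering, which is why the jump index, rather than page position, determines where a hash lives.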
Prior to the start of this method, a time period for updates is specified and stored. This time period can be specified on a per-file, per-host, or per-cluster basis. In the example given here, the time period is specified for the host. At step S130, the host determines if the time for a next update has been reached. If so, the flow proceeds to step S131. At step S131, a selection is made between sequential update or random update. In one embodiment, if the time period for updates is greater than or equal to a threshold time period, sequential update is carried out (steps S132-S133). Otherwise, random update is carried out (steps S134-S140). The threshold time period can be configured by the user or dynamically determined from IO metrics learned from previous index updates.
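The selection at step S131 can be sketched as a simple threshold comparison. The one-hour default below is an illustrative assumption; the text only states that the threshold is user-configured or derived dynamically from IO metrics of previous index updates.

```python
def choose_update_mode(update_period_s, threshold_s=3600.0):
    """Step S131 sketch: long update periods accumulate many write-log
    entries, so a sequential sweep of the whole index is cheaper; short
    periods touch few pages, so jump-index-guided random updates win."""
    return "sequential" if update_period_s >= threshold_s else "random"
```

For example, a daily batch (86400 seconds) selects the sequential path, while near-continuous deduplication every few seconds selects the random path.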
At step S132, the first step of sequential update, the host accesses the hash index of
At step S134, the first step of random update, the host accesses its write log and reads the entries of the write log. Steps S134-S140 are then carried out for each entry in the write log. At step S135, the host examines the jump indices of the hash index to locate the large page that needs to be updated for the entry and determines whether the update causes the large page to overflow. If the update does not cause the large page to overflow, the large page is updated (step S136). However, if the update causes the large page to overflow, the large page is split into equal halves (step S137). After the split, at step S138, the first half of the large page is written over the original large page and the second half of the large page is appended to the end of the hash index. At step S139, the jump indices are updated. After steps S136 and S139, the flow returns to step S130.
Herein, a “hash” index is a file or other data structure that associates (directly or indirectly) hashes with the (present or past) storage-block locations of data blocks used to generate or that otherwise correspond to the hashes. Herein, a “shared-block file” or “pool file” (elsewhere referred to as an “arena”) is a file with pointers that refer (directly or indirectly) to storage blocks that are known to be or have been shared by different locations within the same file and/or by different files. In the illustrated embodiment, a hash-index entry can refer indirectly to a shared storage block by referring directly to a pool-file location having an associated metadata block pointer that refers directly to the shared storage block.
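The two kinds of hash-index entries defined above, and the extra level of indirection taken by shared entries through the pool file, can be sketched as follows. The class and field names are illustrative, not taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PoolSlot:
    """One pool-file location with its metadata block pointer."""
    block_ptr: int  # refers directly to the shared storage block

@dataclass
class HashEntry:
    digest: bytes
    pool_offset: Optional[int] = None            # shared: offset into the pool file
    file_loc: Optional[Tuple[str, int]] = None   # unshared: (file name, offset)

def resolve(entry, pool_file):
    """Follow a hash-index entry to its storage block: shared entries
    indirect through the pool file's metadata pointer, while unshared
    entries name the owning file and offset directly."""
    if entry.pool_offset is not None:
        return pool_file[entry.pool_offset].block_ptr
    return entry.file_loc
```

Routing shared blocks through the pool file means that when a shared block moves or is copied-on-write, only the pool file's pointer needs updating, not every hash-index entry that refers to it.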
In an alternative embodiment, each file with redirected file pointers has a corresponding hidden file that indicates which parts of the file are being shared and refer to a special pool-like file. All reads and writes go through a filter layer that is aware of these hidden files. The combination of the underlying file system and this filter layer is functionally equivalent to the illustrated file system that supports pointer rewriting and COW. In effect, the filter layer serves as a file system that uses another file system as its storage medium instead of using the disk directly. These and other variations upon and modifications to the illustrated embodiment are provided by the present invention, the scope of which is defined by the following claims.
In one or more embodiments, programming instructions for executing the above described methods and systems are provided. The programming instructions are stored on a computer readable medium.
With the above embodiments in mind, it should be understood that one or more embodiments of the invention may employ various computer-implemented operations involving data stored in computer systems. These operations are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Further, the manipulations performed are often referred to in terms, such as producing, identifying, determining, or comparing.
Any of the operations described herein that form part of one or more embodiments of the invention are useful machine operations. One or more embodiments of the invention also relate to a device or an apparatus for performing these operations. The apparatus may be specially constructed for the required purposes, or it may be a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
The programming modules and software subsystems described herein can be implemented using programming languages such as Flash, JAVA™, C++, C, C#, Visual Basic, JavaScript™, PHP, XML, HTML, etc., or a combination of programming languages. Commonly available protocols such as SOAP/HTTP may be used in implementing interfaces between programming modules. As would be known to those skilled in the art, the components and functionality described above and elsewhere herein may be implemented on any desktop operating system such as different versions of Microsoft Windows™, Apple Mac™, Unix/X-Windows™, Linux™, etc., executing in a virtualized or non-virtualized environment, using any programming language suitable for desktop software development.
The programming modules and ancillary software components, including configuration file or files, along with setup files required for providing the methods and apparatus described herein and related functionality, may be stored on a computer readable medium. Any computer medium such as a flash drive, a CD-ROM disk, an optical disk, a floppy disk, a hard drive, a shared drive, and storage suitable for providing downloads from connected computers, could be used for storing the programming modules and ancillary software components. It would be known to a person skilled in the art that any storage medium could be used for storing these software components so long as the storage medium can be read by a computer system.
One or more embodiments of the invention may be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a network.
One or more embodiments of the invention can also be embodied as computer readable code on a computer readable medium. The computer readable medium is any data storage device that can store data, which can thereafter be read by a computer system. Examples of the computer readable medium include hard drives, network attached storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs, CD-RWs, DVDs, Flash, magnetic tapes, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over a network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
While one or more embodiments of the present invention have been described, it will be appreciated that those skilled in the art upon reading the specification and studying the drawings will realize various alterations, additions, permutations and equivalents thereof. It is therefore intended that embodiments of the present invention include all such alterations, additions, permutations, and equivalents as fall within the true spirit and scope of the invention as defined in the following claims. Thus, the scope of the invention should be defined by the claims, including the full scope of equivalents thereof.
This application is a continuation of U.S. patent application Ser. No. 12/783,392, filed May 19, 2010, which is a continuation-in-part of U.S. patent application Ser. No. 12/356,921, filed Jan. 21, 2009, and claims the benefit of U.S. Provisional Patent Application Ser. No. 61/179,612, filed May 19, 2009. The entire contents of each of these applications are incorporated by reference herein.
Number | Name | Date | Kind |
---|---|---|---|
5584005 | Miyaoku | Dec 1996 | A |
5835765 | Matsumoto | Nov 1998 | A |
6075938 | Bugnion et al. | Jun 2000 | A |
6591269 | Ponnekanti | Jul 2003 | B1 |
6789156 | Waldspurger | Sep 2004 | B1 |
6792432 | Kodavalla | Sep 2004 | B1 |
6934880 | Hofner | Aug 2005 | B2 |
6996536 | Cofino et al. | Feb 2006 | B1 |
7111206 | Shafer et al. | Sep 2006 | B1 |
7275097 | Peake, Jr. | Sep 2007 | B2 |
7287131 | Martin et al. | Oct 2007 | B1 |
7567188 | Anglin et al. | Jul 2009 | B1 |
7600125 | Stringham | Oct 2009 | B1 |
7720892 | Healey, Jr. et al. | May 2010 | B1 |
7734603 | McManis | Jun 2010 | B1 |
7747584 | Jernigan, IV | Jun 2010 | B1 |
7822939 | Veprinsky | Oct 2010 | B1 |
7840537 | Gokhale et al. | Nov 2010 | B2 |
7921077 | Ting | Apr 2011 | B2 |
8099571 | Driscoll | Jan 2012 | B1 |
8135930 | Mattox et al. | Mar 2012 | B1 |
8190835 | Yueh | May 2012 | B1 |
8266152 | Millett | Sep 2012 | B2 |
9002800 | Yueh | Apr 2015 | B1 |
9734169 | Redlich | Aug 2017 | B2 |
10437865 | Clements et al. | Oct 2019 | B1 |
10496670 | Clements et al. | Dec 2019 | B1 |
10642794 | Clements et al. | May 2020 | B2 |
10706082 | Barrell | Jul 2020 | B1 |
11449480 | Shabi | Sep 2022 | B2 |
20020087500 | Berkowitz | Jul 2002 | A1 |
20020103983 | Rege | Aug 2002 | A1 |
20030037022 | Adya et al. | Feb 2003 | A1 |
20030058277 | Bowman-Amuah | Mar 2003 | A1 |
20040107225 | Rudoff | Jun 2004 | A1 |
20050033933 | Hetrick et al. | Feb 2005 | A1 |
20050083862 | Kongalath | Apr 2005 | A1 |
20050228802 | Kezuka | Oct 2005 | A1 |
20050240966 | Hindle | Oct 2005 | A1 |
20060065717 | Hurwitz et al. | Mar 2006 | A1 |
20060085433 | Bacon | Apr 2006 | A1 |
20060143328 | Fleischer et al. | Jun 2006 | A1 |
20060206929 | Taniguchi et al. | Sep 2006 | A1 |
20060230082 | Jasrasaria | Oct 2006 | A1 |
20070033354 | Burrows et al. | Feb 2007 | A1 |
20070050423 | Whalen | Mar 2007 | A1 |
20070061487 | Moore | Mar 2007 | A1 |
20070174673 | Kawaguchi et al. | Jul 2007 | A1 |
20070239806 | Glover | Oct 2007 | A1 |
20070260815 | Guha et al. | Nov 2007 | A1 |
20070294496 | Goss et al. | Dec 2007 | A1 |
20080005141 | Zheng | Jan 2008 | A1 |
20080005201 | Ting et al. | Jan 2008 | A1 |
20080010370 | Peake | Jan 2008 | A1 |
20080059726 | Rozas et al. | Mar 2008 | A1 |
20080111716 | Artan | May 2008 | A1 |
20080195583 | Hsu et al. | Aug 2008 | A1 |
20080215796 | Lam et al. | Sep 2008 | A1 |
20080222375 | Kotsovinos et al. | Sep 2008 | A1 |
20080235388 | Fried et al. | Sep 2008 | A1 |
20080294696 | Frandzel | Nov 2008 | A1 |
20090019246 | Murase | Jan 2009 | A1 |
20090063795 | Yueh | Mar 2009 | A1 |
20090070518 | Traister | Mar 2009 | A1 |
20090171888 | Anglin | Jul 2009 | A1 |
20090204636 | Li | Aug 2009 | A1 |
20090234795 | Haas | Sep 2009 | A1 |
20090254609 | Wideman | Oct 2009 | A1 |
20090271454 | Anglin | Oct 2009 | A1 |
20090287901 | Abali et al. | Nov 2009 | A1 |
20090307184 | Inouye | Dec 2009 | A1 |
20100042790 | Mondal et al. | Feb 2010 | A1 |
20100057750 | Aasted et al. | Mar 2010 | A1 |
20100070725 | Prahlad | Mar 2010 | A1 |
20100174714 | Asmundsson | Jul 2010 | A1 |
20100257181 | Zhou et al. | Oct 2010 | A1 |
20110131390 | Srinivasan et al. | Jun 2011 | A1 |
20120096008 | Inouye | Apr 2012 | A1 |
20130332660 | Talagala | Dec 2013 | A1 |
20140160591 | Sakamoto | Jun 2014 | A1 |
20220092046 | Tomlin | Mar 2022 | A1 |
20220237155 | Kabishcer | Jul 2022 | A1 |
Number | Date | Country |
---|---|---|
9940702 | Aug 1999 | WO |
Entry |
---|
Afek et al., Dangling Pointer-Smashing The Pointer for Fun and Profit, A whitepaper from Watchfire, pp. 1-22, 2007. |
Almgren et al., A Lightweight Tool for Detecting Web Server Attacks, pp. 1-14, 2000. |
Bolosky et al., Single Instance Storage in Windows 2000, Microsoft Research, Balder Technology Group, Inc., pp. 1-12. |
Douceur et al., Reclaiming Space from Duplicate Files in a Serverless Distributed File System, Microsoft Research, Microsoft Corporation, Jul. 2002 Technical Report MSR-TR-2002-30, pp. 1-14. |
Freeman, Larry, Looking Beyond the Hype: Evaluating Data Deduplication Solutions, Netapp White Paper, Sep. 2007. |
Hong et al., Duplicate Data Elimination in a SAN File System, pp. 101-114. |
Koller et al., I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance, Proceedings of FAST '10: 8th USENIX Conference on File and Storage Technologies, Feb. 26, 2010, pp. 211-224. |
Milos et al., Satori: Enlightened page sharing, Proceedings of 2009 USENIX Technical Conference, Jun. 17, 2009. Also available at <http://www.usenix.org/event/usenix09/tech/full_papers/milos/milos_html/index.html>, visited Aug. 5, 2010. |
Quinlan et al., Venti: a new approach to archival storage, USENIX Association, Proceedings of the FAST 2002 Conference on File and Storage Technologies; Monterey, CA, US, Jan. 28-30, 2002, pp. 1-14. |
Ramakrishnan, et al., Database Management Systems 3rd Edition, 2003. |
Zhu et al., Avoiding the Disk Bottleneck in the Data Domain Deduplication File System, USENIX Association, FAST 08: 6th USENIX Conference on File and Storage Technologies, pp. 269-282. |
Distributed computing, webopedia.com, Apr. 10, 2001. |
IEEE, The Authoritative Dictionary of IEEE Standards Terms Seventh Edition, 2000. |
Screenage, removing outdated ssh fingerprints from known_hosts with sed or . . . ssh-keygen, May 2008. |
Number | Date | Country | |
---|---|---|---|
20200065318 A1 | Feb 2020 | US |
Number | Date | Country | |
---|---|---|---|
61179612 | May 2009 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 12783392 | May 2010 | US |
Child | 16671802 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 12356921 | Jan 2009 | US |
Child | 12783392 | US |