The present disclosure relates to data caching techniques, and more particularly to caching techniques based on content locality.
Recent developments in solid state drives (SSDs) have been promising with rapid increases in capacity and decreases in cost. Because SSDs are implemented on a semiconductor device, SSDs provide advantages in terms of high-speed random reads, low power consumption, compact size, and shock resistance. Accordingly, current performance and cost characteristics of SSDs make them a good fit for a cache layer between system random access memory (RAM) and hard disk drive (HDD). However, traditional cache designs such as least recently used (LRU) eviction and variants do not work well for SSD cache, because SSD cache exhibits physical properties quite different from traditional RAM memories that have been used in cache designs for several decades.
Both flash memory cells and phase change memory (PCM) cells used in an SSD show asymmetrical properties in terms of read performance and write performance. For example, writes typically exhibit slower performance and higher resource usage (e.g., several times or an order of magnitude slower) compared with reads because of physical properties of the memory cells. In addition, write operations wear these memory cells, causing endurance problems. Take flash memory as an example. Each memory cell in flash memory may be changed in only one direction, i.e., from 1 to 0 but not vice versa. As a result, flash memory requires write operations to be performed on a clean page (e.g., a page having all 1's). The page then becomes the basic write unit for flash memory, typically sized around a few kilobytes (KB). In other words, write operations are not performed in-place. An overwrite of a page is thus typically performed by writing to a new, clean page in the SSD. Therefore, when an SSD is used as a cache having repeated read and write operations, the SSD may fill quickly. If there are no clean pages available for writes, garbage collection may be triggered. Garbage collection makes clean pages by erasing pages containing obsolete data. Such erase operations are performed in units of flash blocks, in which each flash block contains 64, 128, or more pages. Due to random reads and writes, a block chosen for erasure may contain pages with valid data. These pages with valid data may have to be moved to other blocks in order to erase the block. This phenomenon is referred to as write amplification: one write cascades into multiple writes for garbage collection. The cost of garbage collection and write amplification can be dramatic as SSD utilization approaches its full capacity.
The present disclosure relates to signature computation in a content locality based cache.
In one embodiment, the present disclosure describes a method for computing a signature of contents of a block in a cache. The method can include dividing a received block into shingles, where each shingle represents a subset of the received block. For each shingle, the method can include determining an intermediate fingerprint by processing the shingle, and determining whether the intermediate fingerprint is more representative of the contents of the block than a previous fingerprint. If the intermediate fingerprint is determined to be more representative of the contents of the block, the method can include storing the intermediate fingerprint as a representative fingerprint. If the intermediate fingerprint is determined to be less representative of the contents of the block, the method can include keeping the previous fingerprint as the representative fingerprint. The method can further include determining whether there are more shingles to process. If there are more shingles to process, the method can include processing the next shingle. If there are no more shingles to process, the method can include computing the signature of the contents of the block by adding the representative fingerprint to a sketch of the received block.
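The method steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the helper names (`shingles`, `fingerprint`, `representative_fingerprint`, `compute_sketch`), the use of a salted BLAKE2b hash as the intermediate fingerprint, the convention that a smaller fingerprint is "more representative," and the sketch length of eight are all hypothetical choices made for the example.

```python
import hashlib
from typing import Iterator, List, Optional

SHINGLE_SIZE = 8  # bytes per shingle (assumed for illustration)


def shingles(block: bytes, size: int = SHINGLE_SIZE) -> Iterator[bytes]:
    """Yield every overlapping `size`-byte window of the block."""
    for start in range(len(block) - size + 1):
        yield block[start:start + size]


def fingerprint(shingle: bytes, salt: int) -> int:
    """Intermediate fingerprint: a salted 64-bit hash of one shingle.

    The hash function is an assumption; the disclosure names Mersenne
    primes, Rabin fingerprinting, and random irreducible polynomials.
    """
    digest = hashlib.blake2b(
        shingle, digest_size=8, salt=salt.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def representative_fingerprint(block: bytes, salt: int) -> Optional[int]:
    """Keep whichever fingerprint is more representative (here: smaller)."""
    representative = None  # plays the role of the "previous fingerprint"
    for shingle in shingles(block):
        intermediate = fingerprint(shingle, salt)
        if representative is None or intermediate < representative:
            representative = intermediate
    return representative


def compute_sketch(block: bytes, sketch_len: int = 8) -> List[Optional[int]]:
    """Build the sketch from one representative fingerprint per salt."""
    return [representative_fingerprint(block, salt) for salt in range(sketch_len)]
```

Flipping the comparison to `>` gives the "larger is more representative" variant; either choice works as long as every block is processed with the same convention.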
In one embodiment, the present disclosure describes a circuit for computing a signature of contents of a block in a cache. The circuit can include a fingerprint circuit, a fingerprint buffer, and a comparator. The fingerprint circuit can be configured for processing a shingle of a received block, where the shingle represents a subset of the contents of the received block, and where the fingerprint circuit is configured to determine an intermediate fingerprint by processing the shingle. The fingerprint buffer can be configured for storing a previous fingerprint. The comparator can be in electrical communication with the fingerprint circuit and the fingerprint buffer. The comparator can be configured for comparing the intermediate fingerprint from the fingerprint circuit with the previous fingerprint from the fingerprint buffer to determine whether the intermediate fingerprint is more representative of the contents of the received block than the previous fingerprint. The comparator can also be configured for storing, in the fingerprint buffer, the intermediate fingerprint as a representative fingerprint for inclusion in the signature of the contents of the block, if the intermediate fingerprint is determined to be more representative.
The embodiments described herein can include additional aspects. For example, determining whether the intermediate fingerprint is more representative of the contents of the block than the previous fingerprint can include comparing the intermediate fingerprint with the previous fingerprint to determine whether the intermediate fingerprint is larger compared with the previous fingerprint, and if the intermediate fingerprint is determined to be larger compared with the previous fingerprint, the intermediate fingerprint can be determined to be more representative of the contents of the block. Determining whether the intermediate fingerprint is more representative of the contents of the block than the previous fingerprint can include comparing the intermediate fingerprint with the previous fingerprint to determine whether the intermediate fingerprint is smaller compared with the previous fingerprint, and if the intermediate fingerprint is determined to be smaller compared with the previous fingerprint, the intermediate fingerprint can be determined to be more representative of the contents of the block. Determining the intermediate fingerprint can include computing a hash value for the shingle. Determining the intermediate fingerprint can include determining a first intermediate fingerprint by performing a modulo operation between a Mersenne prime and the shingle, where the modulo operation is performed using a plurality of addition operations, determining a second intermediate fingerprint by performing a random permutation of the first intermediate fingerprint; and using the second intermediate fingerprint as the intermediate fingerprint. Performing the random permutation of the first intermediate fingerprint can include performing a bit shift operation by a random number of bits on the first intermediate fingerprint, and performing an addition operation by a random constant on the second intermediate fingerprint. 
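As an illustration of the Mersenne-prime variant, the following Python sketch reduces a shingle modulo 2^31 - 1 using only masks, shifts, and additions, then applies a shift-and-add permutation. The function names, the exponent 31, the shift amount, and the constant are assumptions chosen for the example. The sketch also interprets the bit shift as a circular shift so that the mapping stays one-to-one, which the text does not state.

```python
def mod_mersenne(value: int, exponent: int) -> int:
    """Reduce `value` modulo 2**exponent - 1 without a division.

    Because 2**exponent is congruent to 1 modulo 2**exponent - 1, the high
    and low parts of the value can simply be added together.
    """
    mersenne = (1 << exponent) - 1
    while value > mersenne:
        value = (value & mersenne) + (value >> exponent)
    return 0 if value == mersenne else value


def permute(first_fp: int, shift: int, constant: int, width: int) -> int:
    """Circular shift by `shift` bits, then add `constant`, within `width` bits."""
    mask = (1 << width) - 1
    rotated = ((first_fp << shift) | (first_fp >> (width - shift))) & mask
    return (rotated + constant) & mask


def intermediate_fingerprint(
    shingle: bytes,
    shift: int = 13,              # "random" shift amount (assumed value)
    constant: int = 0x5BD1E995,   # "random" constant (assumed value)
    exponent: int = 31,           # 2**31 - 1 is a Mersenne prime
) -> int:
    value = int.from_bytes(shingle, "big")
    first = mod_mersenne(value, exponent)             # first intermediate fingerprint
    return permute(first, shift, constant, exponent)  # second intermediate fingerprint
```

In this sketch the shift and constant are drawn once and then held fixed, so that the same shingle always maps to the same fingerprint.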
Determining the intermediate fingerprint can include determining a first intermediate fingerprint by performing Rabin fingerprinting on the shingle, where the Rabin fingerprinting calculates a random irreducible polynomial based on the shingle using a plurality of shift operations and exclusive or (XOR) operations, determining a second intermediate fingerprint by performing a random permutation of the first intermediate fingerprint, and using the second intermediate fingerprint as the intermediate fingerprint. The method can further include sampling a first subset of bits from the first intermediate fingerprint, determining whether the sampled first subset of bits from the first intermediate fingerprint matches a bit mask pattern, if the sampled first subset of bits from the first intermediate fingerprint matches the bit mask pattern, determining the second intermediate fingerprint based on a remaining second subset of bits from the first intermediate fingerprint, and otherwise, processing the next shingle. Determining the intermediate fingerprint can include determining a first intermediate fingerprint by calculating a random irreducible polynomial based on the shingle, sampling a first subset of bits from the first intermediate fingerprint, determining whether the sampled first subset of bits from the first intermediate fingerprint matches a bit mask pattern, if the sampled first subset of bits from the first intermediate fingerprint matches the bit mask pattern, determining a second intermediate fingerprint based on a remaining second subset of bits from the first intermediate fingerprint, and using the second intermediate fingerprint as the intermediate fingerprint; and otherwise, processing the next shingle. Calculating the random irreducible polynomial can include performing a table lookup of a pre-computed term of the random irreducible polynomial. 
The random irreducible polynomial can include (b_1*p^7 + b_2*p^6 + b_3*p^5 + b_4*p^4 + b_5*p^3 + b_6*p^2 + b_7*p^1 + b_8) mod M, where b_i denotes an i'th byte string of the shingle, where p denotes a prime constant, and M denotes a constant. The comparator configured for comparing the intermediate fingerprint from the fingerprint circuit with the previous fingerprint from the fingerprint buffer to determine whether the intermediate fingerprint is more representative of the contents of the received block than the previous fingerprint can include the comparator being configured for determining whether the intermediate fingerprint is larger than the previous fingerprint, and where determining whether the intermediate fingerprint is larger than the previous fingerprint determines whether the intermediate fingerprint is more representative of the contents of the received block than the previous fingerprint. The comparator configured for comparing the intermediate fingerprint from the fingerprint circuit with the previous fingerprint from the fingerprint buffer to determine whether the intermediate fingerprint is more representative of the contents of the received block than the previous fingerprint can include the comparator being configured for determining whether the intermediate fingerprint is smaller than the previous fingerprint, and where determining whether the intermediate fingerprint is smaller than the previous fingerprint determines whether the intermediate fingerprint is more representative of the contents of the received block than the previous fingerprint. The fingerprint circuit can include a first adder, a second adder, a third adder, a fourth adder, and a bit shifter. The first adder, the second adder, and the third adder can be configured for determining a first intermediate fingerprint by performing a modulo operation between a Mersenne prime and the shingle. 
The modulo operation can be performed by adding, using the first adder, a first subset of high order bits of the shingle to a second subset of high order bits of the shingle; adding, using the second adder, a first subset of low order bits of the shingle to a second subset of low order bits of the shingle; and determining, using the third adder, the first intermediate fingerprint by adding a result of the first adder to a result of the second adder. The bit shifter and the fourth adder can be configured for determining a second intermediate fingerprint by performing a random permutation of the first intermediate fingerprint. Performing the random permutation can include performing, using the bit shifter, a bit shift operation by a random number of bits on the first intermediate fingerprint; performing, using the fourth adder, an addition operation by a random constant on the second intermediate fingerprint, and using the second intermediate fingerprint as the intermediate fingerprint. The fingerprint circuit can include a polynomial subcircuit, a bit shifter, and an adder. The polynomial subcircuit can be configured for determining the first intermediate fingerprint, where the polynomial subcircuit includes a plurality of shift registers and a plurality of logic gates arranged to generate a Rabin fingerprint of the shingle, where the Rabin fingerprint represents a hash value of the contents of the received block. The bit shifter and the adder can be configured for determining a second intermediate fingerprint by performing a random permutation of the first intermediate fingerprint. Performing the random permutation can include performing, using the bit shifter, a bit shift operation by a random number of bits on the first intermediate fingerprint; performing, using the adder, an addition operation by a random constant on the second intermediate fingerprint; and using the second intermediate fingerprint as the intermediate fingerprint. 
The fingerprint circuit can include a polynomial subcircuit, a first logic gate, and a second logic gate. The polynomial subcircuit can be configured for determining the first intermediate fingerprint, where the polynomial subcircuit includes a plurality of shift registers and a plurality of logic gates arranged to generate a Rabin fingerprint of the shingle, where the Rabin fingerprint represents a hash value of the contents of the received block. The first logic gate can be configured for sampling a first subset of bits from the first intermediate fingerprint by bit masking a subset of high order bits from the first intermediate fingerprint. The second logic gate can be configured for determining the second intermediate fingerprint, upon performing a logical AND operation to determine whether the sampled first subset of bits from the first intermediate fingerprint matches the bit mask pattern; and using the second intermediate fingerprint as the intermediate fingerprint. The fingerprint circuit can include a polynomial subcircuit, a first logic gate, and a second logic gate. The polynomial subcircuit can be configured for determining the first intermediate fingerprint, where the polynomial subcircuit includes a plurality of shift registers and an adder, where the plurality of shift registers and the adder are arranged to calculate a random irreducible polynomial based on the shingle, where the random irreducible polynomial represents a hash value of the contents of the received block. The first logic gate can be configured for sampling a first subset of bits from the first intermediate fingerprint by bit masking a subset of low order bits from the first intermediate fingerprint. 
The second logic gate is configured for determining the second intermediate fingerprint, upon performing a logical AND operation to determine whether the sampled first subset of bits from the first intermediate fingerprint matches the bit mask pattern; and using the second intermediate fingerprint as the intermediate fingerprint. The polynomial subcircuit can further include a lookup table, where the lookup table includes a pre-computed term of the random irreducible polynomial, and where a term of the random irreducible polynomial is calculated based on looking up a corresponding pre-computed term in the lookup table. The polynomial subcircuit can be configured to store in the shift registers the random irreducible polynomial (b_1*p^7 + b_2*p^6 + b_3*p^5 + b_4*p^4 + b_5*p^3 + b_6*p^2 + b_7*p^1 + b_8) mod M, where b_i denotes an i'th byte string of the shingle, where p denotes a prime constant, and where M denotes a constant.
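The polynomial fingerprint, the lookup table of pre-computed terms, and the bit-mask sampling described above can be sketched together in Python. The values of `P` and `M`, the treatment of each b_i as a single byte of an 8-byte shingle, the choice of the four low-order bits as the sampled subset, and all function names are assumptions made for illustration only.

```python
from typing import List, Optional

P = 1_000_003      # prime constant p (assumed value)
M = 1 << 32        # modulus constant M (assumed value)
SHINGLE_SIZE = 8   # bytes b_1 .. b_8


def poly_direct(shingle: bytes) -> int:
    """Evaluate (b_1*p^7 + ... + b_7*p + b_8) mod M with Horner's rule."""
    acc = 0
    for byte in shingle:
        acc = (acc * P + byte) % M
    return acc


# TABLE[i][b] holds the pre-computed term b * p^(7 - i) mod M.
TABLE: List[List[int]] = [
    [(b * pow(P, SHINGLE_SIZE - 1 - i, M)) % M for b in range(256)]
    for i in range(SHINGLE_SIZE)
]


def poly_lookup(shingle: bytes) -> int:
    """Same polynomial, using one table lookup per byte and additions."""
    return sum(TABLE[i][b] for i, b in enumerate(shingle)) % M


def sample(first_fp: int, mask_bits: int = 4, pattern: int = 0) -> Optional[int]:
    """Return the remaining bits if the sampled bits match the pattern.

    A result of None means the shingle is skipped and the next one is processed.
    """
    mask = (1 << mask_bits) - 1
    if (first_fp & mask) != pattern:
        return None
    return first_fp >> mask_bits
```

With a 4-bit mask and a uniformly distributed first fingerprint, roughly one shingle in sixteen would pass the sampling step, which reduces the number of comparisons needed per block.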
Various objects, features, and advantages of the present disclosure can be more fully appreciated with reference to the following detailed description when considered in connection with the following drawings, in which like reference numerals identify like elements. The following drawings are for the purpose of illustration only and are not intended to be limiting of the invention, the scope of which is set forth in the claims that follow.
The present disclosure relates to a content locality based cache design that can be implemented in hardware, firmware, or as a custom application-specific integrated circuit (ASIC). As used herein, content locality refers to systems and methods for caching data blocks according to contents identified to be similar to other cached blocks. For example, some embodiments of the content locality cache can determine data to cache based on recency and frequency of internal contents of data blocks.
Traditional caching has been based on spatial locality, i.e. caching data blocks with similar logical block addresses (LBAs) in memory. Instead, content locality can keep data contents in cache that are popular, and shared by many active data blocks. Popularity and active sharing can represent two indicators that a data block can exhibit content locality. In some embodiments, popularity can be identified by tracking frequency and recency of “content signatures,” also referred to herein as “fingerprints,” which are being accessed by I/O operations. In some embodiments, fingerprint circuits, also referred to herein as signature computation circuits or similarity detection circuits, can identify data blocks that exhibit content locality based on similarity. Accordingly, the signature computation circuits can identify popular content that is cached. Furthermore, the content locality based cache can use delta compression hardware circuits and/or software modules to improve cache usage upon determination or creation of a corresponding associated block.
In some embodiments, the content locality cache can be self-contained and can be offloaded to a host bus adapter (HBA) card or storage controller. Example storage controllers can include an SSD controller, an HDD controller, or a hybrid HDD controller. Example memory used in SSD may include flash memory, phase change memory (PCM), magnetoresistive random access memory (MRAM or MeRAM), or memory resistor (memristor). A high level logic design of the cache can exploit content locality of I/O operations. Advantages of the design can include minimal write operations on SSD, high I/O performance because of effective caching, data reduction as a superset of traditional data deduplication, longer endurance for flash memory and PCM SSD, low overhead in the range of nanoseconds, and scalability and expandability to large server clusters with coherent multiple caches.
To make an SSD an effective cache between system RAM and HDD, systems using content locality caching can reduce write operations in SSD to leverage physical properties of the SSD. The cache can exploit content locality that is independent of and in addition to temporal locality and spatial locality. Temporal locality and spatial locality are principles that have driven traditional cache design. Temporal locality represents the concept that data that has been read or written recently can benefit from caching, under an assumption that the system is likely to access the data again. Spatial locality represents the concept that the system can benefit from caching related data in close-by memory addresses. Experimental results and customer installations have shown advantages of content locality based caching. For example, the content locality cache has been implemented as software working at the level of data blocks as a device driver running in OS kernels. This software implementation has advantages of working with any storage hardware, and being portable to different operating systems to provide performance advantages.
However, the fact that the prototype can be implemented as software running on servers can also have limitations. First, the software implementation can use system resources of the server on which the software runs. Example resources used can include CPU time, system RAM space, and bus bandwidth. In contrast, a hardware-based implementation can offload cache functions to a controller or device level, thereby allowing the host to spend more time and resources working on applications. Second, the overhead of software cache management algorithms can take microseconds of precious I/O processing time. As device technologies advance, access times of PCM, MRAM, and memristor come down to the range of nanoseconds. Therefore, the high speed cache design may benefit from overhead shorter than microsecond-length overhead. Accordingly, the content locality based cache exploits physical properties of SSDs and data content locality of I/O operations to provide performant I/O without using server resources and while providing manageable overhead in the nanosecond range.
In the summary above, the detailed description, the claims below, and in the accompanying drawings, reference is made to particular features (including method steps). It is to be understood that the disclosure in this specification includes all possible combinations of such particular features. For example, where a particular feature is disclosed in the context of a particular aspect or embodiment, or a particular claim, that feature can also be used, to the extent possible, in combination with and/or in the context of other particular aspects and embodiments, and embodiments generally.
Where reference is made herein to a method comprising two or more defined steps, the defined steps can be carried out in any order or simultaneously (except where the context would indicate otherwise), and the method can include one or more other steps which are carried out before any of the defined steps, between two of the defined steps, or after all the defined steps (except where the context would indicate otherwise).
A host computer system can refer to any computer system that uses and accesses a data storage system for data read and data write operations. Such host system may run applications such as databases, file systems, web services, and so forth.
SSD can refer to any solid state disks such as NAND gate flash memory, NOR gate flash memory, phase change memory (PCM), memory resistor (memristor) memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM or MeRAM), or any nonvolatile solid state memory having the properties of fast reads, slow writes, and limited life time due to wearing caused by write operations.
Mass storage can include, but is not limited to, hard disk drives (HDDs), nonvolatile RAM (NVRAM), MEMS storage, and battery backed DRAM. Although the descriptions in this disclosure include hard disk drives with spinning disks, generally any type of non-volatile storage can be used in place of a hard disk drive.
An intelligent processing unit can refer to any computation engine capable of high performance computation and data processing, including but not limited to GPU (graphic processing unit), CPU (central processing unit), embedded processing unit, MCU (micro controller unit), a custom ASIC (application-specific integrated circuit), firmware, or custom hardware. The term intelligent processing unit and GPU/CPU are used interchangeably in the present disclosure.
HBA can refer to any host bus adaptor that connects a storage device to a host through a bus, such as PCI, PCI-Express, PCI-X, InfiniBand, HyperTransport, and the like. Examples of HBAs include SCSI PCI-E card, SATA PCI-E card, iSCSI adaptor card, Fibre Channel PCI-E card, etc.
LBA can refer to a logical block address that represents the logical location of a data block in a storage system. A host computer may use a logical block address to read or write a data block.
Primary storage 308 includes but is not limited to spinning hard disk drives, non-volatile random access memory (NVRAM), battery backed dynamic random access memory (DRAM), MEMS storage, SAN, NAS, virtual storage, and the like. Primary storage 308 may store deltas in delta blocks. A delta represents differences between a data block of an active disk I/O operation and its corresponding reference block. Delta blocks may be data blocks that contain multiple deltas. A delta may be derived dynamically at run time. The delta may represent a difference between a data block of an active primary storage I/O operation and its corresponding reference block that may be stored in SSD 304. Intelligent processing unit 310 may be any type of computing engine such as a GPU, CPU, or MCU capable of doing computations such as similarity detection, delta derivations upon I/O writes, combining deltas with reference blocks upon I/O reads, data compression and decompression, and other necessary functions for interfacing primary storage 308 with host 302. Although
Intelligent processing unit 310 first determines whether requested data block 608 has a corresponding reference block 602 stored in SSD 304. If a corresponding reference block 602 is stored in SSD 304, intelligent processing unit 310 accesses corresponding reference block 602 stored in SSD 304 and reads corresponding delta 604 from either cache or primary storage based on the requested data block metadata that is accessible to intelligent processing unit 310. Intelligent processing unit 310 then combines reference block 602 with delta 604 to obtain the requested contents of data block 608. Intelligent processing unit 310 then returns combined data block 608 to host 302.
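The combining step on a read, and the matching delta derivation on a write, can be illustrated with a short Python sketch. The delta encoding used here (a list of offset and replacement-byte runs over equal-length blocks) and the function names `derive_delta` and `combine` are assumptions for the example; the disclosure does not specify the delta format at this point.

```python
from typing import List, Tuple

# Assumed delta format: runs of (offset, replacement bytes).
Delta = List[Tuple[int, bytes]]


def derive_delta(reference: bytes, block: bytes) -> Delta:
    """Record each run of bytes where `block` differs from `reference`."""
    if len(reference) != len(block):
        raise ValueError("this sketch only handles equal-length blocks")
    delta: Delta = []
    i = 0
    while i < len(block):
        if block[i] != reference[i]:
            start = i
            while i < len(block) and block[i] != reference[i]:
                i += 1
            delta.append((start, block[start:i]))
        else:
            i += 1
    return delta


def combine(reference: bytes, delta: Delta) -> bytes:
    """Rebuild the requested block from its reference block and delta."""
    out = bytearray(reference)
    for offset, data in delta:
        out[offset:offset + len(data)] = data
    return bytes(out)
```

For a 4 KB block that differs from its reference in only a few bytes, the delta in this format holds just those bytes plus their offsets, which is why many deltas may fit in a single delta block.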
Since deltas may generally be small due to data regularity and content locality, some embodiments store deltas in a compact form so that one SSD or HDD operation contains enough deltas to generate tens or even hundreds of I/O operations. The goal may be to convert the majority of I/O operations from traditional seek-rotation-transfer I/O operations on HDD to I/O operations involving mainly SSD reads and high-speed computations. The former can take tens of milliseconds whereas the latter can take tens of microseconds or even nanoseconds using implementations of the content locality based cache in hardware and/or software. The speedups described herein can represent differences of three to six orders of magnitude in improvements. As a result, the SSD in some embodiments may function as an integral part of a cache memory architecture that takes full advantage of fast SSD read performance while avoiding the drawbacks of SSD erase/write performance. Because of 1) high speed read performance of reference blocks stored in SSDs, 2) potentially large number of small deltas packed in one delta block stored in HDD, and 3) high performance hardware coupling the two, some embodiments greatly improve disk I/O performance.
Examples of basic operations of cache 808 are described below for two types of operations: (1) read I/O and (2) write I/O.
Disk controller 820 can receive a read I/O requesting the contents of a block. For example, disk controller 820 can receive the read I/O from host 802 via host interface 812. The content locality based cache can check to see if the requested block is in cache 808. If there is a cache hit, disk controller 820 can return the requested contents immediately. If the block is an associated block (i.e., if the block is able to be represented by a reference block and a delta block), disk controller 820 can perform decompression to recreate the requested contents. If there is a cache miss, disk controller 820 can initiate a read operation from primary storage to load the requested data from primary storage. In some embodiments, primary storage can be a hard disk drive (HDD) or storage area network (SAN). When disk controller 820 loads data to the cache and returns the requested content to the host, disk controller 820 can perform fingerprint computation and similarity detection in parallel, to classify the missed block. If the missed block is determined to be similar enough to a reference block, disk controller 820 can perform data compression. Disk controller 820 can write the requested block to cache 808 according to its type: e.g., reference block, associated block (i.e., delta block), or independent block.
Upon a write I/O, disk controller 820 can perform fingerprint computation and similarity detection. If disk controller 820 identifies a reference block based on the fingerprint computation and similarity detection, disk controller 820 can perform data compression. Depending on whether the write request represents a cache hit or miss and where in the cache the requested block hits, disk controller 820 can perform cache operations similar to the read I/O operations described above. If cache 808 operates as a write-through cache, the data block can be directly written to HDD in parallel to all cache operations such as fingerprint computation and similarity detection. If cache 808 operates as a write-back cache, disk controller 820 can write the data block as dirty data in cache 808 only. Disk controller 820 can later write the dirty data to HDD using write algorithms including pre-cleaning, on-demand destaging, or FIFO flushing. If peer to peer caches are implemented for high availability (HA), disk controller 820 can perform data mirroring after compression to selected peer caches. In some embodiments, disk controller 820 can perform data mirroring using a cache coherence protocol including a sliding window of eager execution transactions (SWEET). Further information regarding the SWEET cache coherence protocol may be found in U.S. Pat. No. 8,140,772, entitled “System and method for maintaining redundant storages coherent using sliding windows of eager execution transactions” and filed Oct. 31, 2008, the entire contents of which are incorporated by reference herein.
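The difference between write-through and write-back operation can be illustrated with a toy Python model. It deliberately omits fingerprint computation, similarity detection, compression, and mirroring, and it models only FIFO flushing. The class `ToyCache`, its attributes, and the dictionary standing in for the HDD are hypothetical names used for the example.

```python
from collections import OrderedDict
from typing import Dict, Optional


class ToyCache:
    """Toy model of the write-through and write-back policies."""

    def __init__(self, write_back: bool) -> None:
        self.write_back = write_back
        self.cache: Dict[int, bytes] = {}
        self.hdd: Dict[int, bytes] = {}   # stands in for primary storage
        # Dirty LBAs in the order they first became dirty (FIFO).
        self.dirty: "OrderedDict[int, None]" = OrderedDict()

    def write(self, lba: int, data: bytes) -> None:
        self.cache[lba] = data
        if self.write_back:
            # Keep the data in cache only and remember that it is dirty.
            self.dirty.setdefault(lba, None)
        else:
            # Write-through: primary storage is updated on every write.
            self.hdd[lba] = data

    def flush(self, max_blocks: Optional[int] = None) -> int:
        """Destage dirty blocks to primary storage, oldest first."""
        flushed = 0
        while self.dirty and (max_blocks is None or flushed < max_blocks):
            lba, _ = self.dirty.popitem(last=False)
            self.hdd[lba] = self.cache[lba]
            flushed += 1
        return flushed
```

In the write-back case, a block that is overwritten several times before a flush reaches primary storage only once, with its latest contents.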
Software module 1210 runs at the device driver level such as a generic block layer, a filter driver layer, or any layer in the I/O stack. Software module 1210 controls an independent flash memory 1214 and independent HDD 1218 that may be connected to system bus 1220. Software module 1210 interfaces over system bus 1220 with standard off-the-shelf hardware for flash memory 1214 and HDD 1218. System bus 1220 includes but is not limited to protocols such as PCI, PCI-Express, PCI-X, HyperTransport, InfiniBand, SAS, SATA, SCSI, PATA, USB, etc. Software module 1210 runs on host 1202. Software module 1210 operates and communicates directly with flash memory 1214 and HDD 1218. Software module 1210 also controls part of system RAM 1212 as a cache to buffer reference blocks, deltas, and independent blocks for efficient I/O operations. Software module 1210 also interfaces and communicates with upper layer software modules such as OS 1208 and applications 1204 running on host 1202.
In some embodiments, software module 1210 may be implemented without requiring hardware changes, but may use system resources such as CPU, RAM, and system bus. For I/O bound jobs, CPU utilization may be very low and the additional overhead caused by the software is expected to be small. This is particularly evident as processing power of CPUs may increase more rapidly than that of I/O systems. In addition, software implementations may require different designs and implementations for different operating systems.
Receiving a block to cache (step 1368) can include receiving a block from a write operation or a read operation from the host. For example, a write I/O operation can include new contents for storing a new block in the content locality cache or updating an existing block in the cache. The content locality cache can also receive the block as a result of a read I/O operation, for example upon a cache miss. A cache miss can occur when a requested block for reading is not found in the cache. The content locality cache can retrieve the requested block from primary storage, return the requested contents to the host, and cache the retrieved block. Accordingly, upon a subsequent read operation requesting the contents of the same block, the subsequent read operation can result in a cache hit that speeds performance because the content locality cache is able to avoid reading the requested contents from the relatively slower primary storage.
Determining a sub-signature “sketch” of the received block (step 1370) can include determining multiple signatures, sometimes referred to herein as “fingerprints,” of a block. The fingerprints can represent the contents of the received block. The fingerprints can speed detection of content similarity among blocks by providing relatively smaller units that are easier to compare algorithmically because the units are discrete. In some embodiments, the content locality cache can divide the received block into subsets, sometimes referred to herein as “shingles.” A shingle can be a subset of an overall block. For example, the size of the received block can be 4 KB, and the corresponding size of the shingle can be 8 bytes. (Accordingly, for an example block of size 4 KB, i.e., 4,096 bytes, there can be 4K-7, i.e., 4,089, shingles corresponding to the overlapping 8-byte subsets of the block.) In some embodiments, fingerprint circuits, also referred to herein as signature computation circuits, can process shingles in parallel to identify multiple representative fingerprints or signatures of a shingle. In some embodiments, the fingerprint circuits can process the shingles using Mersenne primes, Rabin fingerprinting, or random irreducible polynomials (shown in
Method 1366 can include searching a reference data area of the content locality cache to determine content similarity of the received block, based on the sub-signature “sketch” (step 1372). For example, the content similarity can be determined by comparing sketches stored in a tag area of the content locality cache. If the number of matching sub-signatures in the sketch exceeds a threshold (step 1374: Yes), the received block can be determined to have similar content to a reference block already in the content locality cache. Accordingly, the content locality cache can create a “delta” that represents a compressed version of the received block (step 1376). The content locality cache can compare the delta with reference blocks to determine similarity. For example, if the delta is determined to differ by more than a threshold (step 1378: No), then the new data block can be characterized as an independent data block (step 1382). An example threshold is ½: the new data block can be treated as independent if the delta indicates that it differs from the reference block by more than one half. An independent data block refers to a block that can be cached, but for which the caching decision is based on most recent use (e.g., temporal locality) or similarity of memory address (e.g., spatial locality), rather than similarity of content (e.g., content locality). Similarly, if the number of matching sub-signatures in the “sketch” is determined to be less than the threshold (step 1374: No), then the new data block can also be characterized as an independent data block (step 1382).
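By way of illustration only, the decision flow of steps 1372 through 1382 may be sketched in Python as follows. The function name `classify_block`, the dictionary of reference sketches, the callback `delta_size_for`, and the choice to treat a match count equal to the threshold as a match are assumptions made for this sketch and are not taken from the disclosure.

```python
SKETCH_MATCH_THRESHOLD = 6    # assumed: e.g., six of eight fingerprints (step 1374)
DELTA_RATIO_THRESHOLD = 0.5   # the example threshold of one half (step 1378)

def classify_block(sketch, reference_sketches, delta_size_for, block_size=4096):
    """Return ("associated", ref_id) or ("independent", None).

    sketch: list of fingerprints for the received block.
    reference_sketches: hypothetical dict mapping ref_id -> list of fingerprints.
    delta_size_for: hypothetical callback returning the delta size in bytes
                    when the block is compressed against the given ref_id.
    """
    best_ref, best_matches = None, 0
    for ref_id, ref_sketch in reference_sketches.items():   # step 1372
        matches = len(set(sketch) & set(ref_sketch))
        if matches > best_matches:
            best_ref, best_matches = ref_id, matches
    if best_matches < SKETCH_MATCH_THRESHOLD:                # step 1374: No
        return ("independent", None)
    delta_size = delta_size_for(best_ref)                    # step 1376
    if delta_size > DELTA_RATIO_THRESHOLD * block_size:      # step 1378: No
        return ("independent", None)
    return ("associated", best_ref)
```

A block classified as "associated" would be stored as a delta against the returned reference block, whereas an "independent" block would be cached whole.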
Updating metadata in the HeatMap (step 1384) can include, for example, updating measures of “popularity” of the received block. The popularity measure can measure an extent to which the contents of the received block are shared by other active data blocks in the content locality cache.
Method 1386 can receive a read I/O operation requesting the contents of a block (step 1388). For example, the host can send the read I/O operation to the content locality cache. Method 1386 can determine whether the requested block has a corresponding reference block (step 1390). For example, the content locality cache can compare metadata associated with the requested block with metadata associated with the cached reference blocks, associated blocks, and independent data blocks to determine whether the requested block has a corresponding reference block.
If the requested block has a corresponding reference block (step 1390: Yes), method 1386 can include decompressing the contents of the requested block based on the corresponding block and a corresponding delta (step 1392). For example, the content locality cache can determine the corresponding delta by retrieving an associated block storing the delta from an associated block area of the content locality cache. In some embodiments, method 1386 can include recreating requested content for the received block by starting from the corresponding associated block and incorporating shingles of the corresponding reference block.
If the requested block does not have a corresponding reference block (step 1390: No), method 1386 can include finding the requested block either as an independent block or in primary storage (step 1394). In some embodiments, finding the requested block can include determining whether there is a cache hit or a cache miss. Upon a cache hit, method 1386 can determine that the requested block has a corresponding independent block because step 1390: No indicates the requested block lacks a reference block. For example, method 1386 can determine that the requested block has a corresponding independent block by comparing metadata of the requested block with metadata of the independent blocks. Upon a cache miss, method 1386 can determine that the requested block can be found in primary storage, because the cache miss can indicate that the requested block may not be found in the content locality cache, either as a reference block, associated block, or independent block.
Method 1386 can proceed to return the contents of the requested block to the host (step 1396) and fulfill the received read I/O operation.
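As a non-limiting sketch, the read path of method 1386 may be expressed in Python as follows. The dictionaries standing in for the tag array, the three data areas, and primary storage, as well as the callback `apply_delta`, are hypothetical. Caching a missed block as an independent block is a simplification assumed here; an actual embodiment could instead pass the block through the classification of method 1366.

```python
def read_block(lba, tags, reference_area, associated_area, independent_area,
               primary_storage, apply_delta):
    """Serve a read I/O for the block at the given LBA (steps 1388-1396)."""
    tag = tags.get(lba)
    if tag is not None and tag.get("ref") is not None:   # step 1390: Yes
        reference = reference_area[tag["ref"]]
        delta = associated_area[lba]
        return apply_delta(reference, delta)              # step 1392
    if lba in independent_area:                           # step 1394: cache hit
        return independent_area[lba]
    block = primary_storage[lba]                          # step 1394: cache miss
    independent_area[lba] = block   # assumed: cache the block until reclassified
    return block
```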
Host 1302 can receive an I/O operation such as a read or write I/O. The received I/O can include a memory address such as a logical block address (LBA) 1360 for storage into cache storage 1322. Cache storage 1322 can include data array 1338 and tag array 1336 associated with the cache. Signature computation circuit 1324 can perform fingerprint computations and comparisons for use in reference block identification and delta block compression. Compression circuit 1326 can perform delta compression for write I/Os and cache misses. Decompression circuit 1328 can perform data decompression for read I/Os that result in cache hits in associated blocks (whereby structure 1320 combines reference blocks with delta blocks to recreate requested data block contents). Cache management circuit 1330 can perform background flushing, replacement algorithms, and periodic scanning for classification of blocks.
Signature computation circuit 1324 can include fingerprint circuits 1340a, 1340d, comparators 1340b, 1340e, and fingerprint buffers 1340c, 1340f to store resulting fingerprints. Signature computation circuit 1324 can compute a fingerprint for each shingle of a predefined size on a data block. A shingle can represent a window, or subset, of a data block for content analysis to determine content similarity. A fingerprint can represent a content signature of a data block or a content signature of a subset of a data block. For example, a shingle can represent a window, or subset, of a data block, where the window is shifted one byte at a time to determine a relevant subset of a data block for analysis. If an example shingle size is 8 bytes and block size is 4 KB, then signature computation circuit 1324 can compute 4K-7 fingerprints using various iterations. Among the computed fingerprints, structure 1320 can select a certain number of fingerprints to represent a “sketch” of a data block. For example, signature computation circuit 1324 can store about six to eight selected fingerprints in fingerprint buffers 1340c, 1340f, or any other number, for representing an overview of the content of a data block. Signature computation circuit 1324 can compute intermediate fingerprints in the process of selecting the overall sketch of the data block.
Fingerprint circuits 1340a, 1340d can perform the intermediate computations to determine the intermediate fingerprints. In some embodiments, fingerprint circuits 1340a, 1340d can use Mersenne primes, Rabin fingerprinting, random irreducible polynomials, or other processes that can provide an overview of content of a shingle of a data block, or of a data block generally. In some embodiments, comparators 1340b, 1340e can store intermediate fingerprints for comparing against a current maximum or minimum fingerprints stored in fingerprint buffers 1340c, 1340f. If an intermediate fingerprint computed by fingerprint circuits 1340a, 1340d is determined to be greater or lower than a current maximum or current minimum fingerprint stored in fingerprint buffers 1340c, 1340f, then comparators 1340b, 1340e can replace the contents of fingerprint buffers 1340c, 1340f with the new maximum or minimum fingerprint. Structure 1320 can use the fingerprints and sketch to perform similarity detection among data blocks, by comparing respective sketches or groups of fingerprints.
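For illustration, a software analogue of the shingle fingerprinting and minimum-value selection described above may be sketched as follows. The multiplicative hash reduced modulo the Mersenne prime 2^61 - 1, the multiplier constant, and the choice of keeping the eight smallest fingerprints are assumptions of this sketch; the disclosure leaves the exact hash and the selection rule (maximum or minimum) open.

```python
import heapq

MERSENNE_PRIME = (1 << 61) - 1   # 2^61 - 1
SHINGLE_SIZE = 8                 # bytes, as in the 8-byte example
MULTIPLIER = 0x9E3779B97F4A7C15  # assumed odd constant, for illustration only

def fingerprint(shingle):
    """Fingerprint of one shingle: multiply, then reduce modulo a Mersenne prime."""
    return (int.from_bytes(shingle, "big") * MULTIPLIER) % MERSENNE_PRIME

def sketch(block, k=8):
    """Slide the shingle window one byte at a time and keep the k smallest
    distinct fingerprints as the sketch of the block."""
    fingerprints = {
        fingerprint(block[i:i + SHINGLE_SIZE])
        for i in range(len(block) - SHINGLE_SIZE + 1)
    }
    return heapq.nsmallest(k, fingerprints)
```

For a 4 KB block the loop visits 4096 - 8 + 1 = 4089 shingles, which corresponds to the 4K-7 fingerprints mentioned above. Because the sketch depends only on the set of shingle contents and not on their positions, two blocks whose contents are shifted relative to each other can still yield largely overlapping sketches.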
Signature computation circuit 1324 can include several different processes implemented in hardware, software, or a combination for fingerprint calculation and sampling (shown in
Cache storage 1322 can include the actual memory cells used to store cached data associated with requested blocks in the SSD cache. In some embodiments, the memory cells can include flash memory cells, PCM memory cells, or MRAM cells. Cache storage 1322 can be divided into two parts: tag array 1336 and data array 1338. Tag array 1336 can store logical block addresses (LBAs), fingerprints, and status information corresponding to each cached data block. In some embodiments, the LBA and fingerprint portions of tag array 1336 can be implemented using content addressable memory (CAM), so that structure 1320 can perform associative search upon each access. For example, upon an I/O operation, structure 1320 can search the cache associatively in tag array 1336 to find a match based on the LBA of the I/O request. If structure 1320 finds a match, a cache hit occurs. Otherwise, the I/O operation results in a cache miss. In some embodiments, structure 1320 can be based on a fully associative cache design.
In some embodiments, if the cache size is large, a set associative mapping can be implemented. In a set associative mapping, part of an LBA of interest can go through a decoder to index one of N sets. Within the indexed set, structure 1320 can perform associative search to find a matching LBA for a cache hit. In some embodiments, the fingerprint portion of tag array 1336 can also be implemented using CAM cells, so that structure 1320 can use associative search to find partially matching signatures for similar blocks. In further embodiments, the amount of partial match is a system design parameter that can be tuned to improve performance. For example, structure 1320 can use a threshold of six of eight fingerprints 1340c, 1340f in a partial match for similarity determination. A reference pointer field can store a location of a reference block associated with a data block of interest. Status bits can contain a cache status of a block of interest. Example values for the status bits may include clean, dirty, least recently used (LRU) counter value, etc., as used by cache management circuit 1330.
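A minimal software model of the set associative lookup may look as follows. The class name `SetAssociativeTags`, the use of the low-order part of the LBA as the set index, and the oldest-first victim selection are assumptions for illustration; the disclosure mentions LRU counter values in the status bits but does not prescribe a replacement policy.

```python
class SetAssociativeTags:
    """Toy tag array: part of the LBA selects a set, then the set is searched."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # One dict per set, mapping LBA -> tag entry (insertion ordered).
        self.sets = [dict() for _ in range(num_sets)]

    def lookup(self, lba):
        """Return the tag entry on a cache hit, or None on a cache miss."""
        return self.sets[lba % self.num_sets].get(lba)

    def insert(self, lba, entry):
        """Insert or update an entry, evicting the oldest entry of a full set."""
        selected = self.sets[lba % self.num_sets]
        if lba not in selected and len(selected) >= self.ways:
            victim = next(iter(selected))   # assumed policy: oldest inserted
            del selected[victim]
        selected[lba] = entry
```

With four sets and two ways, LBAs 0, 4, and 8 map to the same set, so inserting all three evicts the entry for LBA 0.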
In some embodiments, data array 1338 can be partitioned into three parts: (1) reference data area 1342a, (2) associated block area 1342b, and (3) independent data area 1342c. In further embodiments, the size of reference data area 1342a can be selected to be small while the size of associated block area 1342b can be selected to be large. Structure 1320 can use compression circuit 1326 to compress data in associated block area 1342b against reference blocks in reference data area 1342a. Independent data area 1342c contains independent blocks. Independent blocks refer to blocks that do not show content locality, but may be cached for other reasons, e.g., based on temporal locality or spatial locality. In some embodiments, the illustrated border lines that separate reference data area 1342a, associated block area 1342b, and independent data area 1342c can change dynamically. For example, the size of a respective area may change depending on I/O workload and data access locality of running applications.
Cache management circuit 1330 can include HeatMap 1358, timer 1334, and counter 1356. Cache management circuit 1330 can perform background flushing, replacement, and periodic scanning for classification of blocks. HeatMap 1358 can store fingerprints corresponding to encoded reference blocks to form a table or a directory. HeatMap 1358 can be indexed according to shingles. As described earlier, a shingle can represent a sliding window, or subset of bits, of contents of data blocks. For example, HeatMap 1358 can be indexed using hash functions based on determining Mersenne primes of each shingle. When a shingle of an incoming data block matches a shingle indexed in the directory, an associated block can store a “delta” corresponding to the incoming data block. The delta can include (1) an offset of the shingle in the reference block and (2) a matched length. Cache management circuit 1330 can also manage status bits. Status bits can contain a cache status of a block of interest. Example values for the status bits may include clean, dirty, least recently used (LRU) counter value, etc. Cache management circuit 1330 can use the status bits to perform background processes such as background flushing, replacement, and periodic scanning for classification of blocks. For example, timer 1334 can be used as an idle detector to determine when to perform the background processes. When performing the background processes, counter 1356 can use eviction logic to determine when a data block should be evicted after a certain threshold has been reached. Counter 1356 can also periodically scan data array 1338 to identify cached blocks that are candidates for reclassification.
For example, counter 1356 can scan for (1) reference blocks that should be reclassified into independent blocks and/or deltas for associated blocks, (2) associated blocks that should be reclassified into reference blocks and/or independent blocks, or (3) independent blocks that should be reclassified into reference blocks and/or deltas for associated blocks.
Compression circuit 1326 can include buffer 1344, delta compression module 1346, threshold comparator 1348, and logic gates 1350, 1352. Compression circuit 1326 can perform delta compression once a new data block is determined to be sufficiently similar to a reference block in reference data area 1342a. For example, buffer 1344 can store a received data block from a write I/O from host 1302. Delta compression module 1346 can compare the contents of buffer 1344 with reference blocks in reference data area 1342a to determine similarity. In some embodiments, threshold comparator 1348 can determine similarity. For example, if the content of the new data block is determined to differ by more than ½ from the reference blocks in reference data area 1342a, then the new data block can be characterized as an independent data block. The new data block can pass to logic gate 1350, for example, a logical AND gate, for storing the new data block into independent data area 1342c. If threshold comparator 1348 determines the new data block to be sufficiently similar (e.g., differing by less than ½), delta compression module 1346 can compress the contents of buffer 1344, for example using delta compression. Upon compression, logic gate 1352, for example, a logical AND gate, can store the newly compressed delta into an associated block in associated block area 1342b. Compression circuit 1326 can also be used during periodic scanning and block classification, to compress associated blocks against reference blocks. Structure 1320 may further store the delta together with its corresponding LBA, fingerprint, reference pointer, and cache status bits in tag array 1336.
If threshold comparator 1348 finds the delta after compression to be large, compression circuit 1326 can treat the earlier similarity detection as a false positive. An example of a large delta may be one half of the original size, if the threshold between a large delta and small delta is set to ½. For deltas that turn out to be large, compression circuit 1326 can store the received data block as an independent block in independent data area 1342c. For large deltas, the similarity detection and compression processes performed for the received data block may have been wasted because the processes may not result in a corresponding reference block or delta block for reference data area 1342a and/or associated block area 1342b. Accordingly, the number of such false detections can be lowered by tuning relevant parameters. Examples of parameters for tuning may include shingle size, fingerprint size, number of fingerprint matches, sampling size, compression threshold, etc. Furthermore, structure 1320 can perform similarity detection and compression in parallel with normal I/O operations and therefore avoid adversely slowing front end I/O performance.
In some embodiments, compression circuit 1326 can use high-speed compression hardware. Examples of high-speed compression hardware include hardware for performing parallel and pipelined encoding using reference blocks to form a table or a directory stored in cache management circuit 1330, sometimes referred to herein as HeatMap 1358. HeatMap 1358 can be indexed according to shingles. A shingle represents a sliding window, or subset of bits, of contents of data blocks. For example, HeatMap 1358 can be indexed using hash functions based on determining Mersenne primes of each shingle. When a shingle of an incoming data block matches a shingle indexed in the directory, an associated block can store a “delta” corresponding to the incoming data block. The delta can include (1) an offset of the shingle in the reference block and (2) a matched length. Parallel and pipelined implementations of compression circuit 1326 can thereby achieve performance of tens or even hundreds of gigabytes per second.
Decompression circuit 1328 can include decompression module 1360, multiplexer 1362, and logic gate 1364. Decompression circuit 1328 can perform delta decompression for received read I/Os, upon a cache hit. For example, a cache hit can happen in associated block area 1342b. Upon a cache hit, decompression module 1360 can reassemble a resulting data block to provide to host 1302. For example, decompression module 1360 can reassemble the resulting data block by identifying a reference block from reference data area 1342a and an associated block from associated block area 1342b. For example, decompression module 1360 can recreate requested content starting from an associated block and incorporating shingles of reference blocks. Multiplexer 1362 can also select between recreated content from decompression module 1360 and an independent block stored in independent data area 1342c. Upon a cache hit, logic gate 1364, for example a logical AND gate, can provide the requested block to host 1302.
Decompression module 1360 can extract a corresponding delta from the associated block and combine the delta with the reference block. For example, decompression module 1360 can identify shingles of reference blocks by following the offsets stored in the delta-encoded associated block. Since decompression circuit 1328 can affect performance of read I/O operations, decompression circuit 1328 can be designed to be relatively fast in hardware. According to related software-based implementations, decompression can perform much faster than compression. In some embodiments, decompression circuit 1328 can use high-speed decompression hardware. As described earlier, implementations of compression circuit 1326 can achieve performance of tens or even hundreds of gigabytes per second. Decompression circuit 1328 can perform even faster. In some embodiments, decompression circuit 1328 can recreate, or reform, requested contents using associated blocks and reference blocks.
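The offset-and-length form of the delta, and its use during decompression, may be illustrated with the following software sketch. The operation tuples `("copy", offset, length)` and `("lit", data)`, the function names, and the greedy match extension are assumptions for illustration and do not describe the hardware encoding of compression circuit 1326 or decompression circuit 1328.

```python
SHINGLE = 8  # bytes per shingle, as in the 8-byte example

def build_directory(reference):
    """Index each shingle of the reference block by its first offset."""
    directory = {}
    for offset in range(len(reference) - SHINGLE + 1):
        directory.setdefault(reference[offset:offset + SHINGLE], offset)
    return directory

def delta_encode(block, reference):
    """Encode block as copy operations into reference plus literal bytes."""
    directory = build_directory(reference)
    ops, literal, i = [], bytearray(), 0
    while i < len(block):
        offset = directory.get(block[i:i + SHINGLE])
        if offset is None:              # no matching shingle: emit a literal byte
            literal.append(block[i])
            i += 1
            continue
        length = SHINGLE                # extend the match as far as it goes
        while (i + length < len(block) and offset + length < len(reference)
               and block[i + length] == reference[offset + length]):
            length += 1
        if literal:
            ops.append(("lit", bytes(literal)))
            literal = bytearray()
        ops.append(("copy", offset, length))
        i += length
    if literal:
        ops.append(("lit", bytes(literal)))
    return ops

def delta_decode(ops, reference):
    """Rebuild the block by following offsets into the reference block."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Decoding performs only slice copies, whereas encoding must build and probe the directory, which is consistent with the observation above that decompression tends to be faster than compression.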
I/O scheduling for embodiments described herein may be quite different from scheduling for traditional disk storage. For example, the traditional elevator scheduling algorithm for hard drives (HDD) aims to combine disparate disk I/Os in an order that minimizes seek distances on the HDD. In contrast, content locality based caching facilitates changing I/O access scheduling to emphasize combining I/Os that may be similar to a reference block or may be represented by deltas that are contained in one delta block stored in the primary storage subsystem or a dedicated SSD storage module. To do this scheduling, an efficient metadata structure may relate LBAs of read I/Os to deltas stored in a delta block, and relate LBAs of write I/Os to reference blocks stored in SSD.
To serve I/O requests from the host, some embodiments use a sliding window mechanism similar to the windowing mechanism used in Transmission Control Protocol/Internet Protocol (TCP/IP). For example, write I/O requests inside a window may be candidates for delta compression with respect to reference blocks and may be packed into one delta block. Read I/O requests inside the window may be examined to determine all those that were packed in one delta block. The window slides forward as I/O requests are being served. Besides determining the best window size while considering both reliability and performance, some embodiments may be able to pack and unpack a batch of I/Os from the host so that a single HDD I/O operation generates many deltas.
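A simplified sketch of window-based batching is given below. The request dictionaries, the mapping `delta_block_of` from LBA to delta block identifier, and the function names are hypothetical. For brevity the window advances by its full size after each batch is served, which is an assumed simplification of a window that slides as individual requests complete.

```python
from collections import defaultdict

def schedule_window(window, delta_block_of):
    """Group the requests of one window.

    Returns (writes, reads) where writes is the list of write LBAs that are
    candidates for packing into one delta block, and reads maps a delta block
    identifier to the read LBAs whose deltas were packed in that block.
    """
    writes = [req["lba"] for req in window if req["op"] == "write"]
    reads = defaultdict(list)
    for req in window:
        if req["op"] == "read":
            reads[delta_block_of.get(req["lba"])].append(req["lba"])
    return writes, dict(reads)

def slide(requests, window_size, delta_block_of):
    """Yield one schedule per window position over the request queue."""
    for start in range(0, len(requests), window_size):
        yield schedule_window(requests[start:start + window_size], delta_block_of)
```

Reads that map to the same delta block can then be satisfied by loading that delta block once.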
Some embodiments may identify a reference block in SSD for each I/O operation. For a write I/O, the corresponding reference block, if present, needs to be identified for delta compression. If the write I/O is a new write with no prior reference block, a new corresponding reference block may be identified that has the most similarity to the data block of the write I/O. For a read I/O, as soon as the delta corresponding to the read I/O is loaded, its corresponding reference block may be found to decompress to the original data block.
Quickly identifying reference blocks may be highly beneficial to overall I/O performance. To identify reference blocks quickly, reference blocks may be classified into categories: (1) reference blocks with LBAs identical to delta blocks, (2) data blocks resulting from virtual machine creation, and (3) newly generated data blocks with LBAs that are unassociated with the reference blocks stored in SSD.
The first category includes reference blocks that have exactly the same LBAs that deltas have. For example, these reference blocks may be data blocks originally stored in the SSD, but changes occur on these blocks during online operations such as database transactions or file changes. These changes may be stored as a packed block of deltas to minimize random writes to SSD. Because of content locality, the deltas may be expected to be small. Identifying this type of block may be based on metadata mapping of deltas to reference blocks.
The second category contains data blocks generated as results of virtual machine creations. For example, these data blocks may include copies of guest operating systems (OS), guest application software, and user data that may be largely duplicates with very small differences. Virtual machine cloning enables fast deployment of hundreds of virtual machines in a short time. Different virtual machines access their own virtual disks using virtual disk addresses while the host operating system manages the physical disk using physical disk addresses. For example, two virtual machines send two read requests to virtual disk addresses V1_LBA0 and V2_LBA0, respectively. These two read requests may be translated by the underlying virtual machine monitor into physical disk addresses LBAx and LBAy, respectively, which may be considered as two independent requests by a traditional storage cache. Embodiments relate and associate these virtual and physical disk addresses by retrieving virtual machine related information from each I/O request. Requests with the same virtual address may be considered highly likely to be similar and may be combined based on similarity. In the current example, block V1_LBA0 (LBAx) is set as the reference block so content locality based caching may derive and keep the difference between V2_LBA0 (LBAy) and V1_LBA0 (LBAx) as a delta.
The third category consists of data blocks that may be newly generated with LBAs that are not associated with any of the reference blocks stored in SSD. For example, these data blocks may be created by file changes, file size increases, file creations, new tables, and so forth. While these new blocks may contain substantial redundant information compared to some reference blocks stored in the cache, quickly finding the corresponding reference blocks that have most similarity may allow helpful use of the delta-compression and other techniques described herein. In some embodiments, to support fast similarity detection, a similarity detection algorithm is described herein based on wavelet transforms using an intelligent processing unit, custom ASIC, firmware, hardware, or software modules. Traditionally, hashing has been used to identify identical blocks. In contrast, some embodiments may detect similarity between two data blocks by determining subsignatures that represent a combination of several hash values of subblocks. The similarity detection algorithm may further exploit modern CPU architectures.
The similarity of two blocks may be determined by the number of sub-signatures that the two blocks share. A sufficient number of shared sub-signatures may indicate that the two blocks are similar in content. However, such content similarity can be either an in-position match or an out-of-position match. In an out-of-position match, a position change is caused by content shifting (e.g., inserting a word at the beginning of a block shifts all remaining bytes down by the length of the word). To handle both in-position matches and out-of-position matches efficiently, embodiments use a combination of regular hash computations and wavelet transformation. Hash values for every three consecutive bytes of a block may be computed to produce a one-byte signature. A Haar wavelet transform may also be computed. The most frequently occurring sub-signatures may be selected along with a number of coefficients of the wavelet transform for signature matching. For example, the six most frequently occurring sub-signatures and three wavelet transform coefficients may be selected. That is, nine signature matching elements representing a block may be compared: six sub-signatures and three coefficients of the wavelet transform. Hash values may be computed with more or fewer than three consecutive bytes. Similarly, more or fewer than six frequent sub-signatures may be selected. Likewise, more or fewer than three Haar wavelet coefficients may be selected.
The three coefficients of the wavelet transform may include one total average, and positions of two largest amplitudes. The total average coefficient value may be used to pick the best reference if multiple matches are found for the other eight signatures.
Consider an example of a 4 KB block. Embodiments first calculate the hash values of all sets of three consecutive bytes to obtain 4K−2 sub-signatures. Among these sub-signatures, the six most frequent sub-signatures may be selected together with the three coefficients of the wavelet transform to carry out the similarity detection. If the number of matches of two blocks exceeds seven, they may be considered to be similar. Based on experimental observations, this position-aware sub-signature matching mechanism can recognize not only shifting of content but also shuffling of contents.
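The nine-element signature of the example above may be sketched in software as follows. The particular three-byte hash, the unnormalized form of the Haar transform, the tie-breaking rules, and the use of exact equality when comparing wavelet coefficients are assumptions of this sketch; the block length is assumed to be a power of two.

```python
from collections import Counter

def sub_signatures(block, top=6):
    """Hash every three consecutive bytes to one byte; keep the most frequent."""
    counts = Counter(
        (block[i] ^ (block[i + 1] << 1) ^ (block[i + 2] << 2)) & 0xFF  # assumed hash
        for i in range(len(block) - 2)
    )
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [value for value, _ in ranked[:top]]

def haar(values):
    """Unnormalized Haar transform: returns (total average, detail coefficients)."""
    data = [float(v) for v in values]
    details = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        differences = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = differences + details   # coarser levels end up first
        data = averages
    return data[0], details

def wavelet_features(block):
    """Total average plus positions of the two largest-amplitude coefficients."""
    average, details = haar(block)
    order = sorted(range(len(details)), key=lambda i: (-abs(details[i]), i))
    return (average, order[0], order[1])

def count_matches(a, b):
    """Number of matching elements out of six sub-signatures and three coefficients."""
    shared = len(set(sub_signatures(a)) & set(sub_signatures(b)))
    coefficients = sum(x == y for x, y in zip(wavelet_features(a), wavelet_features(b)))
    return shared + coefficients

def similar(a, b, threshold=7):
    """Two blocks are considered similar if the match count exceeds the threshold."""
    return count_matches(a, b) > threshold
```

For a 4 KB block, `sub_signatures` hashes 4096 - 3 + 1 = 4094 three-byte windows, which corresponds to the 4K−2 sub-signatures mentioned above.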
In some embodiments, sub-signatures of a data block may also be determined using sliding tokens. An example size of the token ranges from three bytes to hundreds of bytes. The token slides one byte at a time from the beginning to the end of the block. Hash values of each sliding token are computed using Rabin fingerprinting, Mersenne prime modulus, random irreducible polynomials, etc. Sampling or sorting techniques may be used to select a few sub-signatures of each block for similarity detection and reference selection processing.
For periodic similarity detection, the period length and set of blocks to be examined may be configured based on performance requirements and the sizes of available RAM, SSD and primary storage if available. After selection of a set of cached blocks (step 1802) to examine for similarity detection, popularity of each block may be computed (step 1804). The popularity of each block may then be evaluated. If the popularity of a block exceeds a predefined and configurable threshold value (step 1808: Yes), the data block may be designated as a reference block (step 1810) to be stored in RAM or SSD. If the intelligent processing unit determines that the similarity value of the two blocks is less than the threshold value (step 1808: No), the process continues with other data blocks (step 1812). Designated reference blocks may be stored in the cache, and metadata about the block may be updated to allow association of remaining similar blocks for delta-compression. Finally, after comparing all data blocks in the set, the HeatMap is cleared (step 1818) to begin a new phase of sub-signature generation and block popularity accounting. The HeatMap refers to a two-dimensional array of sub-signature related data used for similarity detection based on stored sub-signatures.
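One possible software reading of the popularity accounting is sketched below. The layout of the HeatMap as a table indexed by sub-signature position and sub-signature value, the definition of popularity as the sum of the counts for a block's own sub-signatures, and the function name `select_references` are assumptions for illustration; the disclosure states only that the HeatMap is a two-dimensional array of sub-signature related data.

```python
def select_references(block_signatures, threshold, num_positions=8, num_values=256):
    """Return the identifiers of blocks whose popularity exceeds the threshold.

    block_signatures: hypothetical dict mapping block id -> list of one-byte
    sub-signatures, one per position.
    """
    # heatmap[position][value] counts the blocks carrying that sub-signature.
    heatmap = [[0] * num_values for _ in range(num_positions)]
    for signatures in block_signatures.values():
        for position, value in enumerate(signatures):
            heatmap[position][value] += 1

    references = []
    for block_id, signatures in block_signatures.items():      # step 1804
        popularity = sum(heatmap[position][value]
                         for position, value in enumerate(signatures))
        if popularity > threshold:                              # step 1808: Yes
            references.append(block_id)                         # step 1810

    for row in heatmap:                                         # step 1818
        for index in range(len(row)):
            row[index] = 0
    return references
```

A block whose sub-signatures are shared by many other blocks in the examined set accumulates a high popularity and is therefore a good candidate reference block.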
In some embodiments, the data compression (step 1910) includes delta compression techniques. The delta compression techniques may compress the newly loaded block against the identified reference block to determine the degree of similarity between the two blocks (step 1910). The degree of similarity is tested by comparing the size of the delta generated through delta-compression against a maximum difference threshold (step 1914). If the delta-compression results in a delta that is at least as small as a delta size threshold (step 1914: Yes), the newly loaded block can be represented by a combination of the delta and a reference block. The intelligent processing unit therefore stores the derived delta in the cache system memory and updates cache management meta-data (step 1918).
If the delta-compression derived difference is larger than the delta size threshold (step 1914: No), then the block may be sufficiently different to warrant being maintained as an independent block (step 1912). In some embodiments, the newly loaded block may be stored as an independent block (i.e., a block that is not represented by a combination of deltas with respect to a reference block), and cache meta-data is updated (step 1912).
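The size test of step 1914 may be illustrated as follows. This sketch substitutes DEFLATE with a preset dictionary (the `zdict` parameter of Python's `zlib` module) for the delta compression technique of the disclosure, which is a stand-in chosen only because it is readily available. The default threshold of one half of the block size, the `cache` dictionary layout, and the function names are likewise assumptions.

```python
import zlib

def make_delta(block, reference):
    """Compress block using the reference block as a preset dictionary."""
    compressor = zlib.compressobj(zdict=reference)
    return compressor.compress(block) + compressor.flush()

def restore(delta, reference):
    """Recreate the original block from its delta and the reference block."""
    decompressor = zlib.decompressobj(zdict=reference)
    return decompressor.decompress(delta) + decompressor.flush()

def store_loaded_block(lba, block, reference_id, reference, cache, max_delta=None):
    """Store the block as a delta if it is small enough, else as independent."""
    if max_delta is None:
        max_delta = len(block) // 2          # assumed delta size threshold
    delta = make_delta(block, reference)     # step 1910
    if len(delta) <= max_delta:              # step 1914: Yes
        cache["associated"][lba] = (reference_id, delta)   # step 1918
        return "associated"
    cache["independent"][lba] = block        # step 1914: No, step 1912
    return "independent"
```

A block that differs from its reference block in only a few bytes produces a delta far below the threshold, while an unrelated block produces a delta close to the original block size and is kept as an independent block.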
Embodiments may attempt to store reference blocks in SSD that do not change frequently and that share similarities with many other data blocks. Guidelines for determining what data to store in SSD and how often to update SSD may be established. Such guidelines may trade off size, cost, available SSD memory, application factors, processor speed(s), and the like. An initial design guideline may allow storing as base data (e.g., in SSD or RAM) the entire software stack including OS and application software, as well as all active user data. This may be feasible with today's large-volume and less expensive NAND flash memories coupled with the fact that only a small percentage of file system data are typically accessed over a week. Data blocks of the software stack and base data may be reference blocks in SSD. Run time changes to these reference blocks may be stored in compressed form in delta blocks in HDD. These changes include changes to file data, database tables, software, virtual machine images, and the like. Such changes may be incremental so they can be very effectively compacted in delta blocks. As changes keep occurring, incremental drift may get larger and larger. To maintain efficiency, data stored in the SSD may be updated to avoid large incremental drift. Each update may result in changes in SSD and HDD as well as associated metadata.
The next design decision may be the block size of reference blocks and delta blocks. For example, larger reference blocks may reduce meta-data overhead and may allow more deltas to be covered by one reference block. However, if the reference block size is too large, the large size places a burden on the intelligent processing unit for computation and caching. Similarly, large delta blocks allow more deltas to be packed in and potentially yield high I/O efficiency, because one disk operation generates more I/Os (note that each delta in a packed delta block represents one I/O block). On the other hand, it may be challenging to ensure that I/Os generated by the host take full advantage of the large number of deltas in one delta block.
Another trade-off may be whether to allow deltas packed in one delta block to refer to a single reference block or multiple reference blocks in SSD. Using one reference block to match all deltas in one delta block allows compression/decompression of all deltas in the delta block to be done with one SSD read. On the other hand, it may be preferable that deltas compacted in one delta block belong to I/O blocks that may be accessed by the host in a short time frame (i.e., temporal locality) so that one HDD operation can satisfy more I/Os that may be in one batch. These I/O blocks in the batch may not necessarily be similar to exactly one reference block for compression purposes. As a result, multiple SSD reads may be necessary to decompress different deltas stored in one delta block. Furthermore, random read speed of SSD is so fast that it may be affordable to carry out reference block reads in this manner.
Some embodiments may include a DRAM buffer that temporarily stores I/O data blocks including reference blocks and delta blocks that may be accessed by host I/O requests. This DRAM may buffer the following types of data blocks: (1) compressed deltas, (2) data blocks for read I/Os after decompression, (3) reference blocks from SSD, and (4) data blocks of write I/Os. Management of the DRAM buffer may involve several interesting trade-offs. The first interesting tradeoff may be whether compressed deltas are cached for memory efficiency, or whether decompressed data blocks are cached to facilitate high performance read I/Os. If compressed deltas are cached, the DRAM can store a large number of deltas corresponding to many I/O blocks. However, upon each read I/O, on-the-fly computation may be necessary to decompress the delta to its original block. If decompressed data blocks are cached, these blocks may be readily available to read I/Os but the number of blocks that can be cached is smaller than caching deltas.
The second interesting tradeoff may be the space allocation of the DRAM buffer to the four types of blocks. Caching a large number of reference blocks can speed up the process of identifying a reference block, deriving deltas upon write I/Os, and decompressing a delta to its original data block. However, read speed of reference blocks in SSD may already be very high and hence the benefit of caching such reference blocks may be limited. Caching a large number of data blocks for write I/Os, on the other hand, helps with packing more deltas in one delta block but raises reliability issues. Static allocation of cache space to different types of data blocks may be simple but may not be able to achieve optimal cache utilization. Dynamic allocation, on the other hand, may utilize the cache more effectively but incurs more overhead.
The third interesting tradeoff may be fast write of deltas to SSD/primary storage versus delayed writes for packing a large number of deltas in one delta block. For reliability purposes, it may be preferable to perform a write to SSD/primary storage as soon as possible whereas for performance purposes it may be preferable to pack as many deltas in one block as possible before executing an SSD/primary storage write operation.
The computation time of Rabin fingerprint hash values is measured for large data blocks on intelligent processing units such as multi-core GPU/CPUs. A Rabin fingerprint is helpful in identifying reference blocks in SSD. The time it takes to compute hash values of a data block with size of 4 KB to 32 KB may be in the range of a few to tens of microseconds. In some embodiments, three of the most time-consuming processing parts have been implemented on the intelligent processing unit.
The first part implemented on the intelligent processing unit is signature generation for data blocks. In some embodiments, signature generation includes hashing calculations, sub-signature sampling, the Haar wavelet transform, and final selection of representative sub-signatures. As described previously, groups of consecutive bytes may be hashed to derive a distribution of sub-signatures. This operation can be done in parallel by calculating all hash values at the same time using multiple threads. Sampling and selection may be done using random sampling, sorting based on a histogram, or min-wise independent selection.
The second part implemented on the intelligent processing unit is periodic Kmean computations to identify similarities among unrelated data blocks. Such similarity detection can be simplified as a problem of finding k centers in a set of points. The remaining points may be partitioned into k clusters so that the total within-cluster sum of squares (TWCSS) is minimized according to known TWCSS calculation algorithms. Multiple threads may be able to calculate the TWCSS for all possible partitioning solutions at the same time. The results may be synchronized at the end of the execution, and the resulting clustering identifies similarities among unrelated data blocks. In an experimental prototype implementation, Kmean computation was invoked periodically to identify reference blocks to be stored in the cache.
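By way of illustration only, the TWCSS objective mentioned above may be evaluated for a candidate partition as sketched below. This is not the prototype's implementation; the function names `centroid` and `twcss` and the sample points are assumptions chosen for the example. Each candidate partition can be scored independently, which is what makes the evaluation amenable to multiple threads.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Return the coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def twcss(clusters: List[List[Point]]) -> float:
    """Total within-cluster sum of squares for one candidate partition."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        for p in cluster:
            # Squared Euclidean distance from the point to its cluster center.
            total += sum((x - m) ** 2 for x, m in zip(p, center))
    return total


# Two hypothetical partitions of the same four points into k = 2 clusters.
tight = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
loose = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

# The partition with the smaller TWCSS groups the more similar points.
best = min((tight, loose), key=twcss)
```

In this sketch the partition that groups nearby points yields the smaller TWCSS and would be the one retained.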
The third part implemented on the intelligent processing unit is delta compression and decompression. In some embodiments a ZDelta compression algorithm or LZO compression algorithm may be used. However, optimization of the delta codec is within the scope of content locality based caching and may benefit from fine tuning.
In order to see whether embodiments may be practically feasible and provide anticipated performance benefits, an experimental proof-of-concept prototype was developed using an open source kernel virtual machine (KVM). The prototype represents a partial realization, using a software module, of content locality based caching. The system is referred to as I-CASH (I-CASH is short for Intelligently Coupled Array of SSD and HDD).
The functions that the prototype has implemented include identifying reference blocks in a virtual machine environment and using Kmean similarity detections periodically, deriving deltas using ZDelta algorithm for write I/Os, serving read I/Os by combining deltas with reference blocks, and managing interactions between SSD and HDD. The current prototype carries out computations using the host CPU and uses a part of system RAM as the DRAM buffer of the I-CASH. A GPU was not used for computation tasks in the prototype. It is believed that the performance evaluation using this preliminary prototype thereby presents a conservative result.
In order to capture both block level I/O request information and virtual machine related information, the prototype module may be implemented in the virtual machine monitor. The I/O function of the KVM depends on QEMU that is able to emulate many virtual devices including virtual disk drive. The QEMU driver in a guest virtual machine captures disk I/O requests and passes them to the KVM kernel module. The KVM kernel module then forwards the requests to QEMU application and returns the results to the virtual machine after the requests are complete. The I/O requests captured by the QEMU driver are block-level requests of the guest virtual machine. Each of these requests contains the virtual disk address and data length. The corresponding virtual machine information may be maintained in the QEMU application part. The embodiment of the prototype may be implemented at the QEMU application level and may therefore be able to catch not only the virtual disk address and the length of an I/O request but also the information of which virtual machine generates this request. The most significant byte of the 64-bit virtual disk address may be used as the identifier of the virtual machine so that the requests from different virtual machines can be managed in one queue. If two virtual machines are built based on the same OS and application, two I/O requests may be candidates for similarity detection if the lower 56 bits of their addresses are identical.
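The address tagging described above may be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`tag_address`, `vm_of`, `offset_of`, `similarity_candidates`) are hypothetical and do not correspond to QEMU or KVM interfaces; only the 8-bit/56-bit split is taken from the description.

```python
VM_SHIFT = 56                        # upper 8 bits of the 64-bit address hold the VM id
OFFSET_MASK = (1 << VM_SHIFT) - 1    # lower 56 bits hold the virtual disk address


def tag_address(vm_id: int, disk_addr: int) -> int:
    """Combine a VM identifier and a virtual disk address into one 64-bit key."""
    if not 0 <= vm_id <= 0xFF:
        raise ValueError("vm_id must fit in one byte")
    if not 0 <= disk_addr <= OFFSET_MASK:
        raise ValueError("disk_addr must fit in 56 bits")
    return (vm_id << VM_SHIFT) | disk_addr


def vm_of(tagged: int) -> int:
    """Extract the VM identifier (most significant byte)."""
    return tagged >> VM_SHIFT


def offset_of(tagged: int) -> int:
    """Extract the virtual disk address (lower 56 bits)."""
    return tagged & OFFSET_MASK


def similarity_candidates(a: int, b: int) -> bool:
    """Two requests are candidates when their lower 56 address bits are identical."""
    return offset_of(a) == offset_of(b)


# Requests from two hypothetical virtual machines to the same virtual disk address.
req_vm1 = tag_address(1, 0x1000)
req_vm2 = tag_address(2, 0x1000)
```

Because the identifier occupies only the top byte, requests from all virtual machines can share one queue keyed by the tagged address.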
The prototype software module maintains a queue of disk blocks that can be one of three types: reference blocks, delta blocks, and independent blocks. It dynamically manages these three types of data blocks stored in the SSD and HDD. When a block is selected as a reference, its data may be stored in the SSD and later changes to this block may be redirected to the delta storage consisting of the DRAM buffer and the HDD. In the current implementation, the DRAM is part of the system RAM with size being 32 MB. An independent block has no reference and contains data that can be stored either in the SSD or in the delta storage. To make an embodiment work more effectively, a threshold may be chosen for delta blocks such that delta derivation is not performed if the delta size exceeds the threshold value and hence the data is stored as independent block. The threshold length of delta determines the number of similar blocks that can be detected during similarity detection phase. Increasing the threshold may increase the number of detected similar blocks but may also result in large deltas limiting the number of deltas that can be compacted in a delta block. Based on experimental observations, 768 bytes are used as the threshold for the delta length in the prototype.
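The threshold decision described above may be sketched as follows. The prototype derives deltas with the ZDelta algorithm; the sketch below substitutes a simple stand-in codec (XOR against the reference followed by zlib compression) purely for illustration, and the function names and sample blocks are assumptions. Only the 768-byte threshold is taken from the description.

```python
import random
import zlib

DELTA_THRESHOLD = 768  # bytes, the threshold reported for the prototype


def derive_delta(block: bytes, reference: bytes) -> bytes:
    """Stand-in delta codec (not ZDelta): XOR with the reference, then compress."""
    if len(block) != len(reference):
        raise ValueError("block and reference must have the same size")
    xored = bytes(a ^ b for a, b in zip(block, reference))
    return zlib.compress(xored)


def apply_delta(delta: bytes, reference: bytes) -> bytes:
    """Recover the original block from a delta and its reference block."""
    xored = zlib.decompress(delta)
    return bytes(a ^ b for a, b in zip(xored, reference))


def classify(block: bytes, reference: bytes):
    """Return ("delta", delta) if the delta is small enough, else ("independent", block)."""
    delta = derive_delta(block, reference)
    if len(delta) > DELTA_THRESHOLD:
        return "independent", block
    return "delta", delta


# Hypothetical 4 KB blocks: one nearly identical to the reference, one unrelated.
reference = bytes(range(256)) * 16
similar = bytearray(reference)
similar[100:104] = b"\xff" * 4
similar = bytes(similar)
rng = random.Random(0)
unrelated = bytes(rng.getrandbits(8) for _ in range(4096))

similar_kind, similar_payload = classify(similar, reference)
unrelated_kind, _ = classify(unrelated, reference)
```

A block that differs from the reference in only a few bytes yields a small delta and is kept as a delta, whereas an unrelated block exceeds the threshold and is stored as an independent block.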
Similarity detection to identify reference blocks is done in two separate cases in the prototype implementation. The first case is when a block is first loaded into an embodiment's queue and the embodiment searches for the same virtual address among the existing blocks in the queue. The second case is periodic scanning after every 20,000 I/Os. At each scanning phase, the embodiment first builds a similarity matrix to describe the similarities between block pairs. The similarity matrix is processed by the Kmean algorithm to find a set of minimal deltas that are less than the threshold. One block of each such pair is selected as a reference block. The association between newly found reference blocks and their respective delta blocks is reorganized at the end of each scanning phase.
A prototype may be installed on a KVM of a Linux operating system running on a PC server that is a Dell PowerEdge T410 with 1.8 GHz Xeon CPU, 2 GB RAM, and 160 GB SATA drive. This PC server acted as the primary server. An SSD drive (OCZ Z-Drive p84 PCI-Express 250 GB) was installed on the primary server. Another PC server, the secondary server, was a Dell Precision 690 with 1.6 GHz Xeon CPU, 2 GB RAM and 400 GB Seagate SATA drive. The secondary server was used as the workload generator for some of the benchmarks. The two servers were interconnected using a gigabit Ethernet switch. The operating system on both the primary server and the secondary server was Ubuntu 8.10. Multiple virtual machines using the same OS were built to execute a variety of benchmarks.
For performance comparison purposes, a baseline system was also installed on the primary PC server. One difference between the baseline system and a system implementing a content locality cache is the way the SSD and HDD are managed. In the baseline system, the SSD is used as an LRU disk cache on top of the HDD. In the present prototype, on the other hand, the SSD stores reference data blocks and the HDD stores deltas as described previously.
Appropriate workloads may be important for performance evaluations. It should be noted that evaluating the performance of embodiments is unique in the sense that I/O address traces are not sufficient because deltas are content-dependent. That is, the workload should have data contents in addition to address traces. Because of this uniqueness, none of the available I/O traces is applicable to the performance evaluations. Therefore, seven standard I/O benchmarks that are available to the research community have been collected as shown in Table 1.
The first benchmark, RUBiS, is a prototype that simulates an e-commerce server performing auction operations such as selling, browsing, and bidding similar to eBay. To run this benchmark, each virtual machine on the server has Apache, MySQL, PHP, and the RUBiS client installed. The database is initialized using the sample database provided by RUBiS. Five virtual machines are generated to run RUBiS using the default settings of 240 clients and 15 minutes running time.
TPC-C is a benchmark modeling operations of real-time transactions. It simulates the execution of a set of distributed and on-line transactions (OLTP) on a number of warehouses. These transactions perform the basic database operations such as inserts, deletes, updates and so on. Five virtual machines are created to run TPCC-UVA implementation on the Postgres database with 2 warehouses, 10 clients, and 60 minutes running time.
In addition to RUBiS and TPC-C, five data intensive SPEC benchmarks developed by the Standard Performance Evaluation Corporation (SPEC) have also been set up. SPECMail measures the ability of a system to act as an enterprise mail server using the Internet standard protocols SMTP and IMAP4. It uses folders and message MIME structures that include both traditional office documents and a variety of rich media contents for multiple users. Postfix was installed as the SMTP service, Dovecot as the IMAP service, and SPECmail2009 on 5 virtual machines. SPECmail2009 is configured to use 20 clients and 15 minutes running time. SPECweb2009 provides the capability of measuring both SSL and non-SSL request/response performance of a web server. Three different workloads are designed to better characterize the breadth of web server workload. The SPECwebBank is developed based on the real data collected from online banking web servers. In an experiment, one workload generator emulates the arrivals and activities of 20 clients to each virtual web server under test. Each virtual server is installed with Apache and PHP support. The secondary PC server works as a backend application and database server to communicate with each virtual server on the primary PC server. The SPECwebEcommerce simulates a web server that sells computer systems allowing end users to search, browse, customize, and purchase computer products. The SPECwebSupport simulates the workload of a vendor's support web site. Users are able to search for products, browse available products, filter a list of available downloads based upon certain criteria, and download files. Twenty clients are set up to test each virtual server for both SPECwebEcommerce and SPECwebSupport for 15 minutes. The last SPEC benchmark, SPECsfs, is used to evaluate the performance of an NFS or CIFS file server. Typical file server workloads such as LOOKUP, READ, WRITE, CREATE, and REMOVE are simulated.
The benchmark results summarize the server's capability in terms of the number of operations that can be processed per second and the I/O response time. Five virtual machines are set up and each virtual NFS server exports a directory to 10 clients to be tested for 10 minutes.
Using the preliminary prototype and the experimental settings, a set of experiments have been carried out running the benchmarks to measure the I/O performance of embodiments as compared to a baseline system. The first experiment is to evaluate speedups of embodiments compared to the baseline system. For this purpose, all the benchmarks were executed on both embodiments and on the baseline system.
While I/O performance generally increases with the increase of SSD cache size for the baseline system, the performance change of the tested embodiment depends on many other factors in addition to SSD size. For example, even though there is a large SSD to hold more reference blocks, the actual performance of the tested embodiment may fluctuate slightly depending on whether or not the system is able to derive a large number of small deltas to pair with those reference blocks in the SSD, which is largely workload dependent. Nevertheless, the tested embodiment performs consistently better than the baseline system with performance improvement ranging from 50% to a factor of 4 as shown in
The speedups shown in
To isolate the effect of computation times, the total number of HDD operations of the tested embodiment and that of the baseline system were measured. The I/O reductions of the tested embodiment were then calculated as compared to the baseline by dividing the number of HDD operations of the baseline system by the number of HDD operations of the tested embodiment.
From
Because of time constraints, benchmark running time was limited in the experiments. A repetitive access pattern might have emerged after a sufficiently long running time, since such behavior is observed in real-world I/O traces such as SPC-1.
A data storage architecture has been presented that exploits two emerging semiconductor technologies, flash memory SSD and multi-core GPU/CPU. In some embodiments, the intelligent processing unit may include one or more custom ASICs, firmware, other custom hardware, or custom software modules such as device drivers. The disk I/O architecture may include intelligently coupling an array of SSDs and HDDs such that read I/Os are done mostly in SSD and write I/Os to SSD are minimized and done in batches by packing deltas derived with respect to the reference blocks.
By making use of the computing performance of modern GPUs/CPUs and exploiting regularity and content locality of I/O data blocks, some embodiments replace mechanical operations in HDDs with high speed computations. A preliminary prototype realizing partial functionality of the methods and systems described herein has been built on Linux OS to provide a proof-of-concept. Performance evaluation experiments using standard I/O intensive benchmarks have shown great performance potential with up to 4 times performance improvement over systems that use SSD as a storage cache. It is expected that embodiments may dramatically improve data storage performance with fine-tuned implementations and greatly prolong the life time of SSDs that are otherwise wearing quickly with random write operations.
Furthermore, the content locality cache may exploit ever increasing content locality found in a variety of primary storage systems to minimize disk I/O operations that are still a significant bottleneck in computer system performance. A new cache replacement algorithm called Least Popularly Used (LPU) may dynamically identify the reference blocks that may not only have the most access frequency and recency but also may contain information that may be shared or resembled by other blocks being accessed. The LPU algorithms may also leverage methods and systems of caching reference blocks and small deltas to effectively service most disk I/O operations by combining a reference block with a corresponding delta inside the cache as opposed to going to the slow primary storage (e.g. a hard disk). The cache replacement algorithm (LPU) may also be based on a statistical analysis of frequency spectrum of both I/O addresses (e.g. LBAs) and I/O content. Applying an LPU algorithm may also increase a hit ratio of CPU-direct buffer caches greatly for a given cache size through application of content locality considerations in the buffer cache management algorithm. Therefore, embodiments of an LPU algorithm may significantly improve diverse primary storage architectures (RAID, SAN, virtualized storage, and the like) by combining LPU techniques with the various RAM/SSD/HDD cache embodiments described herein. In addition, applying aspects of LPU algorithms to buffer cache management may significantly improve hit ratios without changing or expanding buffer cache memory or hardware.
In order to allow any of the caches described herein and elsewhere to take advantage of data access frequency, recency, and information content characteristics, the systems and methods may determine and track both access behavior and content signatures of data blocks being cached. For example, each cache block may be divided into S logical sub-blocks. A sub-signature may be calculated for each of the S sub-blocks. A two dimensional array of sub-signature related data, sometimes referred to herein as a HeatMap, may be maintained in embodiments of an LPU algorithm. The HeatMap may enable determining popularity of the cached data based on aspects of locality (e.g. content locality, temporal locality, spatial locality, and the like).
An alternate embodiment of HeatMap 1358 may be organized as a two dimensional array that has columns that correspond to the number of possible signature values and rows that correspond to a number of times that each possible signature value has been accessed during a predetermined period of time.
To illustrate how HeatMap 1358 can be organized and maintained as I/O requests are issued, consider an example where each cache block is divided into two sub-blocks and each sub-signature has only four possible values, i.e. Vs=4. The HeatMap of this example is shown in Table 2 below for a sequence of I/O requests accessing data blocks at addresses LBA1, LBA2, LBA3, and LBA4, respectively. In this example, all of the possible contents of sub-blocks are depicted as A, B, C, and D and the corresponding signature for each sub-block is a, b, c, and d respectively. A two dimensional embodiment of HeatMap 1358 in this case contains two rows corresponding to two sub-blocks of each data block and four columns corresponding to the four possible signature values. As shown in Table 2, all entries of HeatMap 1358 are initialized to {(0, 0, 0, 0), (0, 0, 0, 0)}. Whenever a data block is accessed, the popularities of corresponding sub-signatures in HeatMap 1358 are incremented. For instance, the first block has logical block address (LBA) of LBA1 with content (A, B) and corresponding signatures (a, b) for two sub-blocks. As a result of the I/O request, two popularity values in HeatMap 1358 are incremented corresponding to the two sub-signatures, and HeatMap 1358 becomes {(1, 0, 0, 0), (0, 1, 0, 0)} as shown in Table 2. After 4 requests of various data blocks, HeatMap 1358 becomes {(2, 1, 1, 0), (0, 1, 0, 3)} based on the accumulation of sub-signature occurrences.
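The HeatMap update in this example may be sketched as follows. The contents of LBA1 (A, B) and LBA3 (A, D) are stated in the description; the contents assumed here for LBA2 and LBA4 are hypothetical, chosen only so that the accumulated values match the stated final HeatMap. The function names are likewise assumptions.

```python
SIGNATURES = "abcd"   # Vs = 4 possible sub-signature values
SUB_BLOCKS = 2        # each cache block is divided into two sub-blocks


def new_heatmap():
    """One row per sub-block position, one column per possible signature value."""
    return [[0] * len(SIGNATURES) for _ in range(SUB_BLOCKS)]


def record_access(heatmap, sub_signatures):
    """Increment the popularity of each sub-signature of the accessed block."""
    for row, sig in enumerate(sub_signatures):
        heatmap[row][SIGNATURES.index(sig)] += 1


# Request sequence; the LBA2 and LBA4 contents are assumed for illustration.
requests = [("LBA1", "ab"), ("LBA2", "cd"), ("LBA3", "ad"), ("LBA4", "bd")]

heatmap = new_heatmap()
record_access(heatmap, requests[0][1])
after_first = [row[:] for row in heatmap]   # snapshot after the LBA1 request

for _, sigs in requests[1:]:
    record_access(heatmap, sigs)
```

After the first request the sketch holds {(1, 0, 0, 0), (0, 1, 0, 0)}, and after all four requests it holds {(2, 1, 1, 0), (0, 1, 0, 3)}, in agreement with the values given above.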
The computation overhead to generate and maintain HeatMap 1358 may be substantially reduced over other data similarity counting techniques. Also, although hashing may be a computation efficient technique to detect identical blocks, it may also lower the chance of finding a similarity because a single byte change results in a totally different hash value. Therefore, hashing by itself may not help in finding more similarities. On the other hand, an LPU algorithm may calculate the secure hash value (e.g. SHA-1) of a data block to determine if a block is identical to another.
In an alternate example of a two-dimensional HeatMap 1358, taking a set of 4 KB blocks divided into 512B sub-blocks with 8 bits sub-signature for each sub-block, HeatMap 1358 with 8 rows corresponding to 8 sub-blocks (8=4K/512) and 256 columns corresponding to all of the possible 8-bit signatures for a sub-block can be used. When a block is read or written, its 8 one-byte sub-signatures may be retrieved and the 8 values of corresponding entries in HeatMap 1358 (also referred to herein as popularity values) may be increased by one. Use of these frequency spectrum aspects of content may differentiate the LPU algorithms from conventional caching algorithms. As noted above, embodiments of the LPU algorithm may capture both temporal locality and content locality of data being accessed by a host processor. If a block of the same address is accessed twice, the increase of a corresponding popularity value in HeatMap 1358 may reflect temporal locality. On the other hand, if two similar blocks with different addresses are each accessed once, HeatMap 1358 can identify the content locality of these two blocks. For example, the popularity values of matching sub-signatures in the two blocks may be incremented in HeatMap 1358. In this way, popularity may be determined based on frequency and recency of a signature associated with active I/O operations. In an example, if a signature is shared by many active I/O blocks, then the signature is popular. In some embodiments, block popularity may be based on block and sub-block signature popularity. A block that contains many popular signatures may be classified as reference block and therefore may be cached and used with the various delta generation and caching techniques described herein. Because many other active I/O blocks may share content with this reference block, the net result is a higher cache hit ratio and more efficient delta compression with respect to many other associated blocks that share such popular sub-signatures.
In some embodiments, to capture the dynamic nature of content locality at runtime, the LPU algorithms may enable scanning cached blocks after a programmable number of I/O requests. This number of I/O requests may define a scanning window. At the end of each scanning window, the LPU algorithm may examine the popularity values in HeatMap 1358 and choose the most popular blocks as reference blocks. An objective of selecting a reference block is to identify a cached data block that may contain the most frequently accessed sub-blocks so that many frequently accessed blocks share content with it. The reference block may be selected such that the number of remaining blocks that have small differences (deltas) from the reference block may be maximized. In this way, more I/O requests may be served by combining the reference block with small deltas. Once HeatMap 1358 has been examined at the end of the scanning window, the HeatMap values may be reset to enable variations of popularity over time to influence the LPU algorithm and determination of reference blocks in the cache.
Table 3 illustrates an example calculation of popularity values and cache space consumption using different choices of a reference block for the example of Table 2. The popularity value of a data block may be the sum of all its sub-block popularity values in HeatMap 1358. As shown in Table 3 below, the most popular block is the data block at address LBA3 with content (A, D). Its popularity value is 5. Therefore, block (A, D) may be chosen as the reference block. Once the reference block is selected, the LPU algorithm uses delta-coding to eliminate data redundancy. The result shows that using the most popular block (A, D) as the reference, cache space usage is minimal: about 2.5 cache blocks assuming near-perfect delta encoding. In contrast, without considering content locality, a conventional Least Recently Used caching algorithm would need 4 cache blocks to keep the same hit ratio. The space saved by applying an LPU algorithm may be used to cache even more data.
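The popularity and cache space calculation may be sketched as follows, starting from the final HeatMap values {(2, 1, 1, 0), (0, 1, 0, 3)} of the preceding example. The contents assumed for LBA2 and LBA4 are hypothetical, and the cost model (one cache block for the reference plus the fraction of differing sub-blocks for every other block) is an assumption standing in for near-perfect delta encoding.

```python
SIGNATURES = "abcd"
HEATMAP = [[2, 1, 1, 0], [0, 1, 0, 3]]   # final HeatMap of the preceding example

# LBA1 and LBA3 follow the description; LBA2 and LBA4 are assumed contents.
blocks = {"LBA1": "ab", "LBA2": "cd", "LBA3": "ad", "LBA4": "bd"}


def popularity(sub_signatures: str) -> int:
    """Sum of the HeatMap popularity values of a block's sub-signatures."""
    return sum(HEATMAP[row][SIGNATURES.index(sig)]
               for row, sig in enumerate(sub_signatures))


def cache_cost(reference_lba: str) -> float:
    """Cache blocks needed if the given block is used as the reference."""
    ref = blocks[reference_lba]
    cost = 1.0                                   # the reference itself
    for lba, sigs in blocks.items():
        if lba == reference_lba:
            continue
        differing = sum(1 for x, y in zip(sigs, ref) if x != y)
        cost += differing / len(ref)             # delta size as a block fraction
    return cost


most_popular = max(blocks, key=lambda lba: popularity(blocks[lba]))
cheapest = min(blocks, key=cache_cost)
```

Under these assumptions the block at LBA3 has the highest popularity (5) and also yields the smallest cache footprint (2.5 cache blocks), consistent with the figures stated above.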
HeatMap 1358 supports cache management of the content locality based cache. In some embodiments, as described above, HeatMap 1358 can store a frequency and recency of fingerprints that are read and written during I/O operations. If a fingerprint is touched frequently and recently during I/O operations, the content represented by the fingerprint may be considered to be popular. The content locality based cache can determine content locality based on identifying content considered to be popular. If the sketch of a data block contains mostly popular fingerprints, the data block may be considered to be popular. The popularity value of data blocks may be used in the cache algorithm. To quantify the popularity of data blocks, HeatMap 1358 can track a popularity value for each fingerprint. For example, with a fingerprint of 8 bits, there are 256 = 2^8 possible fingerprint values. Accordingly, HeatMap 1358 illustrates an example 8×256 table for 8 fingerprints per sketch. When the content locality based cache processes a received I/O operation, the sketch or the 8 fingerprints of the block may be used to update HeatMap 1358. For example, the 8 fingerprints may be processed using an 8-to-256 decoder to increment the popularity value of the corresponding table entry. As time passes, the higher the popularity value, the hotter the corresponding data content may be considered to be. The hotter the corresponding data content, the more the corresponding data content should stay in the cache to increase a chance of a cache hit. Eventually, the popularity value may reach a maximum that can be represented by the length of each entry. In some embodiments, at that time or after each scanning cycle, all entries in the HeatMap can be decremented by a fixed value to preserve relative popularities among the entries. In further embodiments, HeatMap 1358 may also be reset to all 0's upon the start of a new application program or completion of one application.
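An 8×256 HeatMap with saturation handling may be sketched as follows. The entry width (8-bit counters), the decrement value, and the clamping of entries at zero are assumptions made for the example; the description specifies only that entries are decremented by a fixed value once a maximum is reached.

```python
ROWS, COLS = 8, 256     # 8 fingerprints per sketch, 2^8 possible values each
MAX_POPULARITY = 255    # assumption: each entry is an 8-bit counter
DECREMENT = 128         # hypothetical fixed decrement applied on saturation


class HeatMap:
    def __init__(self):
        self.table = [[0] * COLS for _ in range(ROWS)]

    def update(self, sketch: bytes) -> None:
        """Increment the entry selected by each fingerprint of the sketch."""
        if len(sketch) != ROWS:
            raise ValueError("a sketch must contain exactly 8 fingerprints")
        saturated = False
        for row, fingerprint in enumerate(sketch):
            if self.table[row][fingerprint] < MAX_POPULARITY:
                self.table[row][fingerprint] += 1
            if self.table[row][fingerprint] == MAX_POPULARITY:
                saturated = True
        if saturated:
            self.decrement_all()

    def decrement_all(self) -> None:
        """Lower every entry by a fixed value, clamping at zero (an assumption)."""
        for row in self.table:
            for col, value in enumerate(row):
                row[col] = max(0, value - DECREMENT)

    def block_popularity(self, sketch: bytes) -> int:
        """Popularity of a block: the sum of its fingerprints' entries."""
        return sum(self.table[row][fp] for row, fp in enumerate(sketch))


sketch = bytes([1, 2, 3, 4, 5, 6, 7, 8])   # hypothetical sketch of one block
heatmap = HeatMap()
for _ in range(3):
    heatmap.update(sketch)
```

Entries above the decrement value keep their relative order after a decrement, while entries clamped to zero lose it; a smaller decrement preserves more of the ordering at the cost of more frequent decrements.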
A virtual block list (VBL) may be used with the LPU algorithm for read and for write requests. Generally upon either a read or write request, the LBA is looked up in the VBL. If the LBA is found, then the type of block is determined from metadata in the corresponding VBL entry. Subsequent actions are generally based on the type of block and the type of request (read or write).
For a read operation, the following actions may be available:
For a write operation, the following actions may be available
In some embodiments, a delta that may be stored in a delta page may be derived at run time representing the difference between the data page of an active I/O operation and its corresponding reference page stored in RAM or SSD 304 (shown in
In further embodiments, a component of the DRIPStore design may be to identify reference pages. To identify reference pages quickly, some embodiments may further divide reference pages into at least two different categories: (1) reference pages that may have exactly the same LBAs as deltas, and (2) data blocks that may be newly generated and may have LBAs that do not match a current reference page stored in SSD 304. The first reference page category may contain reference pages that may have exactly the same LBAs as deltas. An example of a reference page in this first category is a data block that has been modified since it was designated as a reference block; therefore while the reference block may still be useful to the caching system, the physical data to be stored in primary storage requires this reference page to be combined with a delta page. The second category may consist of data blocks that may be newly generated and may have LBAs that do not match any one of the reference pages stored in SSD 304.
To facilitate similarity detection of blocks and/or reference blocks, for each data block, the DRIPStore process described herein may compute block sub-signatures. Generally, a one byte or a few bytes signature may be computed from several sequential bytes of data in data block 408 (shown in
An exemplary implementation of DRIPStore may compute 1-byte sub-signatures of every 3 consecutive bytes in a data block, i.e. k=3. The DRIPStore process may then select the 8 most frequent sub-signatures for signature matching, i.e. f=8. In an example, for a 4 KB block, the DRIPStore process may first calculate the hash values of all 3 consecutive bytes to obtain 4K−2 sub-signatures. If the number of matches between a block and the reference exceeds 6, this block may be associated with the reference. Based on experimental observations, this sub-signature with position mechanism may recognize not only shifting of content but also shuffling of contents.
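The sub-signature computation may be sketched as follows, using k=3, f=8, and the match threshold of 6 from the description. The 1-byte hash function is hypothetical, since the description does not specify one, and the sketch compares signature values only, omitting the position information of the sub-signature with position mechanism.

```python
from collections import Counter

K = 3                 # consecutive bytes hashed per sub-signature
F = 8                 # number of most frequent sub-signatures kept
MATCH_THRESHOLD = 6   # a block is associated when matches exceed this value


def sub_signature(window: bytes) -> int:
    """Hypothetical 1-byte hash of K consecutive bytes."""
    h = 0
    for b in window:
        h = (h * 31 + b) & 0xFF
    return h


def all_sub_signatures(block: bytes):
    """One sub-signature per sliding window: len(block) - K + 1 values."""
    return [sub_signature(block[i:i + K]) for i in range(len(block) - K + 1)]


def representative_signatures(block: bytes):
    """The F most frequent sub-signatures (ties broken by signature value)."""
    counts = Counter(all_sub_signatures(block))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {sig for sig, _ in ranked[:F]}


def is_similar(block: bytes, reference: bytes) -> bool:
    """Associate the block with the reference if more than 6 signatures match."""
    matches = len(representative_signatures(block)
                  & representative_signatures(reference))
    return matches > MATCH_THRESHOLD


reference = bytes(range(256)) * 16   # hypothetical 4 KB reference block
```

For a 4 KB block the sliding window produces 4096 - 2 = 4094 sub-signatures, matching the 4K−2 figure given above.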
The data blocks to be examined for similarity detection may be determined based on performance and overhead considerations. Content locality may exist in a storage system both statically and dynamically. Accordingly, in some embodiments data redundancy may be identified in one of two ways: (1) periodic scanning, and (2) identifying similar blocks online based on cache contents. First, a scanning thread may be used to scan the storage device periodically. A static scan may be easy to implement since data may be fixed and the scan may achieve a good compression ratio by searching for the best reference blocks. However, a static scan may read data from different storage devices and the similar blocks found may not necessarily have tight correlation other than content similarity. The DRIPStore algorithm described herein may take a second approach which may identify similar blocks online from the data blocks already loaded in a cache. For a write I/O, a corresponding reference block for delta compression may be found. If the write I/O were a new write with no prior reference block, a new reference block may be identified for that write I/O. For a read I/O, as soon as the delta corresponding to the read I/O may be loaded, a reference block may be found to decompress to the original data block.
In some embodiments, CIP-List 3200 may be a linked list that may contain metadata associated with cached pages such as pointers and LBAs. Typically, each node in the list may need tens of bytes, resulting in less than 1% space overhead for page size of 4 KB. In addition to a head pointer 3210 and a tail pointer 3212 of the linked list, the CIP adds an SSD pointer 3214 to point at the top of the SSD sub-list 3204 and the candidate pointer 3216 to point at the top of candidate sub-list 3208, respectively.
There may be three types of replacements in the CIP algorithm. A first replacement may include replacing a page from RAM sub-list 3202 to SSD sub-list 3204. A second replacement may include replacing a page from SSD sub-list 3204 to HDD 308. A third replacement may include replacing a candidate page from candidate sub-list 3208 to HDD 308. These replacements may happen at or near the bottom of each sub-list, similar to the LRU list. That is, the higher position a page is in CIP-List 3200, the more important the page may be and the less likely that it may be replaced. The CIP algorithm may conservatively insert a missed page at the lower part of CIP-List 3200 and may let the missed page move up gradually as re-references to the page occur. This may facilitate managing a multi-level cache that may consider recency, frequency, inter-reference interval times, and bulk replacements in SSD 304.
In embodiments, page reference recency information may be used for managing the cache for many different workloads. This may be why an LRU algorithm has been popular and used in many cache designs. The CIP algorithm may maintain the advantages of LRU design by implementing candidate sub-list 3208, RAM sub-list, or SSD sub-list as an LRU list. Candidate sub-list 3208 may contain pages that may be brought into RAM upon misses or it may contain only metadata of pages that have been missed once or only a few times even though the data is not yet cached. Upon a miss, the metadata of the missed page may be inserted at or near the top of candidate sub-list 3208 and may be given an opportunity to show its importance by staying in the candidate list until the LCth miss before it may be replaced. If it gets re-referenced during this time, it may be promoted to the top or at least near the top of RAM sub-list 3202. Pages at the bottom of the RAM sub-list are accumulated to form a batch to be written to SSD 304, at which time their metadata is placed in SSD sub-list 3204. The number of re-references, maximum time required between re-references, and other aspects that may impact a decision to promote a page within CIP-list 3200 may be tunable. In this way a page may get promoted if it is re-referenced only twice within a predetermined period of time or it may require several re-references within an alternate predetermined period of time to be tagged for promotion. A promotion algorithm may also depend on block size versus I/O access size so that even when an 8K block is accessed twice due to the I/O access size being 4K, a 4K page stored in the candidate sub-list may not be promoted upon the second access to the candidate block to retrieve the second 4K page of the 8K block. Since SSD 304 favors batch writes, the SSD write may be delayed until B such pages have been accumulated on top of SSD sub-list 3204.
During this waiting period, if the page is re-referenced again, it may be promoted to RAM sub-list 3202 because the small inter-reference interval time of this page shows its importance and indicates that it should be cached in the RAM. Therefore, CIP-List 3200 may automatically maintain both recency and inter-reference recency information of cached pages, taking advantage of both LRU and LIRS cache replacement algorithms.
In some embodiments, to take into account reference frequency information in managing cache replacement, a new page to be cached in the RAM cache may be inserted at lower part (IR) 3218 of RAM sub-list 3202 and may get promoted one position up in the list upon each reference or upon a configurable number of references. Similarly, in SSD sub-list 3204, any reference (or configurable number of references) may promote the referenced page up by one position (or a configurable number of positions) in CIP-List 3200. As a result of such insertion and promotion policy, the relative position of a page in CIP-List 3200 may approximate the reference frequency of the page. Frequently referenced pages may be unlikely to be evicted from the cache because they may be high up in CIP-List 3200. For RAM sub-list 3202, IR 3218 may be a tunable parameter that may determine how long a newly inserted page may stay in the cache without being re-referenced. For example, if IR 3218 is at the top of CIP-List 3200, it is equivalent to LRU. If IR 3218 is at the bottom of CIP-List 3200, the page may be replaced upon next miss unless it is re-referenced before the next cache miss. Generally, IR 3218 may point at the lower half of RAM sub-list 3202 so that a new page may need to earn enough promotion credits (e.g. have a high reference frequency) to move to the top and yet it may be given enough opportunity to show its importance before it is evicted. For SSD sub-list 3204, insertion may always happen at the top of CIP-List 3200 where B pages may be accumulated to be written into SSD 304 in batches. Once the recently added B pages are written into SSD 304, their importance may depend on their reference frequency since each time a page is referenced its position in the CIP list may be promoted further up the list. The pages at the bottom of the list may not have been referenced for a very long time and hence may become candidates for replacement when SSD 304 is full. 
The CIP algorithm may try to replace these pages in batches to optimize SSD 304 performance.
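As a minimal sketch of the insertion and promotion policy for the RAM sub-list only, the following assumes a plain Python list ordered from top (index 0) to bottom. The class name `RamSubList` and its parameters are hypothetical; `capacity` plays the role of LR and `insert_index` the role of IR 3218, and the page returned on overflow is the one that would be demoted toward the SSD sub-list.

```python
class RamSubList:
    """Simplified RAM sub-list: conservative insertion, one-step promotion."""

    def __init__(self, capacity: int, insert_index: int):
        self.capacity = capacity          # LR: maximum number of pages
        self.insert_index = insert_index  # IR: insertion position from the top
        self.pages = []                   # index 0 is the top of the list

    def reference(self, page):
        """Record a reference; return a demoted page or None."""
        if page in self.pages:
            i = self.pages.index(page)
            if i > 0:
                # Promote the referenced page one position up.
                self.pages[i - 1], self.pages[i] = self.pages[i], self.pages[i - 1]
            return None
        demoted = None
        if len(self.pages) >= self.capacity:
            # The bottom page leaves RAM (it would go to the SSD sub-list).
            demoted = self.pages.pop()
        # Insert the new page at IR rather than at the top.
        self.pages.insert(min(self.insert_index, len(self.pages)), page)
        return demoted
```

With `insert_index=0` the structure inserts at the top as an LRU list does; larger values make a new page earn its way up through re-references before it is protected from demotion.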
In addition to being able to take into account recency, frequency, and inter-reference recency, the CIP algorithm may help avoid the impact of mass storage scans and other types of mass storage sweep accesses on cached data and may be able to automatically filter out large sequential accesses so that they may not be cached in SSD 304. This may be done by candidate sub-list 3208. Pages in a scan access sequence may not make it to the RAM sub-list or SSD sub-list 3204 if they are not re-referenced and therefore may be replaced from the candidate buffer before they can be cached in the RAM or SSD 304. Pages belonging to large sequential scan accesses may be detected by comparing the LBA of a node in the candidate list with the LBAs of current/subsequent I/Os and using a threshold counter. In embodiments, for cache hits, the algorithm may work in the following manner. If the referenced page, p, is in RAM sub-list 3202 of the CIP-List 3200, p may be promoted by one position up if it is not already at the top of CIP-List 3200. Upon a read reference to page p that may be in SSD sub-list 3204 of CIP-List 3200, p may be promoted by one position up if it is not already among the top B+1 pages in SSD sub-list 3204. If p is one of the top B+1 pages in SSD sub-list 3204, p may be inserted at the IR position of RAM sub-list 3202. Further, if the size of RAM sub-list 3202 is LR at the time of the insertion, the page at the bottom of RAM sub-list 3202 may be demoted to the top of SSD sub-list 3204 and its corresponding data page may be moved from the RAM cache to the block buffer to make room for the newly inserted page. The block counter in the SSD pointer may be incremented. If the counter reaches B, SSD_Write may be performed.
Upon a write reference to page p that is in SSD sub-list 3204 of CIP-List 3200, p may be removed from SSD sub-list 3204 and inserted at IR 3218 position of RAM sub-list 3202. If the size of RAM sub-list 3202 is LR at time of the insertion, the page at the bottom of RAM sub-list 3202 may be demoted to the top of SSD sub-list 3204 and its corresponding data page may be moved from the RAM cache to the block buffer to make room for the newly inserted page. The block counter in the SSD pointer may be incremented. If the counter reaches B, SSD_Write may be performed. In addition, if the referenced page, p, is in candidate sub-list 3208 of CIP-List 3200, p may be inserted at the top of SSD sub-list 3204 and the corresponding data page may be moved from the candidate buffer to the block buffer. The counter in the SSD pointer may be incremented. If the counter reaches B, SSD_Write may be performed.
In another embodiment, for cache misses, the algorithm may work in the following manner. If RAM cache is not full, the missed page p may be inserted at the top of RAM sub-list 3202 and the corresponding data page is cached in the RAM cache. If RAM cache is full, the missed page p may be inserted at the top of candidate sub-list 3208 and the corresponding data page may be buffered in the candidate buffer or not cached at all. If the candidate buffer is full, the bottom page in candidate sub-list 3208 may be replaced to make room for the new page.
An SSD_Write may proceed as follows. If SSD is full, i.e. SSD sub-list 3204 size equals LS, the CIP algorithm may destage the bottom B pages in SSD sub-list 3204 to HDD 308. Only dirty destaged pages need to be read from SSD 304 and written to HDD 308. Next, the CIP algorithm may perform SSD writes to move all dirty data pages in the block buffer to SSD 304 followed by clearing the block buffer and the block counter in the SSD pointer of the CIP-List.
Similarly, some embodiments may use a linked list or a simple table (i.e., array structure) for the candidate list. The table may be hashed by using LBAs. Each entry may keep a counter to count a number of cache misses that have occurred since the entry was added to the candidate list so that the corresponding data may be promoted to be cached once its counter exceeds a threshold. Exceeding such a threshold may indicate that data in the cache is stale and therefore performance may be improved by promoting candidate data to the cache to replace stale data. Each entry may also be configured with a timer that impacts a re-reference counter for the entry. The re-reference counter may be reset to 0 once the time interval, determined by the timer, between two consecutive accesses (successive re-references) to the same block exceeds a predetermined value. This interval between references may be calculated on each I/O access to the same block by subtracting the previously stored access time-of-day value in the corresponding table entry from the current I/O access time-of-day.
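The table-based candidate list may be sketched as below, under the assumption that a Python dictionary keyed by LBA stands in for the hashed table and that a single counter per entry serves as the re-reference counter. The names `CandidateTable`, `promote_threshold`, and `max_interval` are hypothetical.

```python
class CandidateTable:
    """Candidate list as a table keyed by LBA (a dict stands in for hashing)."""

    def __init__(self, promote_threshold: int, max_interval: float):
        self.promote_threshold = promote_threshold  # counter value to exceed
        self.max_interval = max_interval            # allowed gap between accesses
        self.entries = {}                           # lba -> [counter, last_access]

    def access(self, lba: int, now: float) -> bool:
        """Record an access at time `now`; return True when the block is promoted."""
        entry = self.entries.get(lba)
        if entry is None:
            self.entries[lba] = [1, now]
            return False
        counter, last_access = entry
        # Interval = current access time minus the stored access time.
        if now - last_access > self.max_interval:
            counter = 0  # too long between references: start over
        counter += 1
        if counter > self.promote_threshold:
            del self.entries[lba]  # the block leaves the candidate list
            return True
        self.entries[lba] = [counter, now]
        return False
```

Both the threshold and the interval are tunable in this sketch, which mirrors the tunable promotion conditions described above.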
Each sub-list of CIP-list 3200 may include some overlapping pages. In an example, some of the pages in the RAM-list may also exist in the SSD list because a page in the SSD may have been promoted to the RAM and the page in SSD may be unaffected until other pages are promoted to the SSD-sublist. This may not pose any significant problem because a RAM list may be checked for presence of a page before an SSD list is checked.
The methods for sub-signature related algorithm selection described herein may calculate a plurality of sub-signatures for each distinct sub-signature calculation algorithm (e.g. sub-sig N, sub-sig N+1, sub-sig N+2 and sub-sig N+M 3902) for a portion of data 3906 associated with application 3908. In an example, distinctly calculated sub-signatures may be sampled using at least two distinct sub-signature sampling algorithms 3910. Further, counts of reference blocks and associated blocks for each of the sampled sets of distinctly calculated sub-signatures may be determined and stored in the processor accessible memory 3912. For further facilitating similarity-based detection, counts of false positives for each of the sampled sets of distinctly calculated sub-signatures may be calculated and stored in the processor accessible memory 3912. The stored counts (reference and associated, and false positives) may be analyzed to select a distinct combination of a sub-signature calculation algorithm and a sampling algorithm. The selected combination produces at least one of the largest count of reference and associated blocks and the smallest count of false positives for performing similarity detection of data associated with the application.
The techniques described herein for efficient signature and sub-signature calculation, signature sampling methods, algorithm comparison and selection techniques, and the like may be employed in a variety of environments, including in various cache management methods and systems. Several such cache management methods and systems are described herein and may include content/spatial/temporal locality-based similarity detection and delta compression, conservative insertion and promotion of cachable data blocks, popularity-based techniques (e.g. Least Popularly Used), DRIPStore, HeatMap-based signature popularity techniques, data virtualization, and other similarity, compression, cache management, and SSD management techniques, methods, and systems as described herein. The techniques described herein for efficient signature and sub-signature calculation, signature sampling methods, algorithm comparison and selection techniques, and the like may replace or supplement similar techniques described herein as being used in various cache management-related embodiments.
Embodiments of methods and systems for fast, accurate similarity detection described herein, particularly as depicted in
Features of a similarity detection algorithm can include: (i) taking on the order of 10 microseconds; (ii) comprehensively detecting a high percentage of possible similar blocks; and (iii) generating a minimal number of false positive detections, because each false positive detection can waste computing resources and possibly delay I/O operations that the cache management techniques are designed to speed up.
Finding resemblance of two or more files/documents/data streams facilitates compressing the files, such as by using delta encoding. Similarity detection of two files/documents/data streams (herein “compression target”) may be done by representing each document using a set of shingles. Shingles may be derived by sliding a window of θ bytes (also referred to herein as a shingle size) from the beginning to the end of the compression target one byte at a time. If the compression target contains β bytes (e.g. 4 KB to 64 KB), the methods process a total of β−θ+1 shingles. The degree of similarity between the two compression targets may then be determined based on the number of shingles shared by the two compression targets.
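As a small illustration of shingling, the following sketch slides a window of θ bytes one byte at a time and counts the shingles shared by two compression targets. The function names are hypothetical, and comparing raw shingles (rather than fingerprints) is a simplification chosen for clarity.

```python
def shingle_windows(data: bytes, theta: int) -> list[bytes]:
    """All beta - theta + 1 shingles of `data`, in order."""
    return [data[i:i + theta] for i in range(len(data) - theta + 1)]


def shared_shingles(a: bytes, b: bytes, theta: int) -> int:
    """Number of distinct shingles that appear in both targets."""
    return len(set(shingle_windows(a, theta)) & set(shingle_windows(b, theta)))
```

For example, with θ=3 the 8-byte strings `b"abcdefgh"` and `b"xbcdefgy"` each yield 6 shingles and share 4 of them (`bcd`, `cde`, `def`, `efg`).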
Comparing all processed shingles of the two compression targets may result in accurate similarity detection. However, the computation cost for this comparison may also be high. Therefore, it may be important to determine how many shingles to compare, and how to select a subset of shingles to compare without loss of accuracy. This determination may be similar to a sampling problem, which may be addressed by the design and selection of efficient similarity detection algorithms as described herein.
An initial issue to address is the shingle size θ, the choice of which may be a trade-off between accuracy and efficiency. If θ is the size of a machine word, then similarity detection becomes a word-by-word comparison of the two compression targets, implying low efficiency. If θ is too large, on the other hand, it may be easy to miss many similar data blocks in the compression target that differ by small changes, such as a one-word insertion or a one-byte overwrite. A common range for θ may be tens of bytes to hundreds of bytes.
To increase storage and computation efficiency, a computed fingerprint (e.g., signature, hash, and the like) of each processed shingle may be compared instead of the shingle itself. Fingerprint generation may be designed so that the probability that two different shingles generate the same signature is extremely small, making signature collisions negligible in practice.
A similarity detection algorithm may be thought of as including a few steps such as: determining shingle size, calculating signatures of the shingles, selecting a sample of signatures (e.g. a sketch), and finally comparing the corresponding signatures of the two compression targets to determine the degree of similarity. A similarity detection algorithm described herein may be referred to as FASD, for fast/adaptive similarity detection. A key observation is that compression target data actively accessed by applications shows content locality (regularity and similar pattern) during a short time frame (typically daily or hourly). The FASD algorithm employs algorithm selection techniques to adapt to these active data patterns to provide highly efficient and accurate similarity detection. FASD facilitates selecting a best-fit shingling and signature computation algorithm and a best-fit sampling and finalization algorithm for signature candidates, to be used for similarity detection of at least the remaining portion of the compression target data.
Referring again to
Subroutine 1: Use a shingle size of 3 bytes to calculate β-2 1-byte signatures. Each signature may be an addition of 3 bytes. Leveraging the register structure of some common processors (e.g. based on x86 architecture), 128 byte additions can be processed in parallel so that all β-2 signatures can be done very quickly by parallel additions and register shifts.
Subroutine 2: Use a shingle size of 8 bytes to calculate β-7 1-byte signatures. Each signature may be a one-byte checksum of the corresponding 8 bytes. Making use of the hardware support in common processors for generating a CRC checksum, the checksums can be calculated very quickly. Notice that a CRC generating polynomial is not necessarily irreducible, because a CRC generating polynomial is usually required to have (x+1) as a factor in order to detect all errors affecting an odd number of bits.
Subroutine 3: Use a shingle size of 4, 8, or more bytes to calculate signatures of length 19 or 31 bits by doing mod operations using Mersenne primes as a modulus to calculate signatures with high speed and low collision probability. An example of subroutine 3 that assumes a shingle size of 8B, fingerprint length of 19 bits, and 4 KB block is now presented:
Choose a Mersenne prime, say 19 bits: $P = 2^{19} - 1 = \mathtt{0x7FFFF}$;
Calculate the remainder dividing the first 8B, A=[b1:b2:b3 . . . b8], of the data block by 0x7FFFF. To avoid division that would take over 40 cycles, subroutine 3 may perform addition instead. Subroutine 3 first partitions an 8B string (64 bits) into 19-bit pieces starting from the least significant bits resulting in [A1:A2:A3:A4], where A1 has only 7 bits.
$$A = A_1 \cdot 2^{57} + A_2 \cdot 2^{38} + A_3 \cdot 2^{19} + A_4$$
since $A_1 \cdot 2^{57} \bmod (2^{19}-1) = A_1$, $A_2 \cdot 2^{38} \bmod (2^{19}-1) = A_2$, and $A_3 \cdot 2^{19} \bmod (2^{19}-1) = A_3$; note that $2^{19i} \bmod (2^{19}-1) = 1$ always holds.
The result is the first signature
$$S_1 = A_1 + A_2 + A_3 + A_4$$
with the carry bit wrapped around and added to the LSB of the sum.
Suppose the 8B shingle (64 bits) is stored in two 32-bit data registers denoted DH and DL for the higher order word and lower order word, respectively. As a result, the computation of the above equation involves only shifts and additions, which are faster to execute on a processor than more complicated operations:
$$S_1 = (D_L \mathbin{\&} P) + (D_L \gg 19) + ((D_H \mathbin{\&} \mathtt{0x3F}) \ll 13) + ((D_H \gg 6) \mathbin{\&} P) + (D_H \gg 25) \qquad \text{Equation (1)}$$
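The shift-and-add computation of the first signature for an 8B shingle can be checked against ordinary division with a short sketch. It assumes the shingle is read most significant byte first, and it treats an all-ones sum as zero so that the result equals the conventional remainder; the names `fold` and `first_signature` are hypothetical.

```python
P = (1 << 19) - 1  # Mersenne prime 2^19 - 1 = 0x7FFFF


def fold(value: int) -> int:
    """Wrap carries above bit 18 back into the low 19 bits."""
    while value > P:
        value = (value & P) + (value >> 19)
    return 0 if value == P else value


def first_signature(shingle: bytes) -> int:
    """Signature of one 8-byte shingle using only masks, shifts and adds."""
    a = int.from_bytes(shingle, "big")      # assumption: b1 is most significant
    dh, dl = a >> 32, a & 0xFFFFFFFF        # high and low 32-bit words
    total = ((dl & P) + (dl >> 19) + ((dh & 0x3F) << 13)
             + ((dh >> 6) & P) + (dh >> 25))
    return fold(total)
```

The terms `(dl >> 19)` and `((dh & 0x3F) << 13)` together rebuild the 19-bit piece that straddles the two registers, which is why five terms appear for four pieces.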
For the remaining P-6 signatures, subroutine 3 may include:
Equation (2) may require 3 shifts, 2 XOR, and 1 addition operations irrespective of the length of shingle size.
If the shingle size is 4B and fingerprint length is 19 bits, a similar procedure is described below.
Choose a Mersenne prime of 19 bits: $P = 2^{19} - 1 = \mathtt{0x7FFFF}$;
Calculate the remainder dividing the first 4B, A=[b1:b2:b3:b4], of the data block by 0x7FFFF. The system partitions the 4B string (32 bits) into a lower 19-bit string and a remaining high order 13-bit string denoted by [A1:A2], where A1 has 13 bits and A2 has 19 bits.
$$A = A_1 \cdot 2^{19} + A_2$$
This calculation provides a first signature
$$S_1 = A_1 + A_2$$
with the carry bit wrapped around and added to the least significant bit of the sum.
Note that $A_1 = A \gg 19$, i.e., a logical shift to the right by 19 bits, and $A_2 = A \mathbin{\&} P$.
Therefore, the computation of A1+A2 involves only shifts and additions and may be given by:
$$S_1 = (A \gg 19) + (A \mathbin{\&} P), \text{ with the carry bit wrapped around.} \qquad \text{Equation (3)}$$
For the remaining 4K−2 signatures, the system may perform the same computation for each 4B word:
for a shingle size of 4B and fingerprint size of 19 bits.
In general, if the shingle size is small relative to the exponent of the Mersenne prime, the method can carry out the computation for each shingle using Equations (3) and (4). If the shingle size is large, e.g., larger than 8B, the system can calculate the first signature and then recursively calculate the remaining signatures. Let the shingle size be θ bytes (θ>8B) and the signature size be μ bits (the length of the Mersenne prime). The system may calculate the first signature as follows:
Partition the first θ bytes of a data block into $\lceil 8\theta/\mu \rceil$ segments from the LSB to MSB; the last segment, containing the MSB, may have fewer than μ bits (this computation can be done using mask and shift operations);
Add all $\lceil 8\theta/\mu \rceil$ segments with carry bits wrapped around and added to the LSB;
The sum may be the first signature.
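The partition-and-add procedure for an arbitrary shingle size may be sketched as follows. The sketch assumes the shingle is interpreted most significant byte first and that an all-ones sum is reported as zero; `mersenne_signature` is a hypothetical name.

```python
def mersenne_signature(shingle: bytes, mu: int) -> int:
    """First signature of a shingle modulo the Mersenne number 2^mu - 1.

    The shingle is cut into mu-bit segments from the LSB upward, the
    segments are added, and carries are wrapped back into the low bits.
    """
    p = (1 << mu) - 1
    value = int.from_bytes(shingle, "big")
    segments = -(-8 * len(shingle) // mu)   # ceiling of 8*theta / mu
    total = 0
    for _ in range(segments):
        total += value & p                  # mask out the next segment
        value >>= mu                        # shift to the following segment
    while total > p:                        # wrap the carry bits around
        total = (total & p) + (total >> mu)
    return 0 if total == p else total
```

For a 50-byte shingle and μ=31, the loop processes 13 segments, the last of which holds the remaining 28 high-order bits.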
Once the first signature has been calculated, the system may compute the remaining signatures as follows:
Subroutine 4: Generate a random irreducible polynomial for each shingle. This generation may be done in the following manner:
Denoting the byte string by b1, b2, b3, . . . bn and taking the shingle size to be 8, the signature of the first shingle may be derived as:
$$S_1 = (b_1 p^7 + b_2 p^6 + b_3 p^5 + b_4 p^4 + b_5 p^3 + b_6 p^2 + b_7 p + b_8) \bmod M,$$
or, equivalently, in nested form:
$$S_1 = (p(\cdots(p(p\,b_1 + b_2) + b_3)\cdots) + b_8) \bmod M.$$
The 2nd and the rest of the signatures may be calculated using the previously calculated signature as follows:
$$S_{i+1} = (p(S_i - b_i p^7) + b_{i+8}) \bmod M, \quad \text{for } i = 1, 2, \ldots, \beta - 8.$$
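The recursive computation can be sketched as a rolling polynomial hash in which each new signature removes the contribution of the byte leaving the window and appends the byte that enters it. The function name and the particular values of p and M used in any call are assumptions for illustration.

```python
def rolling_signatures(block: bytes, p: int, m: int, size: int = 8) -> list[int]:
    """Signatures of every `size`-byte shingle, each derived from the previous one."""
    if len(block) < size:
        return []
    s = 0
    for byte in block[:size]:
        s = (s * p + byte) % m            # nested (Horner) form of the first signature
    top = pow(p, size - 1, m)             # p^(size-1) mod m, weight of the oldest byte
    signatures = [s]
    for i in range(len(block) - size):
        # Drop block[i], shift the window by one byte, append block[i + size].
        s = (p * (s - block[i] * top) + block[i + size]) % m
        signatures.append(s)
    return signatures
```

Each update costs a constant number of operations regardless of the shingle size, and a block of β bytes yields β−7 signatures when the shingle size is 8.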
Subroutine 5: Using a shingle size of 8 to 128 bytes to calculate Rabin fingerprints of length 16 or 32 recursively, making use of previously computed fingerprints. For illustrative purposes, assume a shingle size of 8B, fingerprint length of 32 bits, and 4 KB block. For other parameters, the algorithm may be generalized.
Choose an irreducible polynomial of degree 32, g(x);
Calculate the remainder dividing the first 8B, [b1:b2:b3 . . . b8], of the data block by g(x);
$$S_1 = [b_1 : b_2 : b_3 \ldots b_8] \bmod g(x)$$
S1 may be determined using a slicing-by-8 method or any other method for 32-bit CRC computation on 8B. Note that the speed of computing this first CRC is not significant, since the first CRC may be computed only once per block and may represent a small fraction of the total computation of all 4K−7 fingerprints.
The remaining 4K−6 signatures may be given by
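where $R_{Sb1}$, $R_{Sb2}$, $R_{Sb3}$, $R_{Sb4}$ represent remainders of each of the four bytes in $b_i 2^{56} \oplus S_i$ divided by g(x), and may be given respectively by
$$R_{Sb1} = \left(2^{32} \cdot \text{1st byte of } (b_i 2^{56} \oplus S_i)\right) \bmod g(x),$$
$$R_{Sb2} = \left(2^{24} \cdot \text{2nd byte of } (b_i 2^{56} \oplus S_i)\right) \bmod g(x),$$
$$R_{Sb3} = \left(2^{16} \cdot \text{3rd byte of } (b_i 2^{56} \oplus S_i)\right) \bmod g(x),$$
$$R_{Sb4} = \left(2^{8} \cdot \text{4th byte of } (b_i 2^{56} \oplus S_i)\right) \bmod g(x).$$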
In some embodiments, Equation (7) uses five XOR operations and five table lookups, irrespective of the length of shingle size. The five tables store the remainder divided by g(x) of a byte shifted to the left by 7 bytes, 4 bytes, 3 bytes, 2 bytes, and 1 byte, respectively.
If the fingerprint length is 16 bits or 2 bytes, then the system may use three table lookups and three XOR operations for each signature, because both $b_i 2^{56}$ and $S_i$ are two bytes long. Equation (7) may thereby become:
$$S_{i+1} = R_{Sb1} + R_{Sb2} + b_{i+8}$$
Referring again to
$S_f = S_\sigma \mathbin{\&} \mathtt{0x7F};$
The frequency based sampling techniques discussed above have the advantages of catching signatures that identify the most frequently accessed segments in the I/O path and therefore help LPU cache design (LPU denotes Least Popularly Used data replacement cache algorithm and is described herein). However, for some data sets, random sampling may give better performance.
Referring again to
After the random sampling of step B.1., in operation B.2. the sampling builds a histogram of the Ω signatures. The sampling then selects the eight most frequent signatures. These eight signatures may be (μ-Y) bits each. The sampling then selects one byte among the (μ-Y) bits or does mod $2^7-1$ operations to obtain the final eight 1B signatures.
In another sampling operation B.3., on each 4 KB data block, the sampling may calculate only thirty-two signatures, each of which is thirty-one bits resulting from the modulo operation on the 31-bit Mersenne prime. Among the thirty-two signatures, the first four may be calculated on the four shingles at the middle of the first 512B of the 4 KB data block, the second four may be calculated at the middle of the second 512B, and so on, giving rise to 32 signatures total because there are eight 512B subblocks in a 4 KB data block. For example, the sampling may start at byte location 256 with shingle size 50B to calculate the first signature based on Mersenne primes. Then the sampling slides the shingle by 1 byte to calculate the second signature for byte 257 through byte 306, until four signatures are obtained. Then the sampling starts the 5th signature at byte location 768, and so on. After the sampling calculates the thirty-two signatures, the sampling performs either:
Frequency histogram to select the top eight most frequent signatures and reduce them from 32 bits to 8 bits by choosing the MSB or doing mod $2^7-1$ as follows. For each of the 8 signatures, Sσ, the sampling performs:
$S_f = S_\sigma \mathbin{\&} \mathtt{0x7F};$
loop:
Or
Heap sort the thirty-two signatures to select eight signatures that have the least signature values. Then, the sampling may use the same algorithm above to reduce signatures from thirty-two bits to eight bits.
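The sector-aware sampling of operation B.3. with the heap-based selection may be sketched as below. Python's `%` operator stands in for the shift-and-add Mersenne reduction, the final reduction to one byte is taken as mod 2^7−1, and all function names and default parameters are assumptions drawn from the example values above.

```python
import heapq

P31 = (1 << 31) - 1   # 31-bit Mersenne prime
SECTOR = 512          # basic I/O unit in bytes


def sample_signatures(block: bytes, start: int = 256, shingle: int = 50,
                      per_sector: int = 4) -> list[int]:
    """`per_sector` signatures from the middle of every 512B sector."""
    signatures = []
    for base in range(0, len(block), SECTOR):
        for k in range(per_sector):
            offset = base + start + k       # slide the shingle by one byte
            window = block[offset:offset + shingle]
            if len(window) == shingle:
                signatures.append(int.from_bytes(window, "big") % P31)
    return signatures


def block_sketch(block: bytes, count: int = 8) -> list[int]:
    """The `count` smallest signatures, each reduced mod 2^7 - 1."""
    smallest = heapq.nsmallest(count, sample_signatures(block))
    return [s % 127 for s in smallest]
```

For a 4 KB block this computes thirty-two signatures instead of one per byte position, with the fifth signature starting at byte location 768 as in the example above.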
Since the basic data unit in I/O operations is a sector or 512B, the sampling techniques are aware of this fact. This is the rationale behind subroutine B.3. above. The generalized algorithm for subroutine B.3. is given below.
Inputs: A data block of β bytes (4K to 64K in our case)
Outputs: Eight (or any chosen number, NoSig) 1B signatures (or a few bytes, SigL) as a sketch of the block for similarity comparison purposes
Parameters (tunable): Shingle size: θ; Number of shingles sampled per sector: ω; Starting offset in sector i for signature computation/sampling: ψn for n=0, 1, . . . , N, where N is the total number of signatures computed in a program run; A Mersenne Prime: P.
Procedures:
ψ0=64;
1) Calculate the first signature starting at byte ψn+512*j as follows:
Partition the θ bytes of the shingle into $\lceil 8\theta/\mu \rceil$ segments from the LSB to MSB; the last segment, containing the MSB, may have fewer than μ bits; this computation can be done using mask and shift operations as exemplified by Equation (1);
Add all $\lceil 8\theta/\mu \rceil$ segments with carry bits wrapped around and added to the LSB;
2) For i=1 to ω−1 do
3) For all signatures, do heap sort and select the least eight (or NoSig) signatures (occurrence frequency may be considered while sorting);
Reduce each of the eight signatures, Sσ, from μ bits to eight (or SigL) bits according to:
$S_f = S_\sigma \mathbin{\&} \mathtt{0x7F};$
loop:
Referring again to
Starting with an initial signature match threshold, for example three out of eight matching signatures, if at least three of the subset of sampled signatures match between two blocks of data, the two blocks are identified as similar. However, if a configurable number of false positive detections are found, an automated signature match threshold configuration facility may increase this signature match threshold.
Likewise, if a number of associated/reference blocks generated using the similarity detection techniques described herein is lower than a predetermined number, the automated signature match threshold configuration facility may decrease the signature match threshold. After a few iterations (e.g. two or more), an optimal threshold value may be determined.
This process may be done on each scanning cycle.
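One iteration of the automated threshold configuration may be sketched as follows. The function name, the step size of one, and the bounds of one to eight matching signatures are assumptions made for illustration.

```python
def adjust_threshold(threshold: int, false_positives: int, associated_blocks: int,
                     max_false_positives: int, min_associated: int,
                     lowest: int = 1, highest: int = 8) -> int:
    """Return the signature match threshold for the next scanning cycle."""
    if false_positives >= max_false_positives:
        # Too many false positives: require more matching signatures.
        return min(threshold + 1, highest)
    if associated_blocks < min_associated:
        # Too few associated/reference blocks: relax the requirement.
        return max(threshold - 1, lowest)
    return threshold
```

Calling the function once per scanning cycle with the counts observed in that cycle lets the threshold settle after a few iterations.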
Method 4810 can receive a block for caching (step 4812). The system can divide the block into subsets, or “shingles” (step 4814). For example, the size of the received block can be 4 KB, and the corresponding size of the shingle can be 8 bytes. (Accordingly, for an example block of size 4 KB, there can be 4K-7 shingles corresponding to various subsets of the block.)
For each shingle (step 4816), method 4810 can determine, using a fingerprint circuit, an intermediate fingerprint by processing the shingle (step 4818). In some embodiments, determining the intermediate fingerprint can include computing a hash value for the shingle. In some embodiments, the fingerprint circuits, also referred to herein as signature computation circuits, can process shingles in parallel using multiple fingerprint circuits. The parallel processing can determine multiple fingerprints of multiple shingles concurrently, faster than using a fingerprint circuit for serial or sequential processing. The intermediate fingerprint can be used as a “temporary” fingerprint that represents a current representative fingerprint for a single shingle. In some embodiments, determining the intermediate fingerprint can use Mersenne primes, Rabin fingerprinting, random irreducible polynomials, or other methods that result in a smaller sub-signature than the received shingle. The Mersenne primes, Rabin fingerprinting, and random irreducible polynomials can generally represent content of a shingle. In some embodiments, if the content locality cache uses eight-way parallel fingerprint circuits, the system can generate eight fingerprints using different terms for each fingerprint circuit. For example, if the parallel fingerprint circuits use Rabin fingerprinting, each fingerprint circuit can use different polynomials for the Rabin fingerprinting. If the parallel fingerprint circuits use random irreducible polynomials, each fingerprint can use a different prime modulo for the random irreducible polynomial. A smaller sub-signature can be computationally easier to process, while still representing the contents of the block for use in detecting similarity with reference blocks.
Method 4810 can determine whether the intermediate fingerprint is more representative of the overall contents of the block than a previous fingerprint (step 4820). In some embodiments, determining whether the intermediate fingerprint is more representative can use min wise independent permutations locality sensitive hashing by selecting a minimal fingerprint for the shingles processed by the fingerprint circuit. In other embodiments determining whether the intermediate fingerprint is more representative can select a maximal fingerprint for the shingles by retaining high-order bits of the intermediate fingerprint and discarding low-order bits. Because the fingerprint circuits can process more than one shingle, the previous fingerprint stored in the fingerprint buffer can be a representative fingerprint for all shingles processed so far by the fingerprint circuit. Some shingles can be expected to result in intermediate fingerprints that are relatively higher or lower. In some embodiments, determining whether the intermediate fingerprint is more representative can include selecting an intermediate or previous fingerprint that is maximal (or minimal) for all shingles processed by the fingerprint circuit. Selecting a maximal or minimal fingerprint can generally result in a better and faster measure of the content of the received block by sampling shingles. Selecting a maximal or minimal fingerprint can allow the system to determine similarity of data blocks by performing fast set union and set intersection operations on the minimal or maximal fingerprints. Further description of the min wise independent selection can be found in Andrei Z. Broder, “On the resemblance and containment of documents,” Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, Jun. 11-13, 1997, IEEE, pp. 21-29, the entire contents of which are incorporated by reference herein.
If the intermediate fingerprint is determined to be more representative of the contents of the block (step 4820: Yes), the intermediate fingerprint can be stored in the fingerprint buffer as the representative fingerprint (step 4822). For example, determining the intermediate fingerprint to be more or less representative of the contents of the block can include determining whether the intermediate fingerprint is greater than or less than the previous fingerprint, depending on whether a maximal or minimal fingerprint is used for sampling. If the intermediate fingerprint is determined to be more representative, the intermediate fingerprint can therefore replace the previous fingerprint that was initially stored in the fingerprint buffer. If the intermediate fingerprint is determined to be less representative of the contents of the block (step 4820: No), method 4810 can keep the previous fingerprint in the fingerprint buffer as the representative fingerprint for the shingles that have been sampled by the fingerprint circuit (step 4824). If there are more shingles to process (step 4826: Yes), method 4810 returns to process a subsequent shingle. If there are no more shingles to process (step 4826: No), the system can use the representative fingerprints stored in the fingerprint buffers as the representative fingerprints for the received block (step 4828).
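The loop of method 4810 with minimal-fingerprint selection may be sketched as follows. The sketch assumes eight parallel fingerprint circuits, each modeled by a hypothetical coefficient pair (a, b) applied as (a·x + b) mod P to a base hash of the shingle; this linear form and the coefficient values are illustrative choices, not the circuits' actual terms.

```python
P = (1 << 19) - 1  # Mersenne prime used for the base hash

# Hypothetical per-circuit coefficients (a, b); eight circuits in parallel.
COEFFICIENTS = [(3, 7), (5, 11), (17, 19), (29, 23),
                (31, 37), (41, 43), (53, 47), (61, 59)]


def representative_fingerprints(block: bytes, coefficients=COEFFICIENTS,
                                shingle_size: int = 8) -> list[int]:
    """Keep, per circuit, the minimal intermediate fingerprint over all shingles."""
    buffers = [None] * len(coefficients)      # one fingerprint buffer per circuit
    for i in range(len(block) - shingle_size + 1):
        base = int.from_bytes(block[i:i + shingle_size], "big") % P
        for c, (a, b) in enumerate(coefficients):
            intermediate = (a * base + b) % P
            # Replace the stored fingerprint only when the new one is smaller.
            if buffers[c] is None or intermediate < buffers[c]:
                buffers[c] = intermediate
    return buffers


def matching_count(first: list[int], second: list[int]) -> int:
    """Number of circuits whose representative fingerprints agree."""
    return sum(1 for x, y in zip(first, second) if x == y)
```

Two blocks whose shingle sets overlap heavily tend to agree in many buffer positions, so `matching_count` can serve as a quick similarity estimate.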
Fingerprint circuits 1340a, 1340d can perform intermediate computations to determine the intermediate fingerprints.
The fingerprint circuit can receive a shingle for processing (step 4922). The fingerprint circuit can process multiple shingles of a data block in succession to compute a signature. Furthermore, in some embodiments multiple fingerprint circuits can be arranged in parallel to compute multiple corresponding signatures in parallel for a data block.
Determining a first intermediate fingerprint by processing the received shingle based on linear additions and bit-shifting (step 4924) can include dividing the received shingle into subfields and performing addition among the subfields. For example, the fingerprint circuit can divide the received shingle into four subfields and use fast adders to sum the four subfields, which computes the modulo operation corresponding to the Mersenne prime. In some embodiments, the fingerprint circuit can use a first stage of adders to add two groups of subfields, followed by a second stage of adders to add the two groups. If an example of a received shingle is 64 bits and an example Mersenne prime of $2^{19}-1$ is used, an example of the first intermediate fingerprint can be 19 bits after processing using the two stages of adders. The intermediate fingerprint can be bit-shifted by a coefficient Ai to apply a random permutation. Using min wise independent selection, the random permutation can generally provide an improved representation of the contents of the shingle and data block being analyzed. In some embodiments, if the fingerprint circuit is repeated in parallel, a different coefficient can be chosen for each i′th fingerprint circuit.
Determining a second intermediate fingerprint by processing the first intermediate fingerprint based on linear additions with a random constant (step 4926) can include using an adder to add a random coefficient Bi. Using min wise independent selection, the random constant can also generally provide an improved representation of the contents of the shingle and data block being analyzed. In some embodiments, if the fingerprint circuit is repeated in parallel, a different coefficient can be chosen for each i′th fingerprint circuit. In some embodiments, determining the second intermediate fingerprint can result in a 16-bit intermediate fingerprint for comparison with a previous 16-bit fingerprint in the fingerprint buffer.
Method 4920 can determine whether the intermediate fingerprint is more representative of the overall contents of the block than a previous fingerprint (step 4820). Because the fingerprint circuits can process more than one shingle, the previous fingerprint stored in the fingerprint buffer can be a representative fingerprint for all shingles processed so far by the fingerprint circuit. Some shingles can be expected to result in intermediate fingerprints that are relatively higher or lower. In some embodiments, determining whether the intermediate fingerprint is more representative can include selecting an intermediate or previous fingerprint that is maximal (or minimal) for all shingles processed by the fingerprint circuit. Selecting a maximal or minimal fingerprint can generally result in a better measure of the content of the received block by sampling shingles.
If the second intermediate fingerprint is determined to be more representative of the contents of the block (step 4820: Yes), the second intermediate fingerprint can be stored in the fingerprint buffer as the representative fingerprint (step 4822). For example, determining the intermediate fingerprint to be more or less representative of the contents of the block can include determining whether the intermediate fingerprint is greater than or less than the previous fingerprint, depending on whether a maximal or minimal fingerprint is used for sampling. If the intermediate fingerprint is determined to be more representative, the intermediate fingerprint can therefore replace the previous fingerprint that was initially stored in the fingerprint buffer. If the intermediate fingerprint is determined to be less representative of the contents of the block (step 4820: No), method 4920 can keep the previous fingerprint in the fingerprint buffer as the representative fingerprint for the shingles that have been sampled by the fingerprint circuit (step 4824).
Fingerprint circuit 1340a can receive shingle 4806a as input. For example, shingle 4806a can be a 64-bit shingle, or any other size shingle that represents a subset or window of a data block. Fingerprint circuit 1340a can divide shingle 4806a into subfields 4902a-4902d. Fingerprint circuit 1340a can perform addition among subfields 4902a-4902d using adders 4904a-4904c to compute intermediate fingerprint 4906. Adders 4904a-4904c allow fingerprint circuit 1340a to compute quickly a modulo corresponding to a Mersenne prime, without needing to use slower division circuits to compute the modulo.
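As a non-limiting illustration, the adder-only modulo computation can be sketched as follows. The sketch assumes 19-bit subfields and the example Mersenne prime 2^19−1; the subfield width and the names `mod_mersenne` and `fingerprint_64` are assumptions made for illustration, not details taken from the circuit. It relies on the fact that 2^19 is congruent to 1 modulo 2^19−1, so the subfields can simply be added.

```python
MERSENNE_EXPONENT = 19
MERSENNE_PRIME = (1 << MERSENNE_EXPONENT) - 1  # 2^19 - 1 = 524287


def mod_mersenne(value: int) -> int:
    """Reduce a non-negative integer modulo 2^19 - 1 using masks, shifts and adds."""
    while value > MERSENNE_PRIME:
        # 2^19 is congruent to 1, so the high part folds onto the low part.
        value = (value & MERSENNE_PRIME) + (value >> MERSENNE_EXPONENT)
    return 0 if value == MERSENNE_PRIME else value


def fingerprint_64(shingle: int) -> int:
    """Split a 64-bit shingle into four subfields and add them in two stages."""
    s0 = shingle & MERSENNE_PRIME
    s1 = (shingle >> 19) & MERSENNE_PRIME
    s2 = (shingle >> 38) & MERSENNE_PRIME
    s3 = (shingle >> 57) & MERSENNE_PRIME  # only 7 bits remain in this subfield
    first_stage_a = s0 + s1  # first stage of adders, first group
    first_stage_b = s2 + s3  # first stage of adders, second group
    return mod_mersenne(first_stage_a + first_stage_b)  # second stage and final fold
```

No division is performed anywhere in the sketch, which mirrors the motivation for choosing a Mersenne prime as the modulus.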
To obtain a maximal (or minimal) fingerprint in each parallel fingerprint circuit, each time an intermediate fingerprint is calculated, comparator 1340b can compare the intermediate fingerprint with fingerprint 4916 previously stored in fingerprint buffer 1340c. If intermediate fingerprint 4910 is smaller than buffered fingerprint 4916, the signature computation can replace the fingerprint in fingerprint buffer 1340c using newly computed fingerprint 4910. When all shingles in a 4 KB block have been parsed by the parallel fingerprint computation circuits, the maximal (or minimal) fingerprint can be stored in fingerprint buffer 1340c, as desired. Accordingly, parallel fingerprint circuits can produce maximal (or minimal) fingerprints selected from different random permutations. These fingerprints can comprise a sketch representing the 4 KB data block to be stored in the tag array associated with the data block.
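As a non-limiting illustration, parallel circuits that each apply a different linear transform and keep their own minimum can be sketched as follows. The names `linear_transform` and `build_sketch` are hypothetical, and truncating the shifted value to 16 bits is an assumption; the description states only that the intermediate fingerprint is shifted by Ai and that Bi is added.

```python
from typing import Iterable, Sequence, Tuple


def linear_transform(fingerprint: int, a: int, b: int, bits: int = 16) -> int:
    """Shift by coefficient a, add constant b, and keep the low `bits` bits."""
    return ((fingerprint << a) + b) & ((1 << bits) - 1)


def build_sketch(
    intermediate_fingerprints: Sequence[int],
    coefficients: Iterable[Tuple[int, int]],
    bits: int = 16,
) -> Tuple[int, ...]:
    """Return one minimal fingerprint per (a, b) pair, one pair per parallel circuit."""
    sketch = []
    for a, b in coefficients:
        minimum = min(
            linear_transform(fp, a, b, bits) for fp in intermediate_fingerprints
        )
        sketch.append(minimum)  # final contents of that circuit's fingerprint buffer
    return tuple(sketch)
```

For example, `build_sketch([1, 2, 3], [(0, 0), (1, 5)])` evaluates the three fingerprints under two transforms and keeps the smallest result of each.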
The fingerprint circuit can receive a shingle for processing (step 5022). The fingerprint circuit can generally process multiple shingles of a data block in succession to compute a signature. Furthermore, in some embodiments multiple fingerprint circuits can be arranged in parallel to compute multiple corresponding signatures in parallel for a data block.
Determining a first intermediate fingerprint by processing the received shingle based on Rabin fingerprinting and bit-shifting (step 5004) can include applying a polynomial to the received shingle. The polynomial can include terms or coefficients P1, P2, . . . , Pr−1 to process the received shingle. The polynomial can represent a random irreducible polynomial of the same size as a desired intermediate fingerprint to compute the Rabin fingerprint. Rabin fingerprinting can provide a number of advantages. There can be a lower chance of collisions or conflicts, in which multiple shingles of a given length result in the same hash value even if the multiple shingles represent different contents of data blocks. Additionally, in hardware Rabin fingerprinting can be implemented using shifters and logic gates such as XOR gates, which are relatively fast. Furthermore, when computed over successive shingles, Rabin fingerprinting can leverage previous computations to speed computation of the current intermediate fingerprint. If an example of a received shingle is 64-bits, an example of the first intermediate fingerprint can be 16 bits after processing. If the intermediate fingerprint is desired to be 16 bits, then an example polynomial can be chosen for r=16. The intermediate fingerprint can be bit-shifted by a coefficient Ai to apply a random permutation. In some embodiments, if the fingerprint circuit is repeated in parallel, a different coefficient can be chosen for each i′th fingerprint circuit and different polynomials can be used for each i′th fingerprint circuit.
Determining a second intermediate fingerprint by processing the first intermediate fingerprint based on linear additions with a random constant (step 4926) can include using an adder to add a random coefficient Bi. In some embodiments, if the fingerprint circuit is repeated in parallel, a different coefficient can be chosen for each i′th fingerprint circuit. In some embodiments, determining the second intermediate fingerprint can result in a 16-bit intermediate fingerprint for comparison with a previous 16-bit fingerprint in the fingerprint buffer.
Method 5000 can determine whether the intermediate fingerprint is more representative of the overall contents of the block than a previous fingerprint (step 4820). Because the fingerprint circuits can process more than one shingle, the previous fingerprint stored in the fingerprint buffer can be a representative fingerprint for all shingles processed so far by the fingerprint circuit. Some shingles can be expected to result in intermediate fingerprints that are relatively higher or lower. In some embodiments, determining whether the intermediate fingerprint is more representative can include selecting an intermediate or previous fingerprint that is maximal (or minimal) for all shingles processed by the fingerprint circuit. Selecting a maximal or minimal fingerprint can generally result in a better measure of the content of the received block by sampling shingles.
If the second intermediate fingerprint is determined to be more representative of the contents of the block (step 4820: Yes), the second intermediate fingerprint can be stored in the fingerprint buffer as the representative fingerprint (step 4822). For example, determining the intermediate fingerprint to be more or less representative of the contents of the block can include determining whether the intermediate fingerprint is greater than or less than the previous fingerprint, depending on whether a maximal or minimal fingerprint is used for sampling. If the intermediate fingerprint is determined to be more representative, the intermediate fingerprint can therefore replace the previous fingerprint that was initially stored in the fingerprint buffer. If the intermediate fingerprint is determined to be less representative of the contents of the block (step 4820: No), method 5000 can keep the previous fingerprint in the fingerprint buffer as the representative fingerprint for the shingles that have been sampled by the fingerprint circuit (step 4824).
Using Rabin fingerprinting in some embodiments of fingerprint computation circuit 1340a can allow the content locality cache to determine fingerprints based on a property of the block or shingle contents. In general, fingerprint computation circuit 1340a can divide shingle 4806a by a random irreducible polynomial and select the remainder for further use in intermediate fingerprint 5030. As used in Rabin fingerprinting, a random irreducible polynomial can sometimes be referred to as a polynomial that is relatively prime to the input. For example, just as a prime number is not divisible by any other number, input data 5022 is not divisible by the random irreducible polynomial and the random irreducible polynomial is not divisible by input data 5022. Therefore, a remainder can be expected to be generated. Use of Rabin fingerprinting allows the remainder to be generated using a combination of shift registers 5024a-5024d and logic gates such as XOR gates, which perform relatively fast in hardware.
In some embodiments, in an example circuit where r=16, the polynomials can include any eight of the following primitive polynomials implemented as Rabin fingerprinting subcircuit 5038: 210013, 234313, 233303, 307107, 307527, 306357, 201735, 272201, 242413, 270155, 302157, 210205, 305667, 236107. Rabin fingerprinting subcircuit 5038 shows an example subcircuit generated based on a polynomial corresponding to 210013 where Pr−1 . . . P0=(010, 001, 000,000, 001, 011). In other words, the first number in the polynomial is 2 in decimal, which corresponds to 010 in binary for the value of Pr−1. The next number in the polynomial is 1 in decimal, corresponding to 001 in binary for the value of Pr−2, and so on with 0 in decimal=000 in binary, 0 in decimal=000 in binary, 1 in decimal=001 in binary, and 3 in decimal=011 in binary.
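As a non-limiting illustration, the digit-by-digit expansion described above can be sketched as follows. The sketch reads each digit of the listed polynomial as a three-bit group, which amounts to treating the listed values as octal; that reading, and the names `polynomial_bits` and `polynomial_terms`, are assumptions made for illustration.

```python
from typing import List


def polynomial_bits(digits: str) -> str:
    """Expand each digit into its three-bit binary group, most significant first."""
    return " ".join(format(int(digit, 8), "03b") for digit in digits)


def polynomial_terms(digits: str) -> List[int]:
    """Return the exponents whose coefficient bit is 1, highest exponent first."""
    value = int(digits, 8)
    return [
        exponent
        for exponent in range(value.bit_length() - 1, -1, -1)
        if (value >> exponent) & 1
    ]
```

Under this reading, `polynomial_bits("210013")` reproduces the groups listed above, and `polynomial_terms("210013")` gives the exponents 16, 12, 3, 1, and 0.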
Furthermore, using Rabin fingerprinting can provide efficient reuse of previous calculations. For example, as data block 5022 is being transferred over the I/O bus, Rabin fingerprinting subcircuit 5038 can shift shingles from high order bits to low order bits into shift registers 5024a-5024d. For example, data 5022 can be received most significant bit first. Accordingly, when data transfer on the I/O bus is complete, fingerprint calculations for intermediate fingerprint 5030 can also be expected to complete for the received data block. Some embodiments of fingerprint computation circuit 1340a can use XOR gates and flip flops. The selected hardware can speed the resulting fingerprint computation, compared with relatively slower software implementations of the processes described above. In some embodiments, labeled registers 5024a-5024d can represent single bit registers. For better randomness and independence, in some embodiments fingerprint computation circuit 1340a can use different coefficients 5026a-5026c for different parallel fingerprint circuits.
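As a non-limiting illustration, a bit-serial remainder computation that uses only shifts and XOR operations can be sketched as follows. The function name `rabin_fingerprint` is hypothetical, the input is assumed to arrive most significant bit first, and the polynomial constant assumes the value 210013 listed above is read as octal. The sketch computes the remainder of the input, viewed as a polynomial over GF(2), divided by the chosen polynomial.

```python
POLY = int("210013", 8)  # assumed octal reading; equals 0x1100B, a degree-16 polynomial


def rabin_fingerprint(data: bytes, poly: int = POLY) -> int:
    """Return data(x) mod poly(x) over GF(2), processing one bit per step."""
    degree = poly.bit_length() - 1
    top_bit = 1 << degree
    remainder = 0
    for byte in data:
        for bit_index in range(7, -1, -1):  # most significant bit first
            # Shift the next input bit into the register.
            remainder = (remainder << 1) | ((byte >> bit_index) & 1)
            # If the register reached the polynomial's degree, subtract (XOR) it.
            if remainder & top_bit:
                remainder ^= poly
    return remainder  # always fits in `degree` bits
```

Because each input bit is consumed as it arrives, the remainder is available as soon as the last bit has been shifted in, which corresponds to the fingerprint completing together with the data transfer.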
In some embodiments, fingerprint computation circuit 1340a can perform multiplication of a constant using left shift operations such as with shifters 5024a-5024d. This is because left shift operations can be comparatively faster than multiplication operations that can require multiple instructions to complete.
The fingerprint computation can proceed in a similar manner as described in connection with
Fingerprint circuit 1340a can proceed to perform random permutation and minimum-directed (or maximum-directed) sampling. For example, intermediate fingerprint 5030 can perform a random permutation by performing linear transforms based on coefficients Ai and Bi (5028a, 5028b). Specifically, fingerprint circuit 1340a can shift intermediate fingerprint 5030 based on coefficient Ai (5028a). Fingerprint circuit 1340a can use adder 5034 to add in a random term Bi to generate intermediate fingerprint 5036 after the random permutation. In some embodiments, terms 5028a-5028b in the linear transform formula can be chosen differently for corresponding parallel circuits.
To obtain a maximal (or minimal) fingerprint in each parallel fingerprint circuit, when an intermediate fingerprint 5036 is calculated, comparator 1340b can compare intermediate fingerprint 5036 with previous fingerprint 5032 stored in fingerprint buffer 1340c. If intermediate fingerprint 5036 is smaller than buffered fingerprint 5032, the signature computation can replace the fingerprint in fingerprint buffer 1340c using newly computed fingerprint 5036. When all shingles in a 4 KB block have been parsed by the parallel fingerprint computation circuits, the maximal (or minimal) fingerprint can be stored in fingerprint buffer 1340c, as desired. Accordingly, parallel fingerprint circuits can produce maximal (or minimal) fingerprints selected from different random permutations. These fingerprints can comprise a sketch representing the 4 KB data block to be stored in the tag array associated with the data block.
The fingerprint circuit can receive a shingle for processing (step 5022). The fingerprint circuit can generally process multiple shingles of a data block in succession to compute a signature. Furthermore, in some embodiments multiple fingerprint circuits can be arranged in parallel to compute multiple corresponding signatures in parallel for a data block.
Determining a first intermediate fingerprint by processing the received shingle based on Rabin fingerprinting and bit-shifting (step 5004) can include applying a polynomial to the received shingle. The polynomial can include terms or coefficients P1, P2, . . . , Pr−1 to process the received shingle. The polynomial can represent an irreducible polynomial of the same size as a desired intermediate fingerprint. If an example of a received shingle is 64-bits, an example of the first intermediate fingerprint can be 16 bits after processing. If the intermediate fingerprint is desired to be 16 bits, then an example polynomial can be chosen for r=16. In some embodiments, if the fingerprint circuit is repeated in parallel, different coefficients or terms can be chosen for P1, P2, . . . , Pr−1 in each fingerprint circuit.
Speeding the fingerprint processing by sampling a subset of bits from the first intermediate fingerprint (step 5102) can include using a bit mask to sample the subset of bits. An example of a subset of bits from the first intermediate fingerprint can be about 4 bits. If the sampled subset of bits is determined to differ from the bit mask pattern (step 5104: No), method 5100 can process the next shingle, so as to abort fingerprint processing for the current received shingle (step 4826: Yes). In this manner, embodiments of the sampling can speed the fingerprint processing by reducing the number of samples for the fingerprint circuit to process. In other words, some embodiments of the fingerprint circuit can process only fingerprints whose subset of bits matches the sample bit mask.
If the sampled subset of bits is determined to match the bit mask pattern (step 5104: Yes), method 5100 can determine a second intermediate fingerprint based on a remaining subset of bits from the first intermediate fingerprint (step 5108). If an example first intermediate fingerprint is about 16 bits, the sampling can leave a remaining subset of about 12 bits for further processing as the second intermediate fingerprint. Although intermediate fingerprint sizes of 16 bits and 12 bits are described herein, the intermediate fingerprint sizes can vary based on the size of a data block and the contents of the data block. In some embodiments, determining the second intermediate fingerprint can result in a 12-bit second intermediate fingerprint for comparison with a previous 12-bit second intermediate fingerprint in the fingerprint buffer.
Method 5100 can determine whether the second intermediate fingerprint is more representative of the overall contents of the block than a previous fingerprint (step 4820). Because the fingerprint circuits can process more than one shingle, the previous fingerprint stored in the fingerprint buffer can be a representative fingerprint for all shingles processed so far in sequence by the fingerprint circuit. Some shingles can be expected to result in intermediate fingerprints that are relatively higher or lower. In some embodiments, determining whether the intermediate fingerprint is more representative can include selecting an intermediate or previous fingerprint that is maximal (or minimal) for all shingles processed by the fingerprint circuit. Selecting a maximal or minimal fingerprint can generally result in a better measure of the content of the received block by sampling shingles.
If the second intermediate fingerprint is determined to be more representative of the contents of the block (step 4820: Yes), the second intermediate fingerprint can be stored in the fingerprint buffer as the representative fingerprint (step 4822). For example, determining the intermediate fingerprint to be more or less representative of the contents of the block can include determining whether the intermediate fingerprint is greater than or less than the previous fingerprint, depending on whether a maximal or minimal fingerprint is used for sampling. If the intermediate fingerprint is determined to be more representative, the intermediate fingerprint can therefore replace the previous fingerprint that was initially stored in the fingerprint buffer. If the intermediate fingerprint is determined to be less representative of the contents of the block (step 4820: No), method 5100 can keep the previous fingerprint in the fingerprint buffer as the representative fingerprint for the shingles that have been sampled by the fingerprint circuit (step 4824).
In some embodiments, fingerprint circuit 1340a can begin similarly as described in connection with
Rabin fingerprinting subcircuit 5038 can result in intermediate fingerprint 5030. Fingerprint circuit 1340a can use sample bitmask 5110a to mask off, or select, sample bits that match high order bits of intermediate fingerprint 5030. For example, bitmask 5110a can be four bits that match four high order bits of intermediate fingerprint 5030. If logic gate 5112 determines that the high order bits of intermediate fingerprint 5030 match the masked sample bit pattern, fingerprint circuit 1340a can select lower order bits of intermediate fingerprint 5030 as intermediate fingerprint 5114. For example, logic gate 5112 can be a logical AND gate that passes through the low order bits only if the high order bits match bitmask 5110a. In some embodiments, fingerprint circuit 1340a can select the lower order twelve bits of intermediate fingerprint 5030 to determine intermediate fingerprint 5114. If the higher order four bits of intermediate fingerprint 5030 do not match the sample bits encoded in bitmask 5110a, fingerprint circuit 1340a can drop the fingerprint.
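As a non-limiting illustration, the sampling step can be sketched as follows. The function name `sample_fingerprint` and its parameters are hypothetical; the sketch assumes a 16-bit intermediate fingerprint and a four-bit sample pattern, and returns `None` to represent a dropped fingerprint. The `from_high` flag selects whether the pattern is compared against the high order or the low order bits.

```python
from typing import Optional


def sample_fingerprint(
    fingerprint: int,
    pattern: int,
    fingerprint_bits: int = 16,
    sample_bits: int = 4,
    from_high: bool = True,
) -> Optional[int]:
    """Return the remaining bits if the sampled bits match `pattern`, else None."""
    fingerprint &= (1 << fingerprint_bits) - 1
    remaining_bits = fingerprint_bits - sample_bits
    if from_high:
        sampled = fingerprint >> remaining_bits
        remaining = fingerprint & ((1 << remaining_bits) - 1)
    else:
        sampled = fingerprint & ((1 << sample_bits) - 1)
        remaining = fingerprint >> sample_bits
    if sampled != pattern:
        return None  # fingerprint is dropped; the next shingle is processed
    return remaining  # twelve bits passed on to the comparator
```

With four sample bits, roughly one in sixteen uniformly distributed fingerprints would be expected to match a given pattern, which is how sampling reduces the work passed to the comparator.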
In some embodiments, parallel fingerprinting circuits can use different sample bit patterns. For example, if there are eight fingerprinting circuits in parallel, the content locality cache can use eight different bit patterns. Other numbers of parallel fingerprinting circuits can be chosen based on performance needs of the content locality cache. For example, some embodiments of sample bits for use with fingerprint computation circuit 1340a can include s0, s1, s2, s3, . . . =(0000), (1010), (0101), . . . , (0001), or other permutations of bits. For example, sample bitmask 5110b can implement s0=0000 for a first fingerprint computation circuit 1340a, s1=1010 for a second fingerprint computation circuit 1340a, s2=0101 for a third fingerprint computation circuit 1340a, etc., through s7=0001 for an eighth fingerprint computation circuit 1340a. Sample bitmask 5110b illustrates how a bitmask can be created based on sample bits. For example, for sample bit pattern (0001), sample bitmask 5110b can accept four inputs, one input corresponding to each bit. Inputs corresponding to logical 0 can enter sample bitmask 5110b directly, such as the leftmost three inputs illustrated in sample bitmask 5110b. Inputs corresponding to logical 1 can enter sample bitmask 5110b via an inverter, or logical not, such as the rightmost input illustrated in sample bitmask 5110b. In this manner, an administrator can create a sample bitmask 5110b gate or circuit corresponding to s0, . . . , s7 as described above. Fingerprint circuit 1340a can then sample fingerprints having high order bits that match the sample bit patterns.
Sampling can result in intermediate fingerprint 5114. After sampling, fingerprint circuit 1340a can compare intermediate fingerprint 5114 with a previously saved fingerprint in fingerprint buffer 5120 to determine whether intermediate fingerprint 5114 is larger (or smaller, depending on whether a maximal or minimal fingerprint is desired). If comparator 1340b determines intermediate fingerprint 5114 to be larger, fingerprint circuit 1340a can save intermediate fingerprint 5114 to fingerprint buffer 5120. Otherwise, fingerprint circuit 1340a can drop intermediate fingerprint 5114 and can keep the previously saved fingerprint in fingerprint buffer 5120.
The fingerprint circuit can receive a shingle for processing (step 5022). The fingerprint circuit can generally process multiple shingles of a data block in succession to compute a signature. Furthermore, in some embodiments multiple fingerprint circuits can be arranged in parallel to compute multiple corresponding signatures in parallel for a data block.
Determining a first intermediate fingerprint by processing the received shingle based on a random irreducible polynomial (step 5202) can include applying a random irreducible polynomial to the received shingle. In some embodiments, the random irreducible polynomial can be chosen based on a polynomial of a prime number p so as to be irreducible relative to the received shingle. Examples of random irreducible polynomials can include $F_1 = (b_1 p^7 + b_2 p^6 + b_3 p^5 + b_4 p^4 + b_5 p^3 + b_6 p^2 + b_7 p + b_8) \bmod M$, where $b_i$ denotes the i′th byte string of the shingle and p and M are constants. For example,
In some embodiments, determining the first intermediate fingerprint by processing the received shingle based on a random irreducible polynomial (step 5202) can include using fast table lookups to speed computation of the random irreducible polynomial. For example, a lookup table in the fingerprint circuit can pre-compute and store possible values of $b_i p^8$. Therefore, when the fingerprint circuit determines $F_{i+1}$ based on $F_i$, the value of $b_i p^8$ used in the formula can be obtained via a relatively faster table lookup rather than a relatively slower multiplication or left shift.
Speeding the fingerprint processing by sampling a subset of bits from the first intermediate fingerprint (step 5204) can include using a bit mask to sample the subset of bits. An example of a subset of bits from the first intermediate fingerprint can be about 4 bits such as the lower order 4 bits. If the sampled subset of bits is determined to differ from the bit mask pattern (step 5104: No), method 5230 can abort the fingerprint processing for the received shingle (step 5106). In this manner, embodiments of the sampling can speed the fingerprint processing by reducing the number of intermediate fingerprints or samples for the fingerprint circuit to process. In other words, some embodiments of the fingerprint circuit can process only intermediate fingerprints whose subset of bits matches the sample bit mask.
If the sampled subset of bits is determined to match the bit mask pattern (step 5104: Yes), method 5230 can determine a second intermediate fingerprint based on a remaining subset of bits from the first intermediate fingerprint (step 5108). If an example first intermediate fingerprint is about 16 bits, the sampling can leave a remaining subset of about 12 bits for further processing as the second intermediate fingerprint. Although intermediate fingerprint sizes of 16 bits and 12 bits are described herein, the intermediate fingerprint sizes can vary based on the size and/or contents of a data block. In some embodiments, determining the second intermediate fingerprint can result in a 12-bit second intermediate fingerprint for comparison with a previous 12-bit second intermediate fingerprint in the fingerprint buffer.
Method 5230 can determine whether the second intermediate fingerprint is more representative of the overall contents of the block than a previous fingerprint (step 4820). Because the fingerprint circuits can process more than one shingle, the previous fingerprint stored in the fingerprint buffer can be a representative fingerprint for all shingles processed so far in sequence by the fingerprint circuit. Some shingles can be expected to result in intermediate fingerprints that are relatively higher or lower. In some embodiments, determining whether the intermediate fingerprint is more representative can include selecting an intermediate or previous fingerprint that is maximal (or minimal) for all shingles processed by the fingerprint circuit. Selecting a maximal or minimal fingerprint can generally result in a better measure of the content of the received block by sampling shingles.
If the second intermediate fingerprint is determined to be more representative of the contents of the block (step 4820: Yes), the second intermediate fingerprint can be stored in the fingerprint buffer as the representative fingerprint (step 4822). For example, determining the intermediate fingerprint to be more or less representative of the contents of the block can include determining whether the intermediate fingerprint is greater than or less than the previous fingerprint, depending on whether a maximal or minimal fingerprint is used for sampling. If the intermediate fingerprint is determined to be more representative, the intermediate fingerprint can therefore replace the previous fingerprint that was initially stored in the fingerprint buffer. If the intermediate fingerprint is determined to be less representative of the contents of the block (step 4820: No), method 5230 can keep the previous fingerprint in the fingerprint buffer as the representative fingerprint for the shingles that have been sampled by the fingerprint circuit (step 4824).
In some embodiments, fingerprint circuit 1340a can generate a random irreducible polynomial such as polynomial 5202a for a shingle of data 5212. Further description on generating random irreducible polynomials for each shingle is disclosed in Udi Manber, “Finding Similar Files in a Large File System,” 1994 USENIX Tech Conference, the entire contents of which are incorporated by reference herein.
Polynomial 5202a can denote the byte string corresponding to data 5212 by b1, b2, b3, . . . , bn. In some embodiments, taking the shingle size to be eight bytes, fingerprint circuit 1340a can determine intermediate fingerprint 5220 to be:
$$F_1 = (b_1 p^7 + b_2 p^6 + b_3 p^5 + b_4 p^4 + b_5 p^3 + b_6 p^2 + b_7 p + b_8) \bmod M$$
where p and M are constants. For example, fingerprint circuit 1340a illustrates an example in which p=7 with a shingle size of eight bytes. In general, p can be any prime number. Constant M can be determined based on fingerprint length. For example,
In some embodiments, fingerprint circuit 1340a can use Horner's formula to calculate F1 in polynomial 5202a:
$$F_1 = (p\,(\cdots(p\,(p\,b_1 + b_2) + b_3)\cdots) + b_8) \bmod M.$$
Furthermore, fingerprint circuit 1340a can calculate second fingerprint F2 (5202b) based on fingerprint F1 (5202a) and adder 5206 as follows:
$$F_2 = (p\,(F_1 - b_1 p^7) + b_9) \bmod M$$
The result of adder 5206 can be stored in intermediate fingerprint 5220. In some embodiments, intermediate fingerprint 5220 can be sixteen bits. Some embodiments of fingerprint circuit 1340a can calculate fingerprints recursively for the rest of the shingles.
In some embodiments, fingerprint circuit 1340a can precompute possible values of $b_i p^8$, and store the precomputed values in lookup table 5204. For example, fingerprint circuit 1340a can precompute all 256 possible values of $b_i p^8$. During signature computation, in some embodiments fingerprint circuit 1340a can look up in lookup table 5204 to find a desired value corresponding to a current byte value under analysis. Fingerprint circuit 1340a can then perform addition using adder 5206 to obtain intermediate fingerprint 5220. In some embodiments, intermediate fingerprint 5220 can be sixteen bits.
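As a non-limiting illustration, the recursive fingerprint computation with a precomputed table can be sketched as follows. The sketch uses p=7 and an eight-byte shingle as in the example above; the modulus M = 2^16 is an assumption chosen to give a sixteen-bit fingerprint, and the names `first_fingerprint`, `roll`, and `all_fingerprints` are hypothetical. Expanding the recursive formula gives F2 = (p·F1 − b1·p^8 + b9) mod M, which is why the table stores b·p^8 for every byte value b.

```python
from typing import List

P = 7                # example prime from the description
M = 1 << 16          # assumed modulus for a sixteen-bit fingerprint
SHINGLE_SIZE = 8     # bytes per shingle

# Precompute b * p^8 mod M for all 256 possible byte values.
LOOKUP_TABLE = [(b * P ** SHINGLE_SIZE) % M for b in range(256)]


def first_fingerprint(window: bytes) -> int:
    """Evaluate the polynomial for one shingle using Horner's formula."""
    value = 0
    for byte in window:
        value = (value * P + byte) % M
    return value


def roll(previous: int, outgoing: int, incoming: int) -> int:
    """Derive the next fingerprint from the previous one with a table lookup."""
    return (P * previous - LOOKUP_TABLE[outgoing] + incoming) % M


def all_fingerprints(data: bytes) -> List[int]:
    """Return the fingerprint of every overlapping shingle in `data`."""
    if len(data) < SHINGLE_SIZE:
        return []
    value = first_fingerprint(data[:SHINGLE_SIZE])
    result = [value]
    for index in range(SHINGLE_SIZE, len(data)):
        value = roll(value, data[index - SHINGLE_SIZE], data[index])
        result.append(value)
    return result
```

Each step after the first costs one multiplication by the constant p, one table lookup, and two additions, regardless of the shingle size.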
Fingerprint circuit 1340a can use sample bitmask 5110a to mask off, or select, sample bits that match low order bits of intermediate fingerprint 5220. For example, bitmask 5110a can be four bits that match four low order bits of intermediate fingerprint 5220. If logic gate 5210 determines that the low order bits of intermediate fingerprint 5220 match the masked sample bit pattern, fingerprint circuit 1340a can select higher order bits of intermediate fingerprint 5220 as intermediate fingerprint 5222. For example, logic gate 5210 can be a logical AND gate that passes through the higher order bits only if the lower order bits match bitmask 5110a. In some embodiments, fingerprint circuit 1340a can select the higher order twelve bits of intermediate fingerprint 5220 to determine intermediate fingerprint 5222. If the lower order four bits of intermediate fingerprint 5220 do not match the sample bits encoded in bitmask 5110a, fingerprint circuit 1340a can drop the fingerprint.
In some embodiments, parallel fingerprinting circuits can use different sample bit patterns. For example, if there are eight fingerprinting circuits in parallel, the content locality cache can use eight different bit patterns. Other numbers of parallel fingerprinting circuits can be chosen based on performance needs of the content locality cache. For example, some embodiments of sample bits for use with fingerprint computation circuit 1340a can include s0, s1, s2, s3, . . . =(0000), (1010), (0101), . . . , (0001), or other permutations of bits. For example, sample bitmask 5110b can implement s0=0000 for a first fingerprint computation circuit 1340a, s1=1010 for a second fingerprint computation circuit 1340a, s2=0101 for a third fingerprint computation circuit 1340a, etc., through s7=0001 for an eighth fingerprint computation circuit 1340a. Sample bitmask 5110b illustrates how a bitmask can be created based on sample bits. For example, for sample bit pattern (0001), sample bitmask 5110b can accept four inputs, one input corresponding to each bit. Inputs corresponding to logical 0 can enter sample bitmask 5110b directly, such as the leftmost three inputs illustrated in sample bitmask 5110b. Inputs corresponding to logical 1 can enter sample bitmask 5110b via an inverter, or logical not, such as the rightmost input illustrated in sample bitmask 5110b. In this manner, an administrator can create a sample bitmask 5110b gate or circuit corresponding to s0, . . . , s7 as described above. Fingerprint circuit 1340a can then sample fingerprints having low order bits that match the sample bit patterns.
In some embodiments, sampling can result in intermediate fingerprint 5222. For example, intermediate fingerprint 5222 can be twelve bits after the low order four bits have been masked off. After sampling, fingerprint circuit 1340a can compare intermediate fingerprint 5222 with a previously saved fingerprint in fingerprint buffer 5208 using comparator 1340b to determine whether intermediate fingerprint 5222 is larger (or smaller, depending on whether a maximal or minimal fingerprint is desired). If comparator 1340b determines intermediate fingerprint 5222 to be larger, fingerprint circuit 1340a can save intermediate fingerprint 5222 to fingerprint buffer 5208. Otherwise, fingerprint circuit 1340a can drop intermediate fingerprint 5222 and can keep the previously saved fingerprint in fingerprint buffer 5208. The resulting fingerprint stored in fingerprint buffer 1340c can be used as part of a sketch of a data block corresponding to data 5212.
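The sampling and comparison steps described above can be modeled in software as follows. This is a minimal sketch, assuming a 16-bit intermediate fingerprint, a 4-bit sample pattern, and a maximal-fingerprint policy; the function names `sample_fingerprint` and `max_sampled_fingerprint` are hypothetical and do not correspond to any circuit element named in this disclosure.

```python
from typing import Iterable, Optional

SAMPLE_BITS = 4                          # assumed width of the sample pattern
SAMPLE_MASK = (1 << SAMPLE_BITS) - 1     # selects the low order four bits


def sample_fingerprint(fp16: int, pattern: int) -> Optional[int]:
    """Return the high order 12 bits if the low order 4 bits equal `pattern`.

    Returns None when the fingerprint is dropped (low order bits do not match).
    """
    if (fp16 & SAMPLE_MASK) != pattern:
        return None
    return fp16 >> SAMPLE_BITS


def max_sampled_fingerprint(fps: Iterable[int], pattern: int) -> Optional[int]:
    """Keep the largest sampled fingerprint, mimicking a compare-and-save buffer."""
    saved = None                         # plays the role of the fingerprint buffer
    for fp in fps:
        candidate = sample_fingerprint(fp, pattern)
        if candidate is not None and (saved is None or candidate > saved):
            saved = candidate            # candidate is larger: overwrite the buffer
    return saved
```

Running one such function per sample pattern (for example, eight patterns for eight parallel circuits) would yield one fingerprint per pattern, which together could form a sketch of the data block.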
Periodically, the content locality cache can use scan logic to scan independent blocks in the background, to identify new reference blocks and associated delta blocks. In some embodiments, during each scan cycle the scan logic can iterate over independent blocks starting with most recently used blocks to least recently used blocks. For each block, the content locality cache can accumulate a popularity measure for the block by adding popularity values corresponding to fingerprints of a related sketch. If the popularity exceeds a predetermined threshold, the independent block may become a reference candidate. The reference candidate blocks can then participate in similarity detection to identify associated blocks that can be delta compressed to small enough deltas. During the scan process, in some embodiments RAM cache can be used as temporary storage. For example, the RAM can store intermediate data until blocks are classified and stored in their respective data area in the nonvolatile data array.
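The popularity accumulation performed during a scan cycle can be sketched as follows. This is an illustrative model only: it assumes each independent block is represented by an identifier and a list of fingerprints (its sketch), and that popularity values are kept in a dictionary keyed by fingerprint. The name `find_reference_candidates` and both data layouts are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def find_reference_candidates(
    blocks_mru_to_lru: Sequence[Tuple[str, Sequence[int]]],
    popularity: Dict[int, int],
    threshold: int,
) -> List[str]:
    """Return ids of independent blocks whose summed popularity exceeds threshold.

    `blocks_mru_to_lru` is assumed to be ordered from most recently used
    to least recently used, matching the scan order described above.
    """
    candidates = []
    for block_id, sketch in blocks_mru_to_lru:
        # Sum the popularity values of every fingerprint in the block's sketch.
        score = sum(popularity.get(fp, 0) for fp in sketch)
        if score > threshold:
            candidates.append(block_id)
    return candidates
```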
While selecting reference blocks, one consideration is that the distance, in terms of similarity, between any two reference blocks 5202 should be large enough that each reference block forms the center of a cluster surrounded by associated blocks 5204a, 5204b. This consideration can have a direct impact on I/O performance in addition to content popularity. For example, let blocks R3 (reference block) and A3 (associated block) both have a high popularity value, and further assume R3 and A3 are very similar in content. The content locality cache can select one block as a reference block (e.g., R3) while selecting the other block as an associated delta block (e.g., A3). In contrast, if both R3 and A3 were classified as reference blocks, the number of associated blocks would be much smaller than if blocks R3 and R2 were identified as reference blocks. This is because reference block R2 could be far away from reference block R3 in similarity. Selecting reference blocks with an appropriate distance in similarity may give rise to larger numbers of possible associated blocks.
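One way to picture the distance consideration is a greedy pass over reference candidates that accepts a candidate only if it is sufficiently dissimilar from every reference already chosen. The sketch below is hypothetical: the distance measure (size of the symmetric difference between two sketches), the greedy strategy, and the names `sketch_distance` and `select_references` are all assumptions for illustration, not the selection method of any particular embodiment.

```python
from typing import Dict, List, Sequence


def sketch_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Assumed distance: number of fingerprints present in only one sketch."""
    return len(set(a) ^ set(b))


def select_references(
    candidates: Sequence[str],
    sketches: Dict[str, Sequence[int]],
    min_distance: int,
) -> List[str]:
    """Greedily keep candidates at least `min_distance` from all chosen references."""
    references: List[str] = []
    for cand in candidates:
        far_enough = all(
            sketch_distance(sketches[cand], sketches[ref]) >= min_distance
            for ref in references
        )
        if far_enough:
            references.append(cand)
        # Otherwise the candidate is left to become an associated delta block.
    return references
```

With sketches in which R3 and A3 share most fingerprints and R2 shares none, this pass would keep R3 and R2 as references and leave A3 as a potential associated block, mirroring the example in the preceding paragraph.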
In some embodiments, the periodical scanning can be triggered either after a fixed number of I/O operations or a fixed amount of time. For example, the scanning can be triggered after a predetermined threshold number of I/O operations, e.g., 20,000 I/O operations. Therefore, the content locality cache can use a counter or timer/idle detector 1334 for this purpose (shown in
Eviction logic may identify cached blocks to evict by updating a least recently used (LRU) counter in a status bit field of a tag array corresponding to each cached block. For example, upon a cache miss, the LRU counter of the newly cached block may be set to a maximal value. All LRU counters corresponding to data blocks in cache may be decremented by 1. Upon a cache hit, the LRU counter of the hit block may be set to maximal. LRU counters of other blocks that were smaller than the original LRU value of the accessed block may be decremented by 1. In this way, the system preserves a least recently used (LRU) ordering of all cached data blocks. For example, if a set size of the cache is 1 MB, the systems may use 8-bit LRU counters for a block size of 4 KB. For larger set sizes, the systems may use longer LRU counters. For example, a 32-bit LRU counter may be able to accommodate a set size of up to 16 TB.
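The counter sizing and update behavior can be illustrated with a short software model. This is a sketch under stated assumptions, not the tag array logic itself: on a hit it uses one common counter-based LRU formulation, in which only counters greater than the hit block's previous value are decremented so that counter values remain distinct, and this may differ in detail from the embodiment described above. The names `lru_counter_bits` and `LruCounters` are hypothetical, and the caller is assumed to evict a victim before inserting into a full set.

```python
from typing import Dict


def lru_counter_bits(set_size_bytes: int, block_size_bytes: int) -> int:
    """Bits needed to give every block in a set a distinct counter value."""
    blocks = set_size_bytes // block_size_bytes
    return max(1, (blocks - 1).bit_length())


class LruCounters:
    """Software model of per-block LRU counters (assumed formulation)."""

    def __init__(self, bits: int) -> None:
        self.max_value = (1 << bits) - 1
        self.counters: Dict[str, int] = {}

    def on_miss(self, block_id: str) -> None:
        # Age every cached block by one, then insert the new block at maximum.
        for b in self.counters:
            self.counters[b] = max(self.counters[b] - 1, 0)
        self.counters[block_id] = self.max_value

    def on_hit(self, block_id: str) -> None:
        old = self.counters[block_id]
        # Assumption: blocks more recent than the hit block move down one place.
        for b, value in self.counters.items():
            if value > old:
                self.counters[b] = value - 1
        self.counters[block_id] = self.max_value

    def victim(self) -> str:
        # The block with the lowest counter is the least recently used.
        return min(self.counters, key=self.counters.get)
```

For a 1 MB set of 4 KB blocks, `lru_counter_bits` gives 8, and for a 16 TB set it gives 32, in line with the counter widths given above.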
When cache is full, a cache miss may trigger eviction of another cached block to make room for caching the missed block. For example, the eviction logic may select the cache block with the lowest LRU counter value. If the selected block is an independent block, the systems can simply replace the independent block and write the independent block back to primary storage if the independent block is in dirty state. If the selected block is an associated delta block in dirty state, the eviction logic may trigger a decompression operation with respect to the reference block identified by a reference pointer of the associated delta block. After the decompression, the recreated block may be written back to the primary storage. Lastly, if the selected block is a reference block, the eviction logic may find all related associated blocks with matching reference pointers. For example, the eviction logic may perform an associative search in the tag array to identify the related associated blocks. All such matched associated blocks may be evicted together with the reference block. In practice, eviction of a reference block is expected to be a rare event because any time an associated block is accessed by an I/O operation, the corresponding reference block is also accessed. Therefore, reference blocks exhibit a much higher chance to be on top of the LRU list compared with other blocks. If a reference block ends up falling down to the bottom of the LRU list, in practice the chances are that the corresponding associated blocks in the cache are no longer active. I.e., the corresponding associated blocks have not been referenced by I/O operations for a long time. These corresponding associated blocks have therefore either already been evicted from the cache, or should be evicted.
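The three eviction cases can be summarized in a small planning routine that returns the list of actions the eviction logic would take. This is a simplified sketch: the `Tag` record, the block kind strings, and the textual action list are assumptions for illustration, and the handling of dirty associated blocks that are evicted along with their reference block (decompress, then write back) is an assumption extrapolated from the single delta block case.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Tag:
    kind: str                    # "independent", "delta", or "reference"
    lru: int                     # LRU counter value from the tag array
    dirty: bool = False
    ref: Optional[str] = None    # reference pointer (delta blocks only)


def _evict_delta(block_id: str, tag: Tag) -> List[str]:
    actions = []
    if tag.dirty:
        # A dirty delta must be rebuilt from its reference before write-back.
        actions.append(f"decompress {block_id} against {tag.ref}")
        actions.append(f"writeback {block_id}")
    actions.append(f"evict {block_id}")
    return actions


def plan_eviction(tags: Dict[str, Tag]) -> List[str]:
    """Pick the block with the lowest LRU counter and list the eviction steps."""
    victim = min(tags, key=lambda b: tags[b].lru)
    tag = tags[victim]
    if tag.kind == "independent":
        steps = [f"writeback {victim}"] if tag.dirty else []
        return steps + [f"evict {victim}"]
    if tag.kind == "delta":
        return _evict_delta(victim, tag)
    # Reference block: associative search for deltas pointing at the victim.
    steps = []
    for block_id, other in tags.items():
        if other.kind == "delta" and other.ref == victim:
            steps.extend(_evict_delta(block_id, other))
    return steps + [f"evict {victim}"]
```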
Average I/O access time with the content locality cache may be expressed by
$$T_{Ave} = H_R \cdot T_H + (1 - H_R) \cdot T_M \qquad (8)$$
where HR represents a cache hit ratio, TH represents an access time upon a cache hit, and TM represents access time upon a cache miss. Graphs 5402, 5404 illustrate expected hardware speedup as a function of HR. The present disclosure first derives a number of equations for representing hardware speedup as a function of HR and other factors, and then applies the equations to explain graphs 5402, 5404.
In some embodiments, whenever the I/O request rate gets high and approaches the I/O service rate, there may be a queuing effect to queue requests for servicing. In this case, analysis of average I/O access time may increase in complexity. One simplification may be to assume that both request process and service process follow a Poisson distribution (i.e., a probabilistic memoryless distribution). With this simplification, average I/O access time may be given with simplified formulas as follows.
Let the I/O request rate, i.e., the number of I/O requests received by the storage system per second, be represented by λ. Let the service rate, i.e., the number of I/Os served by the storage system per second, be represented by μ. If the disk access time is assumed to be 10 ms, then μ=1/10 ms=100 IOPS (I/O operations per second). With cache, if the average I/O service latency is 500 us (microseconds), then μ=1/500 us=2,000 IOPS. Traffic intensity, or queue utilization (i.e., the proportion of time that primary storage is busy), ρ, may thereby be given by
ρ=λ/μ, where ρ is expected to be less than 1
Average I/O time including queuing delay may then be given by (M/M/1 queue)
$$T_{total} = \frac{1}{\mu - \lambda}. \qquad (9)$$
Accordingly, service rate μ may become
μ=1/TAve=1/(HR*TH+(1−HR)*TM). (10)
When μ is close to λ, I/O latency may become large. Therefore, the content locality cache may benefit from limiting I/O latency by maximizing HR while minimizing TH and TM to keep the systems stable.
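The relationships among hit ratio, service rate, utilization, and queuing delay can be checked numerically with a few small functions. This is a minimal sketch: the function names are hypothetical, all times are assumed to be in seconds, and the mean time in system uses the standard M/M/1 result 1/(μ − λ), which holds only while λ is less than μ.

```python
def average_access_time(hit_ratio: float, t_hit: float, t_miss: float) -> float:
    """Average access time: hit_ratio * t_hit + (1 - hit_ratio) * t_miss."""
    return hit_ratio * t_hit + (1.0 - hit_ratio) * t_miss


def service_rate(t_ave: float) -> float:
    """Service rate in I/Os per second: the reciprocal of average access time."""
    return 1.0 / t_ave


def utilization(request_rate: float, mu: float) -> float:
    """Traffic intensity: request rate divided by service rate."""
    return request_rate / mu


def mean_time_in_system(request_rate: float, mu: float) -> float:
    """Standard M/M/1 mean time in system, including queuing delay."""
    if request_rate >= mu:
        raise ValueError("queue is unstable: request rate must be below service rate")
    return 1.0 / (mu - request_rate)
```

For example, with a 500 us average service latency the service rate is 2,000 IOPS; at a request rate of 1,000 IOPS the utilization is 0.5 and the mean time per I/O, including queuing, is 1 ms, double the bare service latency.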
Returning to graphs 5402, 5404,
One interesting note is that the hardware speedup changes from high to low and then to high again when cache hit ratio increases from 70% to 98% as illustrated in graphs 5402, 5404. The reason for these speedup changes may be explained as follows. At lower cache hit ratio, average I/O access time calculated using Equation (8) is large, resulting in the service rate μ (Equation (10)) being close to the host I/O request rate λ. As a result, queuing delay may become large and therefore any latency improvement can result in great performance gain. As hit ratio HR increases, the queuing effect reduces because the service rate μ increases with respect to the fixed request rate λ. However, as the hit ratio increases further, the cache access time becomes a significant portion of the total I/O time. Therefore, the hardware speedup increases again as shown in graphs 5402, 5404.
Graphs 5502, 5504 illustrate that queuing effect ρ is not significant because host request rate λ may be much smaller than service rate μ of the example storage system. Similar to
As the SSD latency reduces following the technology trend, the content locality caching is likely to show increasing performance advantages. To quantitatively analyze such trend, graphs 5602, 5604 plot expected hardware speedup for different SSD access times. Graphs 5602, 5604 illustrate hypothetically decreasing SSD access latency for both hardware implementations and software implementations. Graphs 5602, 5604 keep all other parameters similar to the parameters illustrated in graphs 5502, 5504 (shown in
In virtualized environments such as environments running multiple virtual machines (VMs), storage I/O has become a performance bottleneck. Reasons include: (1) multiple VMs may share primary storage, which may cause the primary storage to be a bottleneck, and (2) aggregated I/O operations from multiple virtual machines may appear mostly random from the perspective of the primary storage. First, multiple virtual machines (VMs) on a hypervisor may share storage I/O devices. A hypervisor refers to a separate “virtual machine monitor” running on the system that manages operation of multiple VMs. Each VM may have its own OS image and application environment stored on primary data stores. These OS images and application environments may create a burden of I/O contention, thereby causing bottlenecks at primary storage. Second, although I/O operation streams of individual VMs may show some spatial locality with sequential I/O operations, aggregated I/O operations from the perspective of the storage device may appear mostly randomized. Accordingly, the primary storage may perform poorly, exacerbating adverse bottleneck effects.
Graph 5702 illustrates that the content locality based cache may be expected to improve VM performance when compared to traditional cache solutions. Accordingly, the content locality based cache may support more VMs on a single hypervisor. The content locality based cache may boost VM performance in two independent ways: (1) decreasing latencies of random I/O operations, and (2) exploiting content locality of OS images and application code, in addition to content locality of data. First, effectively caching hot data in SSD may be expected to dramatically decrease latencies of random I/O operations, by eliminating a number of random seeks and rotation delays associated with HDDs. Second, OS images and application code of multiple VMs running on the hypervisor may be mostly similar in data content. Therefore, OS images and application code may also be expected to benefit from content locality. The systems may further take advantage of content locality of data being accessed. As a result, the caching may reduce active data footprints stored in SSD cache, which may increase cache efficiency. If the content locality cache is implemented in hardware, some embodiments may omit the need to have special software running on a hypervisor (except for generic driver software for the hardware). In other words, the corresponding caching functions may be offloaded and run on a hardware implementation, a custom ASIC, or firmware on the primary storage device.
Graphs 5802, 5804 analyze potential or expected benefits of a hardware implementation in virtual environments having multiple virtual machines (VMs). For example, suppose that each VM running on a hypervisor using content locality caching requires certain IOPS to run with a maximally tolerable I/O latency, TMax. With these I/O constraints, graphs 5802, 5804 illustrate an example analysis of how many VMs the hypercache can support in a hardware implementation (No_VM_HW) and in a software implementation (No_VM_SW). Let NVM be the number of VMs that can run on the hypervisor with the above I/O requirements. Equation (9) results in:
$$T_{Max} = T_{total} = \frac{1}{1/T_{Ave} - N_{VM} \cdot IOPS},$$
which leads to:
$$N_{VM} = \frac{T_{Max} - T_{Ave}}{T_{Max} \cdot T_{Ave} \cdot IOPS}. \qquad (11)$$
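Equation (11) can be evaluated directly to estimate VM capacity for a given latency budget. The following is a minimal sketch; the function name `vm_capacity` is hypothetical, times are assumed to be in seconds, and the result is returned as a real number that a caller would round down to a whole number of VMs.

```python
def vm_capacity(t_max: float, t_ave: float, iops_per_vm: float) -> float:
    """Evaluate N_VM = (T_Max - T_Ave) / (T_Max * T_Ave * IOPS).

    Returns 0.0 when the average access time already meets or exceeds the
    latency budget, since no VM could then be served within T_Max.
    """
    if t_ave >= t_max:
        return 0.0
    return (t_max - t_ave) / (t_max * t_ave * iops_per_vm)
```

As an illustrative calculation with assumed values, an average access time of 500 us, a tolerable latency of 5 ms, and 100 IOPS per VM give a capacity of about 18 VMs. This is consistent with the queuing model: 18 VMs generate 1,800 IOPS against a service rate of 2,000 IOPS, so the mean time per I/O is 1/(2,000 − 1,800) = 5 ms.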
Based on Equation (11),
Furthermore, some embodiments may also reduce memory pressures that many VMs experience, by offloading content locality based cache functions to a hardware implementation. VM companies have proposed techniques for reducing memory pressure such as ballooning, page sharing, and swapping. Regardless of such techniques, available physical memory to VMs remains a limiting factor for the number of VMs that a hypervisor can support. If content locality based caching is implemented in software, the cache can require at least some amount of memory, thereby competing with memory available to VMs. Therefore, offloading content locality based caching to a hardware implementation on a storage device may be expected to increase the number of VMs that can run on a hypervisor on the same server hardware.
The present disclosure has presented a hardware implementation of a cache design using solid state drive (SSD) technology that exploits content locality, temporal locality, and spatial locality of I/O operations. The hardware implementation may be easily implemented using simple hardware and intelligent processing units. Many caching functions may be carried out in parallel to normal I/O processes, with minimal overhead. In addition to effective cache functions, content locality based caching also offers advantages in terms of increasing SSD endurance, data reduction as a superset of deduplication, and excellent scalability for clusters of servers. The present disclosure has described approximate analysis that shows expected benefits of offloading caching to hardware implementations. The performance improvement of offloading the cache functions may be expected to be significant due to high speed hardware that manages caching in parallel to applications running on host. Furthermore, in some embodiments the overall performance gain of implementing caching on hardware may be amplified in virtualized environments, leading to increased number of virtual machines that can be supported and corresponding high I/O performance.
The methods and systems described herein may be deployed in part or in whole through a machine that executes computer software, program codes, and/or instructions on a processor. The processor may be part of a server, client, network infrastructure, mobile computing platform, stationary computing platform, or other computing platform. A processor may be any kind of computational or processing device capable of executing program instructions, codes, binary instructions and the like. The processor may be or include a signal processor, digital processor, embedded processor, microprocessor or any variant such as a co-processor (math co-processor, graphic co-processor, communication co-processor and the like) and the like that may directly or indirectly facilitate execution of program code or program instructions stored thereon. In addition, the processor may enable execution of multiple programs, threads, and codes. The threads may be executed simultaneously to enhance the performance of the processor and to facilitate simultaneous operations of the application. By way of implementation, methods, program codes, program instructions and the like described herein may be implemented in one or more threads. A thread may spawn other threads that may have assigned priorities associated with them; the processor may execute these threads based on priority or any other order based on instructions provided in the program code. The processor may include memory that stores methods, codes, instructions and programs as described herein and elsewhere. The processor may access a storage medium through an interface that may store methods, codes, and instructions as described herein and elsewhere.
The storage medium associated with the processor for storing methods, programs, codes, program instructions or other type of instructions capable of being executed by the computing or processing device may include but may not be limited to one or more of a CD-ROM, DVD, memory, hard disk, flash drive, RAM, ROM, cache and the like.
A processor may include one or more cores that may enhance speed and performance of a multiprocessor. In embodiments, the processor may be a dual core processor, quad core processor, other chip-level multiprocessor and the like that combine two or more independent cores (called a die).
The methods and systems described herein may be deployed in part or in whole through a machine that executes computer software on a server, client, firewall, gateway, hub, router, or other such computer and/or networking hardware. The software program may be associated with a server that may include a file server, print server, domain server, internet server, intranet server and other variants such as secondary server, host server, distributed server and the like. The server may include one or more of memories, processors, computer readable media, storage media, ports (physical and virtual), communication devices, and interfaces capable of accessing other servers, clients, machines, and devices through a wired or a wireless medium, and the like. The methods, programs or codes as described herein and elsewhere may be executed by the server. In addition, other devices required for execution of methods as described in this application may be considered as a part of the infrastructure associated with the server.
The server may provide an interface to other devices including, without limitation, clients, other servers, printers, database servers, print servers, file servers, communication servers, distributed servers and the like. Additionally, this coupling and/or connection may facilitate remote execution of a program across the network. The networking of some or all of these devices may facilitate parallel processing of a program or method at one or more locations without deviating from the scope. In addition, any of the devices attached to the server through an interface may include at least one storage medium capable of storing methods, programs, code and/or instructions. A central repository may provide program instructions to be executed on different devices. In this implementation, the remote repository may act as a storage medium for program code, instructions, and programs.
The software program may be associated with a client that may include a file client, print client, domain client, internet client, intranet client and other variants such as secondary client, host client, distributed client and the like. The client may include one or more of memories, processors, computer readable media, storage media, ports (physical and virtual), communication devices, and interfaces capable of accessing other clients, servers, machines, and devices through a wired or a wireless medium, and the like. The methods, programs or codes as described herein and elsewhere may be executed by the client. In addition, other devices required for execution of methods as described in this application may be considered as a part of the infrastructure associated with the client.
The client may provide an interface to other devices including, without limitation, servers, other clients, printers, database servers, print servers, file servers, communication servers, distributed servers and the like. Additionally, this coupling and/or connection may facilitate remote execution of a program across the network. The networking of some or all of these devices may facilitate parallel processing of a program or method at one or more locations without deviating from the scope. In addition, any of the devices attached to the client through an interface may include at least one storage medium capable of storing methods, programs, applications, code and/or instructions. A central repository may provide program instructions to be executed on different devices. In this implementation, the remote repository may act as a storage medium for program code, instructions, and programs.
The methods and systems described herein may be deployed in part or in whole through network infrastructures. The network infrastructure may include elements such as computing devices, servers, routers, hubs, firewalls, clients, personal computers, communication devices, routing devices and other active and passive devices, modules and/or components as known in the art. The computing and/or non-computing device(s) associated with the network infrastructure may include, apart from other components, a storage medium such as flash memory, buffer, stack, RAM, ROM and the like. The processes, methods, program codes, instructions described herein and elsewhere may be executed by one or more of the network infrastructural elements.
The methods, programs codes, and instructions described herein and elsewhere may be implemented on or through mobile devices. The mobile devices may include navigation devices, cell phones, mobile phones, mobile personal digital assistants, laptops, palmtops, netbooks, pagers, electronic books readers, music players and the like. These devices may include, apart from other components, a storage medium such as a flash memory, buffer, RAM, ROM and one or more computing devices. The computing devices associated with mobile devices may be enabled to execute program codes, methods, and instructions stored thereon. Alternatively, the mobile devices may be configured to execute instructions in collaboration with other devices. The mobile devices may communicate with base stations interfaced with servers and configured to execute program codes. The mobile devices may communicate on a peer to peer network, mesh network, or other communications network. The program code may be stored on the storage medium associated with the server and executed by a computing device embedded within the server. The base station may include a computing device and a storage medium. The storage device may store program codes and instructions executed by the computing devices associated with the base station.
The computer software, program codes, and/or instructions may be stored and/or accessed on machine readable media that may include: computer components, devices, and recording media that retain digital data used for computing for some interval of time; semiconductor storage known as random access memory (RAM); mass storage typically for more permanent storage, such as optical discs, forms of magnetic storage like hard disks, tapes, drums, cards and other types; processor registers, cache memory, volatile memory, non-volatile memory; optical storage such as CD, DVD; removable media such as flash memory (e.g. USB sticks or keys), floppy disks, magnetic tape, paper tape, punch cards, standalone RAM disks, Zip drives, removable mass storage, off-line, and the like; other computer memory such as dynamic memory, static memory, read/write storage, mutable storage, read only, random access, sequential access, location addressable, file addressable, content addressable, network attached storage, storage area network, bar codes, magnetic ink, and the like.
The methods and systems described herein may transform physical and/or intangible items from one state to another. The methods and systems described herein may also transform data representing physical and/or intangible items from one state to another.
The elements described and depicted herein, including in flow charts and block diagrams throughout the figures, imply logical boundaries between the elements. However, according to software or hardware engineering practices, the depicted elements and the functions thereof may be implemented on machines through computer executable media having a processor capable of executing program instructions stored thereon as a monolithic software structure, as standalone software modules, or as modules that employ external routines, code, services, and so forth, or any combination of these, and all such implementations may be within the scope of the present disclosure. Examples of such machines may include, but may not be limited to, personal digital assistants, laptops, personal computers, mobile phones, other handheld computing devices, medical equipment, wired or wireless communication devices, transducers, chips, calculators, satellites, tablet PCs, electronic books, gadgets, electronic devices, devices having artificial intelligence, computing devices, networking equipment, servers, routers and the like. Furthermore, the elements depicted in the flow chart and block diagrams or any other logical component may be implemented on a machine capable of executing program instructions. Thus, while the foregoing drawings and descriptions set forth functional aspects of the disclosed systems, no particular arrangement of software for implementing these functional aspects should be inferred from these descriptions unless explicitly stated or otherwise clear from the context. Similarly, it may be appreciated that the various steps identified and described above may be varied, and that the order of steps may be adapted to particular applications of the techniques disclosed herein. All such variations and modifications are intended to fall within the scope of this disclosure. 
As such, the depiction and/or description of an order for various steps should not be understood to require a particular order of execution for those steps, unless required by a particular application, or explicitly stated or otherwise clear from the context.
The methods and/or processes described above, and steps thereof, may be realized in hardware, software or any combination of hardware and software suitable for a particular application. The hardware may include a general purpose computer and/or dedicated computing device or specific computing device or particular aspect or component of a specific computing device. The processes may be realized in one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors or other programmable device, along with internal and/or external memory. The processes may also, or instead, be embodied in an application specific integrated circuit, a programmable gate array, programmable array logic, or any other device or combination of devices that may be configured to process electronic signals. It may further be appreciated that one or more of the processes may be realized as a computer executable code capable of being executed on a machine readable medium.
The computer executable code may be created using a structured programming language such as C, an object oriented programming language such as C++, or any other high-level or low-level programming language (including assembly languages, hardware description languages, and database programming languages and technologies) that may be stored, compiled or interpreted to run on one of the above devices, as well as heterogeneous combinations of processors, processor architectures, or combinations of different hardware and software, or any other machine capable of executing program instructions.
Thus, in one aspect, each method described above and combinations thereof may be embodied in computer executable code that, when executing on one or more computing devices, performs the steps thereof. In another aspect, the methods may be embodied in systems that perform the steps thereof, and may be distributed across devices in a number of ways, or all of the functionality may be integrated into a dedicated, standalone device or other hardware. In another aspect, the means for performing the steps associated with the processes described above may include any of the hardware and/or software described above. All such permutations and combinations are intended to fall within the scope of the present disclosure.
While the methods and systems described herein have been disclosed in connection with some embodiments shown and described in detail, various modifications and improvements thereon may become readily apparent to those skilled in the art. Accordingly, the spirit and scope of the methods and systems described herein is not to be limited by the foregoing examples, but is to be understood in the broadest sense allowable by law.
All documents referenced herein are hereby incorporated by reference.
This application is a continuation-in-part of U.S. patent application Ser. No. 13/615,422, filed Sep. 13, 2012, which further claims the benefit of U.S. Provisional Patent Application Ser. No. 61/534,915 filed Sep. 15, 2011; and U.S. Provisional Patent Application Ser. No. 61/533,990 filed Sep. 13, 2011. U.S. patent application Ser. No. 13/615,422 is a continuation of U.S. patent application Ser. No. 13/366,846, filed Feb. 6, 2012, which further claims the benefit of U.S. Provisional Patent Application Ser. No. 61/497,549 filed Jun. 16, 2011; U.S. Provisional Patent Application Ser. No. 61/447,208 filed Feb. 28, 2011; and U.S. Provisional Patent Application Ser. No. 61/441,976 filed Feb. 11, 2011; U.S. patent application Ser. No. 13/366,846 is a continuation of U.S. patent application Ser. No. 12/762,993 filed Apr. 19, 2010, which further claims the benefit of U.S. Provisional Patent Application Ser. No. 61/174,166 filed Apr. 30, 2009. The entire contents of each application are incorporated by reference herein.
Number | Date | Country | |
---|---|---|---|
61174166 | Apr 2009 | US | |
61534915 | Sep 2011 | US | |
61533990 | Sep 2011 | US | |
61497549 | Jun 2011 | US | |
61447208 | Feb 2011 | US | |
61441976 | Feb 2011 | US |
Relation | Number | Date | Country
---|---|---|---
Parent | 13366846 | Feb 2012 | US
Child | 13615422 | | US
Parent | 12762993 | Apr 2010 | US
Child | 13366846 | | US
Relation | Number | Date | Country
---|---|---|---
Parent | 13615422 | Sep 2012 | US
Child | 14332113 | | US