This application claims priority to U.S. provisional patent application Ser. No. 61/144,404 filed Jan. 13, 2009 and U.S. provisional patent application Ser. No. 61/144,395 filed Jan. 13, 2009, which are both herein incorporated by reference in their entirety.
Prefetching is a caching technique used to improve the performance of disk and memory systems. Whereas nominal caching increases performance by keeping copies of accessed data in the hope that it will be accessed again, prefetching loads the caching memory before an access to data in the hope it will be accessed soon. The information required for a successful prefetch includes:
The effectiveness of a disk prefetch is dependent on the correct prediction of the future read patterns over the disk. Predictions can be based on guesses or historical observation. An example of a typical guess involves the concept of “spatial locality,” which predicts that a future read is likely to occur in proximity by address to the last read. Historical observations involve recognizing patterns of access, such as address B always following address A and address C always following address B.
If the wrong data is prefetched, no accesses to the data will occur and no performance improvements will be realized. Likewise, if the right data is fetched at the wrong time, it may be replaced by other caching data before the access occurs. Incorrectly specifying the “keep time” will have a similar effect.
In a storage system, defining a prefetch sequence and effectively guessing what future data accesses will be is a computationally intensive and sometimes intractable task.
The complexity and computational requirements of tracking access patterns to a storage device are reduced so that a determination of data prefetch suitability can be made in real-time rather than after extensive analysis.
Referring to
In another embodiment, the client 10 may be a processor in a personal computer that accesses one or more disks 20 over an internal or external data bus. The storage system 14 in this embodiment could be located in the personal computer or server 10, or could also be a stand-alone device coupled to the computer/client 10 via a computer bus or packet switched network connection, such as a Small Computer System Interface (SCSI) connection.
The storage system 14 accepts reads and writes to disk 20 from client 10 and contains a tiering memory or media 16 used for accelerating the client 10 accesses to disk 20. In one embodiment, the tiering memory 16 could be any combination of Dynamic Random Access Memory (DRAM) and/or Flash memory. Of course, the tiering memory 16 could be implemented with any combination of memory devices that provide relatively faster data access than the disk 20.
A prefetch controller 18 includes any combination of software and/or hardware within storage system 14 that controls tiering memory 16. For example, the prefetch controller 18 could be a processor that executes software instructions to provide the prefetch operations. The prefetch controller 18 determines what data to prefetch, when to prefetch the data, and how long to store the prefetch data in tiering memory 16.
During a prefetch operation, controller 18 receives a storage access request from client 10. The controller 18 accesses the corresponding address in disk 20 and stores the data in tiering memory 16. The prefetch controller 18 also prefetches other data from disk 20 that is likely to be subsequently accessed by the client 10. If subsequent reads or writes from client 10 are for the data prefetched into tiering memory 16, storage system 14 returns the data directly from tiering memory 16. Such a direct return from faster tiering memory 16 to client 10 is referred to as a “hit” and improves the performance of applications running on client 10. For example, a memory access to disk 20 can take several milliseconds while a memory access to tiering memory 16 may be in the order of microseconds.
Prefetch controller 18 can operate in both a monitoring mode and an active mode. During the monitoring mode, the prefetch controller 18 records and analyzes read and/or write disk access operations in input stream 300 from client 10 to disk 20. The prefetch controller 18 then uses the monitored information to construct heuristics or histograms for performing subsequent tiering operations. When sufficient information has been gathered, the prefetch controller 18 switches from the monitoring mode to an active mode. The active mode prefetches data from disk 20 into tiering memory 16 according to the heuristics and histograms obtained during the monitoring mode. In another embodiment, prefetch controller 18 always operates in active mode and implements a default set of tiering operations (such as always prefetching from the next storage address) until the heuristics determine a better strategy.
Recording of disk accesses is performed by maintaining a log of the time, data address (location of the read or write), and length of the operation (number of addresses to read or write within one command). The address is often expressed in terms of blocks (such as a read of blocks 100-200) where disk 20 is viewed as a large contiguous region of blocks. The length of the disk access operation is similarly expressed as a number of blocks. Thus, every read or write from client 10 to disk 20 can be viewed as affecting a block range (from address to address plus length).
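The logged fields can be pictured with a short sketch. The following is only an illustration under stated assumptions: the class name `AccessRecord` and its field names are hypothetical and do not appear in the described system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AccessRecord:
    """One logged disk access (names are illustrative, not from the source)."""
    timestamp: float  # time the operation was observed, in seconds
    address: int      # starting block address of the read or write
    length: int       # number of blocks covered by the single command

    def block_range(self) -> Tuple[int, int]:
        # The affected range runs from address to address plus length.
        return (self.address, self.address + self.length)


# A read of blocks 100-200 expressed as a starting address and a length.
rec = AccessRecord(timestamp=0.0, address=100, length=100)
```

Here `rec.block_range()` yields the pair (100, 200), matching the block-range view of a disk access described above.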
To reduce the complexity of the analysis stage, and therefore the computational time, the read operations from the clients 10 to the disk 20 are compressed to a reduced version in a way that does not degrade the quality of prefetch predictions. This reduction lowers the computational requirements of the subsequent prediction.
Of course, this is just one example of how physical blocks of memory in disk 20 can be logically mapped to different logical regions. For example, each region may not necessarily contain 100 blocks and different regions may contain different numbers of blocks. Schemes used by the prefetch controller 18 for deriving the number and size of logical regions in disk 20 are described in U.S. patent application Ser. No. 12/605,119, entitled: STORAGE DEVICE PREFETCH SYSTEM USING DIRECTED GRAPH CLUSTERS, filed Oct. 23, 2009 and U.S. patent application Ser. No. 12/619,609, entitled: CLUSTER CONTROL PROTOCOL, filed Nov. 16, 2009 which are both herein incorporated by reference in their entirety.
A relatively longer time period exists between time t5 and time t6. This is represented by the jagged line on the right side of
In this example, memory access time to region 1 would be substantially improved by prefetching the entire region 1 of blocks 0-100 into tiering media 16 whenever there is an initial read to blocks 0-20. This is because all other blocks 21-100 are likely to subsequently be read by the client 10 after reading blocks 0-20. Prefetching blocks 21-100 into the faster tiering media 16 allows the storage system 14 to then provide faster access to the blocks in region 1 via tiering media 16 instead of slower access via the slower disk memory 20. Further, all of the blocks in region 1 are likely to be accessed within the relatively short time period between time t1 and time t5. Thus, none of the blocks in region 1 are unnecessarily stored in the tiering media 16 when the prefetch of region 1 is performed.
Only half of the prefetched blocks in region 1 could be used if all blocks 0-100 were prefetched into tiering media 16. For example, too much time exists between time t3 and t4 to keep all of the blocks in region 1 in tiering media 16 until time t5. Thus, the blocks in region 1 would have to be removed from tiering media 16 before blocks 21-40 and 81-100 could be read. Blocks 21-40 and 81-100 would then have to be reread from disk 20 at time t4.
The prefetch controller 18 identifies a timestamp, starting address, and read length for each read operation 302 in the storage accesses observed in input stream 300. For example, the prefetch controller 18 assigns “Time A, Read 50-70” to read operation 302A. The letter A represents a timestamp value, the value 50 refers to the starting block address for the read operation, and the range 50-70 corresponds to a read length of 20 blocks.
The prefetch controller 18 uses the starting address and read length to map each read operation 302 to a logical region using logic referred to as region mapping system 100. In one embodiment, region mapping system 100 is statically programmed utilizing previous read pattern analysis. This static programming is contained within a stored configuration register 200.
In an alternative embodiment, region mapping system 100 dynamically modifies configuration register 200 based on a monitored sequence of read operations, for example as described in U.S. patent application Ser. No. 12/605,119, entitled: STORAGE DEVICE PREFETCH SYSTEM USING DIRECTED GRAPH CLUSTERS, filed Oct. 23, 2009 and U.S. patent application Ser. No. 12/619,609, entitled: CLUSTER CONTROL PROTOCOL, filed Nov. 16, 2009, which have both been incorporated by reference in their entirety. In a further possible embodiment, configuration register 200 is managed by some external entity.
Region mapping system 100 interprets input stream 300 consisting of read operations 302 and produces an output stream 400 consisting of region operation elements 402. Each operation element 402 in output stream 400 contains a region number as opposed to the start address and read length of the original read operation. For example, read operation 302A “Time A, read 50-70” in input stream 300 is converted by the prefetch controller 18 into operation element 402A “Time A, Read 1” in output stream 400.
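The conversion from a read operation to a region operation element can be sketched as a lookup against a table of region boundaries. This is a minimal sketch and not the described implementation: the table `REGION_STARTS` stands in for configuration register 200, and its boundary values (other than region 1 covering blocks 0-100) are hypothetical.

```python
import bisect

# Hypothetical stand-in for configuration register 200: the first block of
# each region, in ascending order. Region 1 covers blocks 0-100 as in the
# example; the remaining boundaries are assumed for illustration.
REGION_STARTS = [0, 101, 201, 301]


def region_for(start_block: int) -> int:
    """Return the 1-based region number containing start_block."""
    # bisect_right counts how many region starts are <= start_block.
    return bisect.bisect_right(REGION_STARTS, start_block)


def to_region_element(timestamp, start_block, end_block):
    """Convert a read operation into a (timestamp, region) element.

    The start address and read length are dropped; only the region survives.
    end_block is accepted for symmetry with the read operation but is unused
    because this sketch maps a read by its starting address only.
    """
    return (timestamp, region_for(start_block))


element = to_region_element("A", 50, 70)
```

With these assumed boundaries, the read “Time A, Read 50-70” becomes the element ("A", 1), corresponding to “Time A, Read 1”. Because the table holds explicit start blocks, regions of different sizes can be expressed without changing the lookup.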
Output stream 400 is thus a transformation of input stream 300 such that the numeric range of regions is substantially lower than the range of read addresses possible in input stream 300. In a typical example, input stream 300 may span an address space from 0 to several hundred million (N×10^8) while output stream 400 contains at most 100,000 (1×10^5) regions. Thus, region mapping system 100 bounds the number of different read accesses that have to be further analyzed to a substantially smaller subset of regions.
The operation elements 402 of output stream 400 are input into a delay First In-First Out (FIFO) memory device 500, from which they are output at some pre-configurable time later as delayed output stream 600. The length of the delay (the time output stream 400 is retained in FIFO 500) is chosen based on experimental knowledge, with typical lengths of 5 minutes, 10 minutes, or 15 minutes. For example, a particular operation element 402A “Time A, Read 1” in output stream 400 is output as part of delayed output stream 600 five minutes after it is initially loaded into FIFO 500.
The FIFO 500 is controlled by the prefetch controller 18 so that any operation element 402 input into FIFO 500 is output in approximately the same amount of time data can be stored in the tiering media 16. For example, if data can only be stored in tiering media 16 for 5 minutes, then the operation elements 402 are only stored in FIFO 500 for 5 minutes. This allows the prefetch controller 18 to generate statistics in the form of temporal histograms that can then be used to identify good read patterns for prefetching. Of course other time delays could also be used based on the data access patterns of client 10 with disk 20 (
The time delay in FIFO 500 sets the pattern recognition time of a next analysis stage. In practice, delay FIFO 500 is designed to receive an arbitrary number of operation elements from output stream 400. Upon each insertion, software in the prefetch controller compares the timestamp of the inserted operation element 402 against the time stamp of the oldest operation element still in the FIFO 500. Conceptually, the oldest operation element is the next item output/removed from the bottom end of the FIFO 500.
If the time difference between the top/newest operation element 402 in FIFO 500 and the oldest operation element 402 at the bottom of FIFO 500 is longer than the configured time length (e.g., 5 minutes), the oldest operation element is removed from the FIFO 500. Removal continues until the oldest operation element 402 is less than the “time length” older than the last inserted element (e.g., less than 5 minutes). In this manner, delay FIFO 500 will contain operation elements 402 which differ in timestamp value by no more than the “time length” value.
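The insertion and removal behavior of the delay FIFO can be sketched with a double-ended queue. This is an illustrative sketch only: the class name `DelayFifo` and the choice of seconds as the timestamp unit are assumptions, and an element exactly "time length" old is kept here, a boundary case the description leaves open.

```python
from collections import deque


class DelayFifo:
    """Time-bounded FIFO of (timestamp, region) operation elements."""

    def __init__(self, time_length: float):
        self.time_length = time_length  # e.g. 300.0 seconds for 5 minutes
        self._items = deque()           # oldest element sits at the left

    def insert(self, timestamp: float, region: int):
        """Insert an element and return the elements removed as too old."""
        self._items.append((timestamp, region))
        removed = []
        # Compare the newest timestamp against the oldest one and remove
        # from the bottom until the span is within the configured length.
        # The queue is never empty here because the new element was appended.
        while timestamp - self._items[0][0] > self.time_length:
            removed.append(self._items.popleft())
        return removed

    def __len__(self) -> int:
        return len(self._items)


fifo = DelayFifo(time_length=300.0)
first = fifo.insert(0.0, 1)     # nothing old enough to remove
second = fifo.insert(100.0, 2)  # still within 5 minutes of the oldest
third = fifo.insert(350.0, 1)   # element at 0.0 is now more than 300 s old
```

In this run `third` holds the single removed element (0.0, 1) and two elements remain in the FIFO. Returning the removed elements from `insert` lets a caller update per-region counters on each extraction.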
Upon each insertion of an operation element 402 into delay FIFO 500, counter system 700 increments a selected counter among current counters 710 indexed to the region of the associated read operation. In the example shown in
Upon each extraction of an operation element 402 from delay FIFO 500, statistics counters system 700 decrements a selected one of current counters 710 corresponding with the region of the extracted region operation element 402.
For example, the read operation 302A with timestamp A is mapped to region 1 as region operation element 402A. At a time after time A, set by the “time length” and given as 5 minutes in this example, the operation element 402A is extracted/deleted from delay FIFO 500 causing region 1 current counter 710A to decrement.
Prior to the counter decrementing, system 700 compares the region 1 current counter 710A with a corresponding region 1 highest counter 720A. Highest counter 720A is part of a parallel set of counters, Highest Counters 720, that maintain the highest watermarks for Current Counters 710. When region 1 current counter 710A is higher than the region 1 highest counter 720A, the value in current counter 710A replaces the value in region 1 highest counter 720A. In this manner, the highest counters 720 maintain the highest number of read operations that occurred within each region during the 5 minute time window/delay within FIFO 500.
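The interaction between the current counters and the highest counters can be sketched as two per-region tallies updated on FIFO insertion and extraction. The class and method names below are hypothetical; the sketch assumes the caller reports each insertion and extraction event.

```python
from collections import defaultdict


class RegionCounters:
    """Per-region current counts and high watermarks (illustrative names)."""

    def __init__(self):
        self.current = defaultdict(int)  # plays the role of current counters 710
        self.highest = defaultdict(int)  # plays the role of highest counters 720

    def on_insert(self, region: int) -> None:
        # An element for this region entered the delay FIFO.
        self.current[region] += 1

    def on_extract(self, region: int) -> None:
        # Compare before decrementing so the peak within the window is kept.
        if self.current[region] > self.highest[region]:
            self.highest[region] = self.current[region]
        self.current[region] -= 1


counters = RegionCounters()
for _ in range(5):
    counters.on_insert(1)  # five reads to region 1 inside the time window
counters.on_extract(1)     # the oldest of them leaves the FIFO
```

After this sequence the highest count for region 1 is 5 and the current count is 4. Both tallies are fixed-size per region, so the memory used does not grow with the number of reads observed.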
Referring also to
For example, the output stream 400 shows operation elements 402A and 402E associated with reads to region 1 and occurring at times A and E, respectively. The operation elements 402A and 402E indicate back to back reads to region 1 that have a time delay of 15 seconds. This total time delay between the reads associated with elements 402A and 402E is shown calculated underneath difference counters 740. Upon extraction of the time A for operation element 402A from FIFO 500 in
Upon extraction of the operation element 402E from FIFO 500 identifying time E and region 1, the time difference between A and E (15 seconds) is determined by timestamp difference mapping system 800 to be within a time range in configuration 810 that corresponds to an index value of 4 (range between 5.0 seconds to 60 seconds).
Time difference counters 740 are then updated by the prefetch controller 18 by incrementing the difference counter 740A that corresponds with region 1 and time index 4. The total number of time difference counters 740 is equal to the product of the number of regions in configuration register 200 and the number of time indices/ranges in configuration 810. The number and exact values of comparison time ranges in configuration 810 are determined through experimentation and analysis of sample data.
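The mapping from a time difference to a range index, and the resulting counter update, can be sketched as follows. Only the 5.0 to 60 second range for index 4 and the 0.5 second shortest interval come from the description; the other boundaries in `TIME_BOUNDS` are hypothetical placeholders for configuration 810. For simplicity the sketch computes differences as reads are observed rather than upon extraction from the FIFO.

```python
import bisect
from collections import defaultdict

# Hypothetical stand-in for configuration 810: upper bounds, in seconds, of
# time ranges 1..4; anything above the last bound falls into index 5.
# Index 1 is under 0.5 s and index 4 is 5.0 to 60 s; the rest are assumed.
TIME_BOUNDS = [0.5, 1.0, 5.0, 60.0]


def time_index(delta_seconds: float) -> int:
    """Return the 1-based index of the range containing delta_seconds."""
    return bisect.bisect_right(TIME_BOUNDS, delta_seconds) + 1


class TimeDifferenceCounters:
    """Counts back to back reads per (region, time index) pair."""

    def __init__(self):
        self.last_seen = {}             # region -> timestamp of previous read
        self.counts = defaultdict(int)  # (region, time index) -> count

    def observe(self, timestamp: float, region: int) -> None:
        previous = self.last_seen.get(region)
        if previous is not None:
            self.counts[(region, time_index(timestamp - previous))] += 1
        self.last_seen[region] = timestamp


diffs = TimeDifferenceCounters()
diffs.observe(0.0, 1)   # read at time A
diffs.observe(15.0, 1)  # read at time E, 15 seconds later
```

The 15 second gap falls in index 4, so the counter for region 1 and time index 4 is incremented once. The number of possible keys is bounded by the number of regions multiplied by the number of time ranges.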
For example, co-pending U.S. patent application Ser. No. 12/605,119, entitled: STORAGE DEVICE PREFETCH SYSTEM USING DIRECTED GRAPH CLUSTERS, filed Oct. 23, 2009 and U.S. patent application Ser. No. 12/619,609, entitled: CLUSTER CONTROL PROTOCOL, filed Nov. 16, 2009 describe schemes for determining the values used in configuration register 200 and configuration 810. Of course other static or dynamic techniques can also be used to determine the values in configuration register 200 and time difference configuration 810.
Thus, the above prefetch controller 18 and the counters in
The prefetch controller 18 can also maintain average size counters 750 for each of the different storage regions identified in configuration register 200. The prefetch controller 18 maintains counters 752 that track the number of reads to each of the different regions in register 200 and counters 754 that track the total number of blocks read from each of the different regions identified in register 200. The prefetch controller 18 can then derive an average read size value in registers 756 for each of the different regions in register 200 by dividing the total number of blocks identified in counters 754 by the corresponding total number of reads identified in counters 752.
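The derivation of the average read size can be sketched with two running totals per region. The class and method names are hypothetical; returning zero for a region with no recorded reads is an assumption made to avoid dividing by zero.

```python
from collections import defaultdict


class AverageSizeCounters:
    """Running read and block totals per region (illustrative names)."""

    def __init__(self):
        self.reads = defaultdict(int)   # plays the role of counters 752
        self.blocks = defaultdict(int)  # plays the role of counters 754

    def record(self, region: int, length: int) -> None:
        self.reads[region] += 1
        self.blocks[region] += length

    def average(self, region: int) -> float:
        # Corresponds to the value held in register 756 for the region.
        reads = self.reads.get(region, 0)
        if reads == 0:
            return 0.0  # assumed behavior when no reads have been recorded
        return self.blocks[region] / reads


sizes = AverageSizeCounters()
for _ in range(5):
    sizes.record(region=1, length=20)  # five reads of 20 blocks each
```

Here `sizes.average(1)` is 100 blocks divided by 5 reads, or 20.0 blocks per read.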
The average read size in register 756 can then be used to determine how much data to prefetch from an associated region of disk 20. For example, the highest count 720 for a particular region may be multiplied by the average block read size in register 756. The resulting value would be the amount of data that is prefetched from a particular region. This is described in more detail below in
It has been determined that prefetching is highly effective when two conditions are met:
1) The entire region is read completely during a pattern of activity
2) The entire region is read quickly during a pattern of activity
Accordingly, to determine prefetch suitability of a region, the highest counters 720 and time difference counters 740 for each region are examined.
The time difference counter 740B for region 1 records eight back to back reads that happen within 0.5 seconds of each other. This corresponds to the four back to back reads between time t1 and t5, and the four back to back reads between time t6 and t10. Accordingly, counter 740B is incremented by the prefetch controller to a value of eight.
There is one back to back read that takes between 5 and 60 seconds. This corresponds to the time between the read of blocks 81-100 at time t5 and the read of blocks 0-20 at time t6. Accordingly, counter 740A was incremented to a value of one.
Two pattern traits typically indicate that a region should not be prefetched. The first is that the back to back reads are too far apart in time. The second is that not enough back to back read operations occur within a given time period, such as the example five minute period that data can be stored in the tiering media 16.
To account for both of these conditions, a back to back read ratio is determined by first dividing the value 8 in the shortest time interval difference counter 740B by the highest count value 5 in the highest counter 720A for region 1 (e.g., 8÷5=1.6). This provides a ratio for the number of substantially sequential back to back read operations within the 5 minute time window. To determine if too many reads to region 1 are too far apart in time, the read ratio 1.6 is compared with a sum of all of the other counters 740 associated with region 1. This identifies the number of all other back to back reads that were spaced apart in time more than the 0.5 seconds associated with time difference counter 740B.
In this example there is only one additional back to back read operation recorded in counter 740A. This indicates that only one set of two read operations in region 1 were spaced apart more than 0.5 seconds. In this example, the two back to back reads were between 5.0 seconds and 60 seconds apart.
If the read ratio between the value in counter 740B and the value in counter 720A is larger than the sum of the values in the remaining counters 740 associated with region 1, then the region qualifies for prefetching. The prefetching analysis can be summarized as follows:
(Number of Back to Back Reads With Lowest Time Interval÷Largest Number of Reads Within Time Window)≧Sum of Remaining Back to Back Reads With Higher Time Intervals=prefetch.
In this example, since 1.6>1, the prefetch controller 18 identifies region 1 for prefetching. In an alternate embodiment, the ratio of counter 740B and counter 720A need only be greater than or within some margin of the sum of values of remaining counters 740. This margin can be obtained experimentally or programmed through configuration.
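The qualification test can be sketched as a single comparison. This is one possible reading of the analysis, not a definitive implementation: the function name, the `margin` parameter (modeling the alternate embodiment), and the decision to reject a region whose highest count is zero are all assumptions.

```python
from typing import Iterable


def qualifies_for_prefetch(
    shortest_interval_count: int,
    highest_count: int,
    other_interval_counts: Iterable[int],
    margin: float = 0.0,
) -> bool:
    """Return True when a region's read pattern qualifies for prefetching.

    shortest_interval_count: back to back reads in the lowest time range.
    highest_count: largest number of reads seen within the time window.
    other_interval_counts: back to back reads in all higher time ranges.
    margin: assumed tolerance allowing the ratio to fall slightly short.
    """
    if highest_count == 0:
        return False  # assumed: no reads observed, nothing to prefetch
    ratio = shortest_interval_count / highest_count
    return ratio + margin >= sum(other_interval_counts)


# Region 1 from the example: 8 fast reads, highest count 5, 1 slower read.
decision = qualifies_for_prefetch(8, 5, [1])
```

For the example values the ratio is 8÷5=1.6, which is at least 1, so `decision` is True and the region is identified for prefetching.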
As mentioned above, the entire region may not necessarily be prefetched whenever there is a read to region 1. The value in average read size register 756 for region 1 identifies an average read size of 20 blocks. The prefetch controller 18 multiplies the highest count value in counter 720A for region 1 by the average read size value in register 756 for region 1 to derive the prefetch size. In this example the highest count value=5 and the average read size value=20. Accordingly, the prefetch size is determined by the prefetch controller to be 5×20=100. Accordingly, blocks 0-100 are fetched from disk 20 when a client 10 reads blocks 0-20 in region 1.
In this example, the entire region 1 is prefetched. However, assume that region 1 is 200 blocks and includes block 0 through block 200. Also assume that the highest count value in counter 720A is still=5 and the average read size is still 20. In this example, the prefetch size is still 100 blocks. A read to blocks 0-20 would still cause the prefetch controller 18 to prefetch blocks 0-100. However, a read to blocks 101-120 would cause the prefetch controller 18 to prefetch blocks 101-200. Thus, blocks are prefetched starting from the address of the last block in the monitored read operation from client 10. Subsequent reads to region 1 will then be serviced by the storage system 14, if possible, using the data prefetched into tiering memory 16.
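The prefetch size and resulting block range can be sketched as follows. The function name is hypothetical, and two assumptions are made to reproduce the ranges given in the examples: the prefetch begins at the starting block of the triggering read, and it is clamped so that it does not run past the end of the region.

```python
from typing import Tuple


def prefetch_range(
    read_start: int,
    highest_count: int,
    average_read_size: int,
    region_end: int,
) -> Tuple[int, int]:
    """Return the (first, last) block addresses to prefetch.

    The size is the highest count multiplied by the average read size.
    Clamping to region_end is an assumption of this sketch.
    """
    size = highest_count * average_read_size
    return (read_start, min(read_start + size, region_end))


# Region 1 spanning blocks 0-200, highest count 5, average read size 20.
first_half = prefetch_range(0, 5, 20, region_end=200)
second_half = prefetch_range(101, 5, 20, region_end=200)
```

With these inputs a read starting at block 0 gives the range (0, 100) and a read starting at block 101 gives (101, 200), matching the two cases described above.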
This is just one example of a possible scheme for determining which read patterns qualify for prefetching and how much data to prefetch. For example, the read ratio may not necessarily have to be equal or larger than the summation of the values in the other time difference counters 740 and may just need to be within some range of the summation. It should also be understood that all of the preprogrammed values used for analyzing the storage access patterns may be reconfigurable.
In this example, the read pattern is analyzed as follows:
(Number of Back to Back Reads With Shortest Time Interval (1)÷Largest Number of Reads Within Time Window (3))<Sum of Remaining Back to Back Reads With Higher Time Intervals (2+1)=No Prefetch.
Since the ratio of the number of back to back read operations with the shortest time interval to the highest count value (1÷3=0.333) is substantially smaller than the sum of the remaining back to back reads with larger time intervals (2+1=3), this read pattern is determined not to qualify for prefetching. Thus in this example, region 1 will not be prefetched into tiering memory 16 by the prefetch controller 18 and reads to region 1 will be accessed from disk 20.
The scheme described above can quickly and easily compute the current counters 710, highest counters 720, and time difference counters 740 and does not grow in complexity or space requirement over time. The detection of regions with high prefetch potential is also computationally easy to derive from the contents of the counters.
In practice, the storage system 14 can be programmed to prefetch any address range for a region upon the first read to a particular region. Timeouts to the tiering media 16 can be set aggressively when the time difference counters 740 indicate fast sequential access. This additionally improves tiering performance since contents in tiering memory 16 can be quickly removed after the prefetch operation has supplied data for all the reads in a burst period. This optimization allows faster recovery and reuse of tiering resources and is based on the knowledge from time difference counters 740 that the periods of read bursts are intense but far between in time.
The system described above can use dedicated processor systems, micro controllers, programmable logic devices, or microprocessors that perform some or all of the operations. Some of the operations described above may be implemented in software and other operations may be implemented in hardware.
For the sake of convenience, the operations are described as various interconnected functional blocks or distinct software modules. This is not necessary, however, and there may be cases where these functional blocks or modules are equivalently aggregated into a single logic device, program or operation with unclear boundaries. In any event, the functional blocks and software modules or features of the flexible interface can be implemented by themselves, or in combination with other operations in either hardware or software.
Having described and illustrated the principles of the invention in a preferred embodiment thereof, it should be apparent that the invention may be modified in arrangement and detail without departing from such principles. We/I claim all modifications and variations coming within the spirit and scope of the following claims.
Number | Date | Country | |
---|---|---|---|
61144404 | Jan 2009 | US | |
61144395 | Jan 2009 | US |