In many storage systems, file system software supports a feature called Read Ahead. Most file system read access patterns are sequential: numerous empirical studies over the past three decades have shown that the large majority of read accesses (often well over 90%) are sequential. Special code, called Read Ahead, takes advantage of this access pattern. The file system software keeps track of where in a file the client thread last executed a read operation. When the thread next executes a read, the file system software checks whether the latest read sequentially follows the previous read. If it does, the file system knows that the thread is performing sequential reads. Because reading data from disk or other storage memory is very slow compared to activity in main memory, the file system automatically reads in data that sequentially follows the data just requested. In some current file systems, the amount of additional data read is a fixed amount. The next time the client thread executes a file read operation, the file system can satisfy it using data that is either already in memory or already on its way to main memory, which significantly improves performance. Whenever the file system detects that a read request is not sequential, it either stops reading additional information or reduces the read ahead to a small amount. The fundamental problem with all current forms of Read Ahead is that the technology does not recognize whether the Read Ahead feature is effective.
Prior solutions use a fixed maximum value for the amount of data to read ahead, but no single value is best for all platforms and workloads. One approach is to guess or empirically try a few values and choose one that seems a reasonable compromise across the range of platforms the file system is intended to support. If a large value appropriate for a large system is chosen, read ahead consumes too much memory on smaller systems, which can cause those systems to fail. Thus, a read ahead value that is too small to achieve the best performance is typically chosen to avoid any failure. The old forms of read ahead achieved useful improvements in read performance, but much less than what was possible. Read operations tend to be the file system operation with the greatest impact on overall performance, and achieving optimal read performance is essential for maximizing file system performance.
It is within this context that the embodiments arise.
In some embodiments, a processor-based method for adaptive read ahead is provided. The method includes satisfying a read request with sequential reads from a page cache in a first memory and read ahead from a storage memory to the page cache in the first memory, and adjusting upward an amount of data to be read by a cycle of the read ahead, responsive to a determination that a desired page of data for the read request is not in the page cache.
In some embodiments, a tangible, non-transitory, computer-readable medium having instructions thereupon which, when executed by a processor, cause the processor to perform a method is provided. The method includes performing read ahead cycles, with an amount of data read from a storage memory to a page cache in a first memory in each of the read ahead cycles, to satisfy a read request having sequential reads from the page cache. The method includes determining a status of a desired page of data for the read request, and increasing the amount of data to be read in at least one subsequent read ahead cycle, as a result of determining the desired page is not in the page cache at a time of servicing a sequential read of the desired page.
In some embodiments, an adaptive read ahead system is provided. The system includes at least one processor and an adaptive read ahead module in cooperation with the at least one processor. The adaptive read ahead module has a data read ahead adjuster and a read ahead director. The read ahead director is configured to perform read ahead cycles, each read ahead cycle reading an amount of data from a storage memory to a page cache in a first memory, responsive to a read request having sequential reads from the page cache. The data read ahead adjuster is configured to determine whether a desired page for the read request is in the page cache and the data read ahead adjuster is configured to increase the amount of data for one or more read ahead cycles, responsive to determining that the desired page is not in the page cache.
Other aspects and advantages of the embodiments will become apparent from the following detailed description taken in conjunction with the accompanying drawings which illustrate, by way of example, the principles of the described embodiments.
The described embodiments and the advantages thereof may best be understood by reference to the following description taken in conjunction with the accompanying drawings. These drawings in no way limit any changes in form and detail that may be made to the described embodiments by one skilled in the art without departing from the spirit and scope of the described embodiments.
An adaptive read ahead system, described herein, determines whether read ahead is bringing in enough data to main memory, prior to data being read out of the main memory by sequential reads, to prevent stalling or excess delays. The system adjusts the amount of data to read ahead, in read ahead cycles, based on the whereabouts of the desired page of data that follows the current page being read out of main memory. If the desired page is found in a page cache in main memory, the amount of data to be read ahead in read ahead cycles is kept constant. If the desired page is not yet fully in the page cache and the read ahead value has not reached a maximum value, the amount of data to be read ahead in read ahead cycles is increased. By so adjusting the amount of data to read ahead, the adaptive read ahead system improves efficiency of the storage system and decreases latency of sequential reads.
If the desired page is neither in the page cache nor in transit, a read ahead cycle is triggered. Waiting for the sequential read to reach a point where there is no data in the page cache is undesirable, because that results in a stall. The read must wait a long time for the system to bring data from storage memory to the page cache. The present design attempts to avoid this problem. The read ahead value normally brings in more than one page. Instead of triggering a read ahead when the sequential read consumes all of the data in this set of pages being brought in by read ahead, the system triggers the next read ahead when the sequential read consumes only a small part of the pages that were or will be brought in by recent read ahead operations. Thus, the system reads in the next batch of pages, while reads consume pages brought into main memory by earlier read ahead operations.
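The early-trigger behavior described above can be sketched in a few lines. The following Python fragment is illustrative only and is not from the specification; the function name, batch size, and trigger fraction are all hypothetical choices, sketched under the assumption that pages are identified by sequential page numbers.

```python
# Illustrative sketch: trigger the next read ahead before the current
# batch of pages is consumed. All names and numbers are hypothetical.

def plan_readahead(batch_start, batch_pages, trigger_fraction=0.25):
    """Return (trigger_page, next_batch_start) for a read-ahead batch.

    The next read ahead fires when the sequential read reaches
    trigger_page, i.e. after consuming only a small part of the batch,
    so new pages arrive while earlier pages are still being read out
    of main memory.
    """
    trigger_page = batch_start + max(1, int(batch_pages * trigger_fraction))
    next_batch_start = batch_start + batch_pages
    return trigger_page, next_batch_start

# Example: a batch of 8 pages starting at page 100 fires the next
# read ahead once the reader reaches page 102, well before page 108.
```

With a trigger placed a fraction of the way into the batch, the storage I/O for the next batch overlaps with reads consuming the current batch, which is the stall-avoidance effect described above.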
Thus far in the description, the storage system 102 with read ahead has some similarities with known read ahead systems. However, the present adaptive read ahead module 110 uses a process and mechanism for an adjustable amount of data to read ahead 114, which differs from known read ahead systems. The adaptive read ahead module 110, in some embodiments, has a data read ahead adjuster 112 and a read ahead director 116. The read ahead director 116 performs read ahead cycles 118, each moving a next page 124 (depicted in dashed outline) from storage memory 126 to the page cache 122 in main memory 120. Status of desired pages (e.g., data that the storage system 102 seeks to read out for external I/O 128 in order to service a read request) is monitored by the data read ahead adjuster 112, which adjusts the amount of data to read ahead 114 under various conditions. The data read ahead adjuster 112 communicates the amount of data to read ahead 114 to the read ahead director 116, which then uses this value of the amount of data to read ahead 114 in the read ahead cycles 118. In some embodiments, the data read ahead adjuster 112 and read ahead director 116 are integrated. In some embodiments, the data read ahead adjuster 112 augments a standard read ahead module. And, in some embodiments the data read ahead adjuster 112 and read ahead director 116 replace a standard read ahead module. Operations of various embodiments of the read ahead director 116 and the data read ahead adjuster 112 are further described below with reference to
Obtaining the status of a desired page 206, the data read ahead adjuster 112 determines whether the desired page is entirely present in the page cache 122 (e.g., as a current page 132), is in the process of being read from the storage memory 126 to the page cache 122 in the main memory 120 (e.g., as the next page 124), or is neither in transit nor present in the page cache 122 (e.g., when there is not a read ahead cycle 118 transferring the next page 124 to the page cache 122). Based on these possibilities, the data read ahead adjuster 112 sets the value of the amount of data to read ahead 114. While the current read request is being serviced out of the page cache 122 in main memory 120 (e.g., reading out to the network 104 and/or the computer or user device 106), if the desired page is complete in the page cache 122 of main memory 120 (i.e., the desired page is the current page 132 in the page cache 122), the data read ahead adjuster 112 keeps the amount of data to read ahead 114 the same for the next read ahead cycle as in the previous read ahead cycle. In conceptual terms, the value setter 208 is kept at a constant setting for the amount of data to read ahead 114. This is because the read ahead cycles 118 are determined to be keeping up with the sequential reads of current pages 132 from the page cache 122.
While the current read request is being serviced out of the page cache 122, if the desired page is incomplete in the page cache 122, or absent from the page cache 122, but is in transit from the storage memory 126 to the page cache 122 (e.g., is being read using internal I/O 130 as the next page 124 from the storage memory 126 to the page cache 122, but is not yet completely in the page cache 122), the data read ahead adjuster 112 increases the amount of data to read ahead 114 for the next read ahead cycle as compared to the previous read ahead cycle. In conceptual terms, the value setter 208 for the amount of data to read ahead 114 is moved or adjusted upward. This is because the read ahead cycles 118 are determined to be not quite keeping up with the sequential reads of current pages 132 from the page cache 122, warranting the increase in the amount of data to read ahead 114. The new value of the amount of data to read ahead 114 may be kept for one or more subsequent read ahead cycles 118.
While the current read request is being serviced and the sequential read has consumed a specified amount of the pages that were or are about to be brought into the main memory 120 by the recent read ahead operations, if a desired page is not in the page cache 122 and is also not in transit from the storage memory 126 to the page cache 122, the data read ahead adjuster 112 cooperates with the read ahead director 116 to trigger a read ahead cycle 118. This is so as to get the desired page, which will become the next page 124, moving towards and into the page cache 122, for an upcoming read of a current page 132. Upon arrival in the page cache 122, the next page 124 may become the current page 132 for a read request (assuming the sequential reads are continued). In various embodiments, the amount of data to read ahead 114 could be left as is, or adjusted upward.
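The three-way adjustment policy described in the preceding paragraphs can be summarized as a small decision function. The sketch below is a hypothetical illustration, not the specification's implementation; the status constants, the doubling growth factor, and the function signature are assumptions made for clarity.

```python
# Illustrative sketch of the adjustment policy: the desired page's
# status drives the amount of data to read ahead. Names and the
# growth factor are hypothetical.

IN_CACHE, IN_TRANSIT, ABSENT = "in_cache", "in_transit", "absent"

def adjust_readahead(amount, status, max_amount, growth_factor=2):
    """Return (new_amount, trigger_now) for the next read-ahead cycle.

    - Desired page fully in the page cache: read ahead is keeping up,
      so the amount stays the same.
    - Desired page in transit (or incomplete in the cache): read ahead
      is not quite keeping up, so the amount grows, capped at max_amount.
    - Desired page absent and not in transit: a read ahead cycle is
      triggered immediately to avoid a stall; the amount may be left
      as is or also grown, per the embodiment.
    """
    if status == IN_CACHE:
        return amount, False
    if status == IN_TRANSIT:
        return min(amount * growth_factor, max_amount), False
    return amount, True  # absent: trigger a cycle now
```

For example, with a maximum of 64 pages, an in-transit page grows the amount from 4 to 8 pages, while an absent page leaves the amount unchanged but requests an immediate read ahead cycle.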
In some embodiments, the value setter 208 is constrained by an initial amount 202 and a maximum amount 204. These could be default values, predetermined or system-dependent. Or one or both of these could be variables. In one embodiment, the maximum amount 204 is set initially, but decreased if a low memory condition is detected in the main memory 120. In other words, the amount of data to read ahead 114 could be set anywhere between the initial amount 202 and the maximum amount 204, which could be adjusted downward in event of a low memory condition.
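The constraint between the initial amount 202 and the maximum amount 204, including the low-memory reduction of the ceiling, can be expressed as a simple clamp. The following sketch is hypothetical; the default values and the name of the low-memory ceiling are illustrative assumptions only.

```python
# Hypothetical sketch of constraining the read-ahead amount between an
# initial amount and a maximum amount, with the maximum adjusted
# downward under a low memory condition. Numbers are illustrative.

def clamp_readahead(amount, initial=4, maximum=64,
                    low_memory=False, low_memory_maximum=8):
    """Clamp the read-ahead amount; shrink the ceiling when memory is low."""
    ceiling = low_memory_maximum if low_memory else maximum
    return max(initial, min(amount, ceiling))
```

In this sketch an amount of 100 pages is clamped to 64 normally, but to 8 when a low memory condition is detected, mirroring the downward adjustment of the maximum amount 204 described above.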
With reference to
Read ahead is a heuristic. In other words, the read ahead operation does not have to be correct 100% of the time. Various algorithms can be wrong without causing any failures. In some embodiments, an algorithm is useful as long as it is correct most of the time. The Adaptive Read Ahead is correct the vast majority of the time and performance tests show a large performance improvement.
In some embodiments, the above test plays an important role. The adaptive read ahead can reliably determine whether read ahead is bringing data into memory in time to avoid costly waits and adjust the amount of data to read ahead accordingly. This dynamically adjusts the amount of read ahead. The appropriate amount of data to read ahead varies across hardware platforms. Many hardware factors may affect the read ahead setting. These may include, but are not limited to: CPU processing speed, main memory access speed, and storage subsystem speed. The appropriate amount of data to read ahead may also vary depending upon the system load. Various embodiments dynamically adjust to the needs of the system. The above acts as a feedback mechanism that decreases read latency in a storage system. This improves system performance (especially, read performance in a storage system) by eliminating or significantly reducing the amount of time a client thread waits for data delivered by a file system read operation (i.e., improves read latency). Furthermore, embodiments can avoid being overly aggressive with read ahead, and thus avoid consuming main memory that is not required for good performance. The adaptive read ahead mechanism starts with a small value for the amount of data to read ahead, and rapidly grows until a good setting is determined for the amount of data to read ahead. The adaptive read ahead mechanism thus avoids using a setting that is either too high or too low for the amount of data to read ahead.
Similar to known systems, some embodiments of the adaptive read ahead module 110 check for sequential versus non-sequential reads, and if the system detects that the read request is not sequential, the system stops the read ahead cycles 118 or decreases the amount of data to read ahead 114. The file system adaptive read ahead mechanism automatically adjusts the amount of data to read ahead 114 in order to eliminate, or at least minimize, the amount of time that a client thread must wait for the data delivered by a file system read request. The file system adaptive read ahead mechanism significantly improves file system performance by eliminating, or at least minimizing, the amount of waiting involved in file system read operation. Note that file system read performance is a very important factor in overall file system performance. The file system adaptive read ahead mechanism eliminates the need for extensive empirical tests that are often used to choose a value for the amount of data to read ahead. An important aspect of the file system adaptive read ahead is that it uses information about whether read ahead has succeeded in bringing data into memory prior to the client thread requesting the data. This information is important for correctly dynamically adjusting the amount of data to bring into main memory via read ahead. This mechanism or process can be used in most modern operating systems.
In the decision action 304, it is determined whether a desired page is in the page cache. If the answer is yes, the desired page is in the page cache, flow proceeds to the decision action 308. If the answer is no, the desired page is not in the page cache, flow proceeds to the action 306. In the decision action 308, it is determined whether there is a page identified as the page to trigger a read ahead. If the answer in the decision action 308 is no, there is not yet a page identified as the page to trigger a read ahead, flow branches back to the decision action 302, to see if there is a sequential read of a file in progress. If the answer in the decision action 308 is yes, there is a page identified as the page to trigger a read ahead, then flow proceeds to the action 316, to trigger a read ahead.
In the action 306, arrived at because the desired page is not in the page cache, the amount of data to read ahead is increased, but not so as to exceed an upper limit. Flow proceeds to the decision action 310 where it is determined whether there is a low memory condition. If there is not a low memory condition, flow proceeds to the decision action 314. If there is a low memory condition, flow proceeds to the action 312. In the action 312, the upper limit on the amount of data to read ahead is set at a specified value, or decreased. The resultant specified or decreased value should be appropriate to the low memory condition.
Flow proceeds to the decision action 314 where it is determined whether there is a page already being read in from storage. In some versions, this is the same desired page referred to in the decision action 304. In other versions, this is a different (e.g., newer or more recent) desired page. If the answer in the decision action 314 is yes, the page is already being read in from storage, flow branches back to the decision action 302, to see whether there is a sequential read of a file in progress. If the answer in the decision action 314 is no, the page is not already being read in from storage, flow proceeds to the action 316, to trigger a read ahead. After the action 316, flow proceeds back to the decision action 302, to determine whether a sequential read of a file is in progress.
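The flow of decision actions 302 through 316 can be sketched as a single pass over a small state record. This is a hypothetical illustration of the flow described above, not the specification's implementation; the dictionary keys, the doubling of the amount, and the returned action labels are all assumptions, with comments mapping each branch to the numbered actions.

```python
# Compact sketch of the decision flow (actions 302-316 above). The
# state model, key names, and growth rule are hypothetical.

def readahead_step(state):
    """One pass of the flow. `state` is a dict with boolean keys
    sequential, page_in_cache, trigger_page_reached, low_memory,
    page_in_transit, plus integer keys amount, upper_limit,
    low_memory_limit. Returns the action taken."""
    if not state["sequential"]:                       # decision 302
        return "none"
    if state["page_in_cache"]:                        # decision 304
        # decision 308: trigger only if the trigger page was reached
        return "trigger" if state["trigger_page_reached"] else "none"
    # action 306: increase the amount, not exceeding the upper limit
    state["amount"] = min(state["amount"] * 2, state["upper_limit"])
    if state["low_memory"]:                           # decisions 310, 312
        state["upper_limit"] = state["low_memory_limit"]
        state["amount"] = min(state["amount"], state["upper_limit"])
    if state["page_in_transit"]:                      # decision 314
        return "grow"                                 # back to 302
    return "grow+trigger"                             # action 316
```

A caller would invoke this once per serviced read; when the result includes "trigger", the read ahead director starts the next cycle with the (possibly increased) amount.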
It should be appreciated that the methods described herein may be performed with a digital processing system, such as a conventional, general-purpose computer system. Special purpose computers, which are designed or programmed to perform only one function, may be used in the alternative.
Display 411 is in communication with CPU 401, memory 403, and mass storage device 407, through bus 405. Display 411 is configured to display any visualization tools or reports associated with the system described herein. Input/output device 409 is coupled to bus 405 in order to communicate information in command selections to CPU 401. It should be appreciated that data to and from external devices may be communicated through the input/output device 409. CPU 401 can be defined to execute the functionality described herein to enable the functionality described with reference to
Detailed illustrative embodiments are disclosed herein. However, specific functional details disclosed herein are merely representative for purposes of describing embodiments. Embodiments may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
It should be understood that although the terms first, second, etc. may be used herein to describe various steps or calculations, these steps or calculations should not be limited by these terms. These terms are only used to distinguish one step or calculation from another. For example, a first calculation could be termed a second calculation, and, similarly, a second step could be termed a first step, without departing from the scope of this disclosure. As used herein, the term “and/or” and the “/” symbol includes any and all combinations of one or more of the associated listed items.
As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “includes”, and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Therefore, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
With the above embodiments in mind, it should be understood that the embodiments might employ various computer-implemented operations involving data stored in computer systems. These operations are those requiring physical manipulation of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Further, the manipulations performed are often referred to in terms, such as producing, identifying, determining, or comparing. Any of the operations described herein that form part of the embodiments are useful machine operations. The embodiments also relate to a device or an apparatus for performing these operations. The apparatus can be specially constructed for the required purpose, or the apparatus can be a general-purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general-purpose machines can be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
A module, an application, a layer, an agent or other method-operable entity could be implemented as hardware, firmware, or a processor executing software, or combinations thereof. It should be appreciated that, where a software-based embodiment is disclosed herein, the software can be embodied in a physical machine such as a controller. For example, a controller could include a first module and a second module. A controller could be configured to perform various actions, e.g., of a method, an application, a layer or an agent.
The embodiments can also be embodied as computer readable code on a tangible non-transitory computer readable medium. The computer readable medium is any data storage device that can store data, which can be thereafter read by a computer system. Examples of the computer readable medium include hard drives, network attached storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion. Embodiments described herein may be practiced with various computer system configurations including hand-held devices, tablets, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers and the like. The embodiments can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a wire-based or wireless network.
Although the method operations were described in a specific order, it should be understood that other operations may be performed in between described operations, described operations may be adjusted so that they occur at slightly different times or the described operations may be distributed in a system which allows the occurrence of the processing operations at various intervals associated with the processing.
In various embodiments, one or more portions of the methods and mechanisms described herein may form part of a cloud-computing environment. In such embodiments, resources may be provided over the Internet as services according to one or more various models. Such models may include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). In IaaS, computer infrastructure is delivered as a service. In such a case, the computing equipment is generally owned and operated by the service provider. In the PaaS model, software tools and underlying equipment used by developers to develop software solutions may be provided as a service and hosted by the service provider. SaaS typically includes a service provider licensing software as a service on demand. The service provider may host the software, or may deploy the software to a customer for a given period of time. Numerous combinations of the above models are possible and are contemplated.
Various units, circuits, or other components may be described or claimed as “configured to” perform a task or tasks. In such contexts, the phrase “configured to” is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs the task or tasks during operation. As such, the unit/circuit/component can be said to be configured to perform the task even when the specified unit/circuit/component is not currently operational (e.g., is not on). The units/circuits/components used with the “configured to” language include hardware—for example, circuits, memory storing program instructions executable to implement the operation, etc. Reciting that a unit/circuit/component is “configured to” perform one or more tasks is expressly intended not to invoke 35 U.S.C. 112, sixth paragraph, for that unit/circuit/component. Additionally, “configured to” can include generic structure (e.g., generic circuitry) that is manipulated by software and/or firmware (e.g., an FPGA or a general-purpose processor executing software) to operate in a manner that is capable of performing the task(s) at issue. “Configured to” may also include adapting a manufacturing process (e.g., a semiconductor fabrication facility) to fabricate devices (e.g., integrated circuits) that are adapted to implement or perform one or more tasks.
The foregoing description, for the purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical applications, to thereby enable others skilled in the art to best utilize the embodiments and various modifications as may be suited to the particular use contemplated. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.