The field relates to optimization of data writes to file systems. More precisely, the field relates to optimization of data writes to file systems according to the current load of processing data.
There are various modules within complex computer applications. Some of these modules are responsible for managing how, when and where transaction information is to be persisted on the file system of a computer system. Sometimes the computer system will operate under high load, which means that many transactions will be concurrently committed, rolled back and/or recovered. All these transactions may be referred to as “passing” through the system. If the number of these transactions is large, writing the data to the file system for each transaction separately will degrade system performance, potentially even to the point of causing system crashes. Moreover, the file system is a single shared resource; therefore, if transaction data is serialized separately for each transaction, the file system becomes a bottleneck. One solution for this case is to collect batches of data modifications and write them to the file system together, in a single write operation. In this way, the number of write operations decreases and the system is able to better utilize the CPU and other resources. The second case is when the system is only lightly loaded with “passing” transactions. Then there is no benefit in collecting transaction data in batches and writing the data for more than one transaction in a single write operation, and there is a cost: the system could crash due to external factors, and the information that has been collected in the volatile memory of the system but not yet written may be lost. In other words, flushing file writes in batches optimizes performance at the cost of a higher probability of data loss in case of a system crash, which means an optimal solution switches between these two options depending on the current load of “passing” transactions. The load of real-life systems varies constantly between the two borderline cases. Therefore the system should be able to handle both cases appropriately and should be able to switch instantly between them.
Various embodiments of systems and methods for file system transaction log flush optimization are described herein. In one embodiment, the method includes monitoring a current load of transaction information to be persisted on the file system, the transaction information comprising a set of records committed to the file system. The method also includes collecting the set of records at a buffer and flushing the buffer to the file system when the current load of the transaction information is above a predefined limit indicating high transaction load. The method further includes writing the committed set of records one by one without a mediation of the buffer when the current load of the transaction information is below a predefined limit indicating low transaction load.
In other embodiments, the system includes at least one processor for executing program code and memory, a file system repository, and a set of transaction records to be persisted on the file system repository. The system also includes a monitoring module within the memory, the monitoring module to monitor a current load of transaction records to be persisted on the file system repository and a collector module within the memory, the collector module to collect the set of transaction records at a buffer. The system further includes a flusher module within the memory, the flusher module to flush the buffer to the file system repository when the current load of the transaction records is above a predefined limit indicating high transaction load and a writer module within the memory, the writer module to write the set of records one by one without a mediation of the buffer when the current load of transaction records is below a predefined limit indicating low transaction load.
These and other benefits and features of embodiments of the invention will be apparent upon consideration of the following detailed description of preferred embodiments thereof, presented in connection with the following drawings.
The claims set forth the embodiments of the invention with particularity. The invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. The embodiments of the invention, together with its advantages, may be best understood from the following detailed description taken in conjunction with the accompanying drawings.
Embodiments of techniques for file system transaction log flush optimization are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment”, “this embodiment” and similar phrases, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of these phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
To avoid this suboptimal behavior, active records 132 and compensation records 134 are also stored in the in-memory cache 140 for a certain period of time, so that the records are buffered and then written to the TLOG file 130 in a single flush operation. This is done in order to optimize the usage of system resources (main memory, CPU). The moment for this flush is dynamically determined by the optimizer 120 and depends on transaction load. The factors taken into account for the flush moment are the number of “empty” flushes into the TLOG file 130; the timeout between two consecutive flushes; the maximum number of records stored in the in-memory cache 140 before being written to the TLOG file 130; and the number of records written in the previous flush operation (called the last flush size). The first three factors can be configured depending on the system characteristics (hard disk performance, CPU, main memory available, etc.). An empty flush is defined as a flush in which 0 or 1 records are written to the TLOG file 130. If the number of consecutive empty flushes exceeds some constant value (10 for example), this indicates low transaction load and the optimizer 120 is switched off, because the overhead it incurs no longer pays off. In this mode, writing to the TLOG file 130 is done as soon as a transaction is prepared or completed, without buffering of records in main memory. The optimizer 120 is switched on again if a second transaction is successfully prepared while another record is being flushed to the TLOG file 130. Otherwise, this successfully prepared transaction (and following ones) would have to wait for the completion of the ongoing flush, delaying their execution.
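The switching rule described above can be sketched as a small state holder. This is an illustrative sketch only, not the described embodiment's code: the class name `EmptyFlushMonitor` and its method names are hypothetical, and the limit of 10 is the example value given above.

```python
class EmptyFlushMonitor:
    """Tracks consecutive empty flushes to decide whether buffering pays off.

    Hypothetical helper for illustration; names are not from the source.
    """

    def __init__(self, empty_flush_limit=10):
        self.empty_flush_limit = empty_flush_limit
        self.consecutive_empty = 0
        self.optimizer_on = True

    def record_flush(self, records_written):
        # An empty flush is one that writes 0 or 1 records.
        if records_written <= 1:
            self.consecutive_empty += 1
        else:
            self.consecutive_empty = 0
        # Exceeding the limit indicates low load: switch buffering off.
        if self.consecutive_empty > self.empty_flush_limit:
            self.optimizer_on = False

    def record_contention(self):
        # A second transaction was prepared while another record was
        # still being written: load is rising, so switch buffering on.
        self.optimizer_on = True
        self.consecutive_empty = 0


monitor = EmptyFlushMonitor(empty_flush_limit=10)
for _ in range(11):
    monitor.record_flush(1)      # eleven empty flushes in a row
print(monitor.optimizer_on)      # False: direct writes from now on
monitor.record_contention()
print(monitor.optimizer_on)      # True: buffering resumes
```

The counter is reset by any non-empty flush, so only an unbroken run of empty flushes switches the optimizer off.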
In one embodiment, two events can trigger a flush: buffer overflow, that is, exceeding the maximum number of records that can be stored in the in-memory cache 140, or exceeding the timeout between two consecutive flushes (25 milliseconds for example). When a flush is triggered, the whole content of the buffer is written to the TLOG file 130 and the buffer is “emptied”. While emptying the buffer, no physical memory is released; only the index in the buffer for the next transaction record to be stored is reset to 0, which means subsequent records overwrite the old ones. In one embodiment, the default value of this timeout is chosen to correspond to an upper bound of the minimum latency between two consecutive I/O operations on a contemporary hard disk. If this timeout is too small, the hard disk will be unable to execute the flush operations in real time and will start some low-level buffering. If the timeout is too big, the risk of losing data due to a system crash increases. In one embodiment, the default size of the buffer is set to 100 records; it is recommended that the size correspond to the maximum number of application threads running in parallel.
Besides the configurable properties, in one embodiment, the optimizer 120 is self-adapting to the transaction load and system performance, and may decide to flush the buffer before either of the two above-described configurable limits is reached. The optimizer 120 maintains a dynamically calculated flush trigger parameter, the “next flush size limit”. When the number of pending records in the buffer exceeds this parameter, a flush is triggered. The parameter is calculated upon each flush by the formula:
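A minimal sketch of the two flush triggers and the index-reset behavior described above follows. The class `RecordBuffer`, the `sink` callable (a stand-in for the write to the TLOG file) and the injectable `clock` are assumptions made for illustration; the capacity of 100 and the 25 millisecond timeout are the example defaults from the text.

```python
import time


class RecordBuffer:
    """Preallocated record buffer flushed on overflow or timeout (illustrative)."""

    def __init__(self, sink, capacity=100, timeout_s=0.025, clock=time.monotonic):
        self.sink = sink                  # hypothetical: callable receiving a list of records
        self.slots = [None] * capacity    # allocated once, reused after every flush
        self.next_index = 0               # position for the next record
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_flush = clock()

    def add(self, record):
        self.slots[self.next_index] = record
        self.next_index += 1
        if self.next_index == len(self.slots):   # trigger 1: buffer is full
            self.flush()

    def tick(self):
        # Meant to be called periodically; trigger 2: timeout exceeded.
        if self.clock() - self.last_flush > self.timeout_s:
            self.flush()

    def flush(self):
        self.sink(self.slots[:self.next_index])
        # No memory is released: only the index is reset, so subsequent
        # records overwrite the old ones in place.
        self.next_index = 0
        self.last_flush = self.clock()


written = []
buf = RecordBuffer(written.append, capacity=3)
for rec in ("a", "b", "c"):
    buf.add(rec)
print(written)   # [['a', 'b', 'c']]
```

Passing the clock as a parameter is a design choice of this sketch that makes the timeout path easy to exercise without real delays.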
nextFlushSizeLimit=2*currentFlushSize−lastFlushSize,
where
currentFlushSize is the number of pending transaction records to be flushed currently and
lastFlushSize is the number of transaction records written with the previous flush. The formula is based on the assumption that the load varies linearly between consecutive flushes.
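As a worked illustration of the formula, the limit is the current flush size plus the change since the previous flush, which is a linear extrapolation. The function names below are hypothetical. The clamped variant is an assumption added for illustration only: the text does not say how a non-positive result, or one above the buffer capacity, is handled.

```python
def next_flush_size_limit(current_flush_size, last_flush_size):
    """nextFlushSizeLimit = 2*currentFlushSize - lastFlushSize.

    Equivalent to current + (current - last): linear extrapolation.
    """
    return 2 * current_flush_size - last_flush_size


def clamped_limit(current_flush_size, last_flush_size, capacity=100):
    # ASSUMPTION (not from the source): keep the limit within [1, capacity].
    raw = next_flush_size_limit(current_flush_size, last_flush_size)
    return max(1, min(capacity, raw))


print(next_flush_size_limit(30, 20))   # rising load: 40
print(next_flush_size_limit(20, 30))   # falling load: 10
print(clamped_limit(5, 20))            # raw value is -10, clamped to 1
```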
During a flush, the buffer must be locked. In order to prevent blocking of other incoming transactions, the optimizer 120 maintains two buffers: one is active, and upon flush it is locked while the other becomes active, and so on in alternation. Since the TLOG file 130 grows with time, when its size reaches some predefined limit (8 MB for example), further flushes are locked, the in-memory cache 140 of the TLOG file 130 is recalculated (subtracting the compensation records 134 from the active records 132), a new TLOG file 130 is created and the active records 132 are flushed to it. Then the old TLOG file 130 is deleted and the flushes are unlocked.
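The two-buffer scheme can be sketched as follows. This is one possible arrangement under stated assumptions, not the embodiment's implementation: the class `DoubleBuffer`, the `sink` callable standing in for the TLOG write, and the use of two locks are all illustrative choices. File rotation is not covered.

```python
import threading


class DoubleBuffer:
    """Two alternating buffers so writers are not blocked by a slow flush."""

    def __init__(self, sink):
        self.sink = sink                       # hypothetical: callable receiving a list of records
        self.active = []                       # receives incoming records
        self.standby = []                      # empty, waiting for its turn
        self.swap_lock = threading.Lock()      # guards the active buffer and the swap
        self.flush_lock = threading.Lock()     # allows one flush at a time

    def add(self, record):
        with self.swap_lock:
            self.active.append(record)

    def flush(self):
        with self.flush_lock:
            with self.swap_lock:
                # Brief critical section: the buffers only exchange roles.
                self.active, self.standby = self.standby, self.active
                full = self.standby
            # The slow write happens outside swap_lock, so incoming
            # transactions keep appending to the new active buffer.
            self.sink(list(full))
            full.clear()


out = []
db = DoubleBuffer(out.append)
db.add("r1")
db.add("r2")
db.flush()
db.add("r3")
db.flush()
print(out)   # [['r1', 'r2'], ['r3']]
```

Serializing flushes with a separate lock ensures that the buffer being written is never handed back to writers before it has been cleared.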
At decision block 220, a check is performed to determine whether the current load of transaction information is above a predefined limit. If the load is above a predefined limit indicating high transaction load, at block 230 the set of records is collected at a buffer and the buffer is flushed to the file system. In one embodiment, the buffer is flushed when the buffer is full or a predefined timeout has passed. This means the flush is triggered either when the maximum number of records that may be stored in the buffer is reached or when a predefined timeout between two consecutive flushes is exceeded. In one embodiment, the flush is triggered by another event: reaching a flush size limit that is less than the maximum number of records that may be stored in the buffer. In one embodiment, the flush size limit is dynamically calculated. In yet another embodiment, the dynamically calculated flush size limit increases when the load of transaction information rises and decreases when the load of transaction information diminishes. In one embodiment, the dynamically calculated flush size limit increases or decreases linearly.
If, at block 220, the current load of transaction information is determined to be below a predefined limit indicating low transaction load, the method continues at block 240 with writing the committed set of records one by one without mediation of a buffer. In one embodiment, low transaction load is indicated by a predefined number of empty flushes of the buffer to the file system. In one embodiment, an empty flush is defined as a flush in which zero or one records are flushed.
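The decision at block 220 and the two branches at blocks 230 and 240 can be sketched as a single function. The function `persist`, the numeric `current_load` and `load_limit` parameters, and the `write` callable are hypothetical; the text does not fix how load is measured, and treating a load exactly equal to the limit as low load is an assumption of this sketch.

```python
def persist(records, current_load, load_limit, write):
    """Persist records in one batch under high load, else one by one.

    Returns the number of write operations issued.
    """
    if current_load > load_limit:
        # Block 230: collected records go out in a single write.
        write(list(records))
        return 1
    # Block 240: each record is written on its own, with no buffer.
    for record in records:
        write([record])
    return len(records)


log = []
print(persist(["r1", "r2", "r3"], current_load=50, load_limit=10, write=log.append))  # 1
print(persist(["r4", "r5"], current_load=2, load_limit=10, write=log.append))         # 2
print(log)   # [['r1', 'r2', 'r3'], ['r4'], ['r5']]
```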
When a low transaction load is indicated by the monitoring module 340, the writer module 380 writes the transaction records 330 one by one to the file system repository 370 without a mediation of the buffer 355. In one embodiment, a low transaction load is indicated by reaching a predefined number of empty flushes of the buffer 355 to the file system repository 370. In one embodiment, an empty flush is defined as a flush in which zero or one transaction records 330 are flushed by the flusher module 360 to the file system repository 370.
Some embodiments of the invention may include the above-described methods being written as one or more software components. These components, and the functionality associated with each, may be used by client, server, distributed, or peer computer systems. These components may be written in a computer language corresponding to one or more programming languages, such as functional, declarative, procedural, object-oriented, lower level languages and the like. They may be linked to other components via various application programming interfaces and then compiled into one complete application for a server or a client. Alternatively, the components may be implemented in server and client applications. Further, these components may be linked together via various distributed programming protocols. Some example embodiments of the invention may include remote procedure calls being used to implement one or more of these components across a distributed programming environment. For example, a logic level may reside on a first computer system that is remotely located from a second computer system containing an interface level (e.g., a graphical user interface). These first and second computer systems can be configured in a server-client, peer-to-peer, or some other configuration. The clients can vary in complexity from mobile and handheld devices, to thin clients and on to thick clients or even other servers.
The above-illustrated software components are tangibly stored on a computer readable storage medium as instructions. The term “computer readable storage medium” should be taken to include a single medium or multiple media that stores one or more sets of instructions. The term “computer readable storage medium” should be taken to include any physical article that is capable of undergoing a set of physical changes to physically store, encode, or otherwise carry a set of instructions for execution by a computer system which causes the computer system to perform any of the methods or process steps described, represented, or illustrated herein. Examples of computer readable storage media include, but are not limited to: magnetic media, such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer readable instructions include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using Java, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hard-wired circuitry in place of, or in combination with machine readable software instructions.
A data source is an information resource. Data sources include sources of data that enable data storage and retrieval. Data sources may include databases, such as, relational, transactional, hierarchical, multi-dimensional (e.g., OLAP), object oriented databases, and the like. Further data sources include tabular data (e.g., spreadsheets, delimited text files), data tagged with a markup language (e.g., XML data), transactional data, unstructured data (e.g., text files, screen scrapings), hierarchical data (e.g., data in a file system, XML data), files, a plurality of reports, and any other data source accessible through an established protocol, such as, Open DataBase Connectivity (ODBC), produced by an underlying software system (e.g., ERP system), and the like. Data sources may also include a data source where the data is not tangibly stored or otherwise ephemeral such as data streams, broadcast data, and the like. These data sources can include associated data foundations, semantic layers, management systems, security systems and so on.
In the above description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details or with other methods, components, techniques, etc. In other instances, well-known operations or structures are not shown or described in detail to avoid obscuring aspects of the invention.
Although the processes illustrated and described herein include series of steps, it will be appreciated that the different embodiments of the present invention are not limited by the illustrated ordering of steps, as some steps may occur in different orders, some concurrently with other steps apart from that shown and described herein. In addition, not all illustrated steps may be required to implement a methodology in accordance with the present invention. Moreover, it will be appreciated that the processes may be implemented in association with the apparatus and systems illustrated and described herein as well as in association with other systems not illustrated.
The above descriptions and illustrations of embodiments of the invention, including what is described in the Abstract, are not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize. These modifications can be made to the invention in light of the above detailed description. Rather, the scope of the invention is to be determined by the following claims, which are to be interpreted in accordance with established doctrines of claim construction.