1. Field
The present disclosure relates generally to a method, apparatus, system, and computer readable media for representing transactions in append-only datastores, and more particularly for representing transactions both on-disk and in-memory.
2. Background
Traditional datastores and databases are designed with log files and paged data and index files. Traditional designs store operations and data in log files and then move this information to paged database files, e.g., by reprocessing the operations and data. This approach has several drawbacks, such as the need for extensive error detection and correction when paged files are updated in place, the storage and movement of redundant information, and the disk-seek-bound nature of in-place page updates.
In light of the above described problems and unmet needs as well as others, systems and methods are presented for providing direct representation of transactions both in-memory and on-disk. This is accomplished using a state collapse method, wherein the end state of a transaction is represented in-memory and written to disk upon commit.
For example, aspects of the present invention provide advantages such as streamlined and pipelined transaction processing; greatly simplified error detection and correction, including transaction roll-back; and efficient use of storage resources by eliminating traditional logging and page files containing redundant information and replacing them with append-only transaction end state files and associated index files.
Additional advantages and novel features of these aspects of the invention will be set forth in part in the description that follows, and in part will become more apparent to those skilled in the art upon examination of the following or upon learning by practice thereof.
Various aspects of the systems and methods will be described in detail, with reference to the following figures, wherein:
These and other features and advantages in accordance with aspects of this invention are described in, or will become apparent from, the following detailed description of various example illustrations and implementations.
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practiced without these specific details. In some instances, well known structures and components are shown in block diagram form in order to avoid obscuring such concepts.
Several aspects of systems capable of providing representations of transactions for both disk and memory, in accordance with aspects of the present invention, will now be presented with reference to various apparatuses and methods. These apparatuses and methods will be described in the following detailed description and illustrated in the accompanying drawings by various blocks, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as “elements”). These elements may be implemented using electronic hardware, computer software, or any combination thereof. Whether such elements are implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system.
By way of example, an element, or any portion of an element, or any combination of elements may be implemented using a “processing system” that includes one or more processors. Examples of processors include microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionality described throughout this disclosure. One or more processors in the processing system may execute software. Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
Accordingly, in one or more example illustrations, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as one or more instructions or code on a computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise random-access memory (RAM), read-only memory (ROM), Electrically Erasable Programmable ROM (EEPROM), compact disk (CD) ROM (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, includes CD, laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Computer system 100 includes one or more processors, such as processor 104. The processor 104 is connected to a communication infrastructure 106 (e.g., a communications bus, cross-over bar, or network). Various software implementations are described in terms of this example computer system. After reading this description, it will become apparent to a person skilled in the relevant art(s) how to implement aspects of the invention using other computer systems and/or architectures.
Computer system 100 can include a display interface 102 that forwards graphics, text, and other data from the communication infrastructure 106 (or from a frame buffer not shown) for display on a display unit 130. Computer system 100 also includes a main memory 108, preferably RAM, and may also include a secondary memory 110. The secondary memory 110 may include, for example, a hard disk drive 112 and/or a removable storage drive 114, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive 114 reads from and/or writes to a removable storage unit 118 in a well-known manner. Removable storage unit 118, represents a floppy disk, magnetic tape, optical disk, etc., which is read by and written to removable storage drive 114. As will be appreciated, the removable storage unit 118 includes a computer usable storage medium having stored therein computer software and/or data.
In alternative implementations, secondary memory 110 may include other similar devices for allowing computer programs or other instructions to be loaded into computer system 100. Such devices may include, for example, a removable storage unit 122 and an interface 120. Examples of such may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or programmable read only memory (PROM)) and associated socket, and other removable storage units 122 and interfaces 120, which allow software and data to be transferred from the removable storage unit 122 to computer system 100.
Computer system 100 may also include a communications interface 124. Communications interface 124 allows software and data to be transferred between computer system 100 and external devices. Examples of communications interface 124 may include a modem, a network interface (such as an Ethernet card), a communications port, a Personal Computer Memory Card International Association (PCMCIA) slot and card, etc. Software and data transferred via communications interface 124 are in the form of signals 128, which may be electronic, electromagnetic, optical or other signals capable of being received by communications interface 124. These signals 128 are provided to communications interface 124 via a communications path (e.g., channel) 126. This path 126 carries signals 128 and may be implemented using wire or cable, fiber optics, a telephone line, a cellular link, a radio frequency (RF) link and/or other communications channels. In this document, the terms “computer program medium” and “computer usable medium” are used to refer generally to media such as a removable storage drive 114, a hard disk installed in hard disk drive 112, and signals 128. These computer program products provide software to the computer system 100. Aspects of the invention are directed to such computer program products.
Computer programs (also referred to as computer control logic) are stored in main memory 108 and/or secondary memory 110. Computer programs may also be received via communications interface 124. Such computer programs, when executed, enable the computer system 100 to perform the features in accordance with aspects of the present invention, as discussed herein. In particular, the computer programs, when executed, enable the processor 104 to perform various features. Accordingly, such computer programs represent controllers of the computer system 100.
In an implementation where aspects of the invention are implemented using software, the software may be stored in a computer program product and loaded into computer system 100 using removable storage drive 114, hard drive 112, or communications interface 124. The control logic (software), when executed by the processor 104, causes the processor 104 to perform various functions as described herein. In another implementation, aspects of the invention are implemented primarily in hardware using, for example, hardware components, such as application specific integrated circuits (ASICs). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons skilled in the relevant art(s).
In yet another implementation, aspects of the invention are implemented using a combination of both hardware and software.
When information is naturally ordered during creation, there is no need for a separate index, or index file, to be created and maintained. However, when information is created in an unordered manner, anti-entropy algorithms may be required to restore order and increase lookup performance.
Anti-entropy algorithms, e.g., indexing, garbage collection, and defragmentation, help to restore order to an unordered system. These operations may be parallelizable. This enables the operations to take advantage of idle cores in multi-core systems. Thus, read performance is regained at the expense of extra space and time, e.g., disk indexes and background work.
Over time, append-only files may become large. Files may need to be closed and/or archived. In this case, new Real Time Key Logging (LRT) files, Real Time Value Logging (VRT) files, and Real Time Key Tree Indexing (IRT) files can be created, and new entries may be written to these new files. An LRT file may be used to provide key logging and indexing for a VRT file. An IRT file may be used to provide an ordered index of VRT files. LRT, VRT, and IRT files are described in more detail in U.S. Utility application Ser. No. 13/781,339, filed on Feb. 28, 2013, titled “Method and System for Append-Only Storage and Retrieval of Information,” which claims priority to U.S. Provisional Application No. 61/604,311, filed on Feb. 28, 2012, the entire contents of both of which are incorporated herein by reference. Forming an index requires an understanding of the type of keying and how the files are organized in storage, e.g., how the on-disk index files are organized. An example logical illustration of file layout and indexing with an LRT file, VRT file, and IRT file is shown in
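By way of a non-limiting illustration, the pairing of an append-only value log (VRT) with an append-only key log (LRT) may be sketched as follows. The class and field names are assumptions made for illustration only; the actual on-disk formats are described in the incorporated application.

```python
import io

class AppendOnlyStore:
    """Illustrative sketch: a VRT stream holds raw values, while an LRT
    log records each key with the offset and length of its value."""

    def __init__(self):
        self.vrt = io.BytesIO()   # value log (append-only)
        self.lrt = []             # key log: (key, vrt_offset, length)

    def append(self, key, value):
        # Values are appended to the VRT stream first, then the key
        # entry is appended to the LRT log, mirroring the ordering in
        # which VRT files may be appended before LRT files.
        offset = self.vrt.tell()
        self.vrt.write(value)
        self.lrt.append((key, offset, len(value)))

    def lookup(self, key):
        # Scan the key log backwards so the most recent binding wins.
        for k, off, length in reversed(self.lrt):
            if k == key:
                self.vrt.seek(off)
                return self.vrt.read(length)
        return None

store = AppendOnlyStore()
store.append("a", b"v1")
store.append("a", b"v2")   # a later append shadows the earlier one
```

In this sketch, later appends shadow earlier ones without any in-place update, which is the property that makes a separate ordered index (the IRT file) useful for read performance.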
At 304, a transaction is begun, the transaction involving at least one datastore based on user or agent input. Beginning a transaction may include, e.g., accessing at least one key/value pair within a datastore.
The datastore involved in the transaction may be prepared, as at 312. Preparing a datastore may include appending a begin prepare transaction indication to the global transaction log when the prepare begins, acquiring a prepare lock for each datastore involved in the transaction, and appending an end prepare transaction indication to the global transaction log when the prepare ends. The begin prepare transaction indication and the end prepare transaction indication may identify, e.g., the transaction being prepared.
In addition, a workspace may be created at 314, the workspace including a user space context and a scratch segment maintaining key to information bindings. Transaction levels may be maintained. In an example, as transactions may be nested, transactions levels may be maintained, e.g., increased each time a new nested transaction is started and decreased each time a nested transaction is aborted or committed.
At 306, at least one of creation, maintenance, and update of a transaction state is performed. This may include copying a state of the datastore into a scratch segment at 316. The scratch segment may be updated throughout the transaction. Creating, updating, and/or maintaining the transaction state may include, e.g., using transaction save points, transaction restore points, and/or transaction nesting. Transaction save points may enable, e.g., a transaction to roll back operations to any save point without aborting the entire transaction. Transaction save points may be released with their changes being preserved. Transaction nesting may create, e.g., implicit save points. Thus, rolling back a nested transaction may not roll back the nesting transaction, and a rollback all operation may roll back both nested and nesting transactions.
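The scratch segment and save point behavior described above may be sketched as follows, with the caveat that the class, method, and field names are assumptions for illustration and not the actual implementation.

```python
class TransactionWorkspace:
    """Illustrative workspace: the datastore state is copied into a
    scratch segment, and named save points snapshot that segment so a
    transaction can roll back without aborting entirely."""

    def __init__(self, datastore_state):
        # Copy the datastore state into the scratch segment (as at 316).
        self.scratch = dict(datastore_state)
        self.save_points = {}   # name -> snapshot of the scratch segment

    def put(self, key, value):
        self.scratch[key] = value

    def save_point(self, name):
        # Record a snapshot; nested transactions could create these
        # implicitly at each nesting level.
        self.save_points[name] = dict(self.scratch)

    def rollback_to(self, name):
        # Roll back to the save point without discarding the workspace.
        self.scratch = dict(self.save_points[name])

ws = TransactionWorkspace({"k": 1})
ws.put("k", 2)
ws.save_point("sp1")
ws.put("k", 3)
ws.rollback_to("sp1")   # the change made after sp1 is undone
```

Releasing a save point, by contrast, would simply discard the snapshot while preserving the changes in the scratch segment.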
The transaction is ended at 308, and the state of the transaction is written to memory in an append-only manner at 310, wherein the state comprises append-only key and value files. The append-only key and value files may, e.g., encode at least one boundary that represents the transaction. The append-only key and value files may represent, e.g., an end state of the transaction. For example, the state written to memory may be an end state of the scratch segment after the transaction has ended. The memory to which the state of the transaction is written may be non-transient, e.g., disk memory. Append-only transaction log files may group a plurality of files representing the transaction.
Key/value pairs may be considered modified when the key/value pair is created, updated, or deleted.
At 318, at least one lock may be acquired. For example, a lock for a segment in the transaction may be acquired. A read lock for a key/value pair read in the transaction may be acquired. Additionally, a write lock for a key/value pair modified in the transaction may be acquired. Locks may be acquired in order, and lock acquisition order may be maintained. Locks may be acquired in a consistent order, e.g., in order to avoid deadlocks.
A read lock may be promoted to a write lock when only one reader holds the read lock and when the reader needs to modify key/value pairs, e.g., in order to enable the reader to modify the key/value pairs. A reader in this case refers to the entity reading the key/value pair. The system may, e.g., promote a read lock to a write lock if that reader/entity is the exclusive holder of the read lock when it tries to modify the key/value pair.
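The promotion rule above may be sketched as follows; the lock structure and method names are assumptions for illustration.

```python
class KeyLock:
    """Sketch of read-lock promotion: a read lock is promoted to a
    write lock only when the requester is its sole holder."""

    def __init__(self):
        self.readers = set()
        self.writer = None

    def acquire_read(self, owner):
        if self.writer is None:
            self.readers.add(owner)
            return True
        return False

    def promote(self, owner):
        # Promotion succeeds only for the exclusive read-lock holder,
        # enabling that reader to modify the key/value pair.
        if self.readers == {owner} and self.writer is None:
            self.readers.clear()
            self.writer = owner
            return True
        return False

lock = KeyLock()
lock.acquire_read("t1")
promoted = lock.promote("t1")    # sole reader: promotion succeeds

lock2 = KeyLock()
lock2.acquire_read("t1")
lock2.acquire_read("t2")
blocked = lock2.promote("t1")    # another reader holds the lock: denied
```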
The transaction state may be written to each datastore in an append-only manner after all datastore prepare locks have been acquired. VRT files may be appended before LRT files are appended.
Any acquired lock may be released when the transaction is ended. The locks may be released, e.g., in acquisition order.
As illustrated at 320, the transaction may be performed in a streamlined manner, or, the transaction may be performed in a pipelined manner, as described in more detail below. IO may be either synchronous or asynchronous. Transaction streamlining may comprise, e.g., a single-threaded, zero-copy, single-buffered method. Transaction streamlining may minimize per-transaction latency. Transaction pipelining may comprise a multi-threaded, double-buffered method. Transaction pipelining may maximize transaction throughput.
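A toy model of the double-buffered pipelining mode may look as follows. This is a single-threaded sketch under assumed names; in a pipelined implementation the flush of the drained buffer would run on a separate thread while new appends continue into the fill buffer.

```python
class PipelinedWriter:
    """Illustrative double-buffered writer: one buffer fills while the
    other drains, trading per-transaction latency for throughput."""

    def __init__(self):
        self.fill, self.drain = [], []
        self.flushed = []

    def append(self, record):
        # New transaction records always land in the fill buffer.
        self.fill.append(record)

    def swap_and_flush(self):
        # Swap the buffers, then flush the drained buffer's contents.
        self.fill, self.drain = self.drain, self.fill
        self.flushed.extend(self.drain)
        self.drain.clear()

w = PipelinedWriter()
w.append("txn1")
w.append("txn2")
w.swap_and_flush()
```

The streamlined mode, by contrast, would write each transaction through a single buffer with no copy, minimizing latency at the cost of throughput.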
At 322, the transaction may be aborted. During the prepare state, this may include releasing all associated prepare locks in a consistent acquisition order. The transaction state may be written to a VRT file and/or an LRT file, wherein the transaction state is either rolled back or identified with an append-only erasure indication. An abort transaction indication may be appended to a global transaction log, the abort transaction indication indicating that the transaction aborted. Aborting the transaction may include releasing any acquired segment and key/value locks in acquisition order.
At 324, a global append-only transaction log file may be used. Flags may be used, e.g., to indicate a transaction state. Such flags may represent any of a begin prepare transaction, an end prepare transaction, a commit transaction, an abort transaction, and no outstanding transactions. A no outstanding transactions flag may be used as a checkpoint enabling fast convergence of error recovery algorithms.
Transactions and/or files may be identified by UUIDs. Transactions may, e.g., be distributed. A time stamp may be used in order to record a transaction time. Such timestamps may comprise either wall clock time, e.g., UTC, or time measured in ticks, e.g., Lamport timestamp.
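One possible shape for a global transaction log entry, combining the flags, UUID, and timestamp described above, is sketched below. The field layout and flag values are assumptions; only the Type 4 UUID and the flag names come from the disclosure.

```python
import time
import uuid

# Flag values mirroring the transaction states listed above.
BEGIN_PREPARE, END_PREPARE, COMMIT, ABORT, NO_OUTSTANDING = range(5)

def make_log_entry(flag, txn_id=None, position=None, lamport=None):
    """Build an illustrative log record. `position` would point at the
    begin record for commit/abort entries; the timestamp is either a
    Lamport tick count or wall clock time."""
    return {
        "flag": flag,
        "uuid": txn_id,          # Type 4 UUID: no distributed ID coordination
        "position": position,    # offset of the begin transaction record
        "timestamp": lamport if lamport is not None else time.time(),
    }

txn_id = uuid.uuid4()            # distributed-safe transaction identifier
entry = make_log_entry(COMMIT, txn_id, position=128, lamport=42)
```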
At 326, the transaction may be committed. Committing the transaction may cause the transaction to be prepared and may follow a successful transaction preparation. A commit transaction indication may be appended to a global transaction log, the commit transaction indication indicating the transaction committed. Committing the transaction may include releasing any acquired segment and key/value locks in acquisition order.
In an aspect, the steps described in connection with
Next, each ordered datastore is iterated over in 514 and each datastore is prepared in 516. Additional details are described in connection with
Once all ordered datastores are traversed at 604, their commit locks are released in acquisition order, starting at 612. At 614, each datastore's commit lock is released. Once all ordered datastores have been traversed, the iteration over the datastores at 612 ends, and a commit indication is written to the global transaction log at 616. Finally, a success status is returned at 618.
Creating a workspace within a datastore includes the creation of a userspace context at 812 and the creation of a scratch segment at 814. Once the workspace and its components have been created, TRUE is returned at 816.
When the value element write succeeds, as determined at 908, the associated key element is written to the LRT file at 910. If the key element write fails, as determined at 912, the datastore transaction is aborted at 914; additional details are described in connection with
A successful key element write continues with iteration over the next Key/Information pair at 904. Finally, once all Key/Information pairs have been successfully written the iteration process at 904 ends and a success status is returned at 916.
Once all level ordered scratch segments are traversed in 1306, the next associated datastore is traversed in 1304. When datastore traversal is complete, the current transaction level is set to the save point level − 1 at 1312 and the process ends at 1314.
Once all scratch segments have been iterated over in 1506, the next associated datastore is iterated over in 1504. When all associated datastores have been iterated over, the transaction level is set to the rollback level − 1 in 1512 and the method ends at 1514.
Once the transaction's state has been written and the write lock released, the wait count lock is acquired in 1816 and the wait count is decremented in 1818. If the wait count is non-zero, as determined by 1820, the method releases the wait count lock at 1830 and waits for a zero notification in 1832. When a zero notification occurs at 1832, the method ends at 1828.
If the wait count is equal to zero at 1820, the file system is synchronized in 1822 and all waiting requests are notified of zero in 1824. Finally, the wait count lock is released at 1826 and the method ends at 1828.
Thus, in accordance with aspects presented herein, transactions can group operations into atomic, isolated, and serializable units. There may be two major types of transactions, e.g., transactions within a single datastore and transactions spanning datastores. Transactions may be formed in-memory, e.g., with a disk cache for large transactions, and may be flushed to disk upon commit. Thus, information in LRT, VRT, and IRT files may represent committed transactions rather than intermediate results.
Once a transaction is committed to disk, the in-memory components of the datastore, e.g., the active segment tree, may be updated as necessary. In one example, committing to disk first, and then applying changes to the shared in-memory representation while holding the transaction's locks may enforce transactional semantics. All locks associated with the transaction may be removed, e.g., once the shared in-memory representation is updated.
Transactions may be formed in-memory before they are either committed or rolled-back. Isolation may be maintained by ensuring transactions in process do not modify shared memory, e.g., the active segment tree, until the transactions are successfully committed.
Global, e.g., database, transactions may span one to many datastores. Global transactions may coordinate an over-arching transaction with datastore level transactions. Global transactions may span both local datastores and distributed datastores. Architecturally, transactions spanning datastores may have the same semantics. This may be accomplished through the use of an atomic commitment protocol for both local and distributed transactions. More specifically, an enhanced two-phase commit protocol may be used.
All database transactions may be given a Universally Unique Identifier (UUID) that enables them to be uniquely identified without the need for distributed ID coordination, e.g., a Type 4 UUID. This transaction UUID may be carried between systems participating in the distributed transaction and may be stored, e.g., in transaction logs.
When a transaction spanning multiple datastores is committed, the global transaction log for those datastores may be maintained, e.g., in two phases: a prepare phase and a commit phase.
As illustrated in
Each datastore has a commit lock that may be acquired during the prepare phase and before the transaction log is updated with the global transaction ID or the datastore UUIDs of the attached datastores. The datastore commit locks may be acquired in a consistent order, e.g., to avoid the possibility of a deadlock. Once the commit locks are acquired and the prepare records are written to the global transaction log, the transaction may proceed, e.g., with prepare calls on each datastore comprised in the transaction. The datastore prepare phase may comprise writing the LRT/VRT files with the key/values comprised in their scratch segments. Once each datastore has been successfully prepared, the transaction moves to the commit phase.
During a transaction commit phase, a commit may be called on each of the datastores comprised in the transaction, releasing each datastore's commit lock. Then, the global transaction log may be updated with a commit record for the transaction. The commit record may comprise any of a commit flag set, a global transaction UUID, and a pointer to the start of a transaction record within the global transaction log file.
If any of the datastores comprised in the transaction cannot be prepared during the prepare phase, an abort is performed. This may occur, e.g., when a write fails. An abort may be applied to roll back all written transaction information in each datastore comprised in the transaction. As described supra, the start of each transaction position within each datastore may be written to the global transaction log during the prepare phase while holding all associated datastore commit locks. This may enable a rollback to be as simple as rewinding each LRT/VRT file insertion point for the transaction to the transaction's start location. At times, it may be desirable to preserve append-only operation and to have an erasure indication appended to the affected LRT/VRT files. Holding commit locks, e.g., may enable each LRT/VRT file to be written to by only one transaction at a time. An abort record for the transaction may then be appended to the global transaction log.
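The rewind-based rollback described above may be sketched as follows: the transaction's start offset is captured before the prepare-phase writes, and an abort truncates the append-only file back to that offset. The function names and file handling are assumptions for illustration.

```python
import os
import tempfile

def append_transaction(path, records):
    """Record the transaction start location, then append the records."""
    start = os.path.getsize(path)          # the transaction's start location
    with open(path, "ab") as f:
        for r in records:
            f.write(r)
    return start

def abort_transaction(path, start):
    # Rewind the file insertion point by resetting the file length to
    # the transaction's start location.
    with open(path, "ab") as f:
        f.truncate(start)

fd, path = tempfile.mkstemp()
os.close(fd)
append_transaction(path, [b"committed"])       # an earlier, durable write
start = append_transaction(path, [b"aborted-data"])
abort_transaction(path, start)                 # the failed write is rewound
```

Because each LRT/VRT file is written by only one transaction at a time while commit locks are held, truncating back to the recorded offset removes exactly the aborted transaction's bytes.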
In an aspect, transactions within a datastore may be localized to and managed by that datastore. In such an aspect, transactions within the datastore may be initiated by a request to associate the datastore with a global transaction. An associated transaction request on a datastore may, e.g., create an internal workspace within the datastore. This may occur, e.g., for a new association. When a new association is created, a first indication may be returned. When the transaction was previously associated within the datastore, a second indication may be returned. For example, the first indication may comprise a “true” indication, while the second indication comprises a “false” indication. When a false indication is returned, e.g., and the existing workspace is used internally, at least one workspace object may maintain the context for all operations performed within a transaction on the datastore. A workspace may comprise a user space context and a scratch segment maintaining key to information bindings. Such a scratch segment may maintain a consolidated record of all last changes performed within the transaction. The record may be consolidated, e.g., because it may be a key to information structure where information comprises the last value change for a key. As a transaction progresses, the keys it accesses and the values that it modifies may be recorded in the workspace's segment.
Among others, there may be, e.g., four key/value access/update circumstances. First, such circumstances may include “created” indicating the transaction that created the key/value. Second, such circumstances may include “read” indicating a transaction that read the key/value. Third, such circumstances may include “updated” indicating a transaction that updated the key/value. Fourth, such circumstances may include “deleted” indicating a transaction that deleted the key/value.
Once a transaction accesses and/or updates a key/value, all subsequent accesses and/or updates for that key/value may be performed on the workspace's scratch segment. In this way, the transaction may be isolated from the active segment tree.
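The per-key bookkeeping and isolation described above may be sketched as follows; the state names mirror the four circumstances above, while the structure itself is an assumption for illustration.

```python
CREATED, READ, UPDATED, DELETED = "created", "read", "updated", "deleted"

class ScratchSegment:
    """Illustrative scratch segment: each touched key records its access
    state and last value, so later operations never reach the shared
    active segment tree until commit."""

    def __init__(self, shared):
        self.shared = shared          # shared tree (read-only here)
        self.entries = {}             # key -> (state, last value)

    def read(self, key):
        if key in self.entries:       # subsequent access: scratch wins
            _state, value = self.entries[key]
            return value
        value = self.shared.get(key)  # first access: pull from shared tree
        self.entries[key] = (READ, value)
        return value

    def update(self, key, value):
        # A key absent from the shared tree is a creation; otherwise an
        # update. Only the last change per key is kept (consolidation).
        state = UPDATED if key in self.shared else CREATED
        self.entries[key] = (state, value)

seg = ScratchSegment({"k": 1})
seg.update("k", 2)        # recorded as an update of an existing key
seg.update("new", 9)      # recorded as a creation
```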
Locks may exist at both the active segment level and at the key/value level. Adding a new key/value to a segment may require an acquisition of a segment lock, e.g., for the segment that is being modified. This may further require the creation of a placeholder information object within the active segment tree. Once an information object exists, it may be used for key/value level locking and state bookkeeping.
Lock coupling may be used to obtain top-level segment locks. Lightweight two-phase locking may then be used for segment and information locking. Two-phase locking implies that all locks for a transaction may be acquired and held for the duration of the transaction. Locks may be released, e.g., only after no further information will be accessed. For example, locks may be released at a commit or an abort.
State bookkeeping enables the detection of transaction collisions and deadlocks. Many transactions may read the same key/value. However, only one transaction may write a key/value at a time. Furthermore, once a key/value has been read in a transaction, it may not change during that transaction. If a second transaction attempts to write the key/value that a first transaction has read or written, a transaction collision is considered to have occurred. Such transaction collisions should be avoided, when possible. When avoidance may not be possible, it may be important to detect and resolve such collisions. Collision resolution may include, e.g., any of blocking on locks to coordinate key/value access; deadlock detection, avoidance, and recovery; and error reporting and transaction roll back.
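The collision rule above (many readers per key, at most one writer, and no write against a key another transaction has read or written) may be sketched as follows; the detector structure is illustrative only.

```python
class CollisionDetector:
    """Illustrative state bookkeeping for transaction collisions."""

    def __init__(self):
        self.readers = {}   # key -> set of transaction ids that read it
        self.writer = {}    # key -> transaction id holding the write

    def note_read(self, txn, key):
        self.readers.setdefault(key, set()).add(txn)

    def try_write(self, txn, key):
        # A write collides if any other transaction has read the key,
        # or another transaction has already written it.
        others_read = self.readers.get(key, set()) - {txn}
        other_writer = self.writer.get(key) not in (None, txn)
        if others_read or other_writer:
            return False    # collision: block, report, or roll back
        self.writer[key] = txn
        return True

d = CollisionDetector()
d.note_read("t1", "k")
ok = d.try_write("t1", "k")      # writing a key it read itself: allowed
clash = d.try_write("t2", "k")   # t1 read and wrote "k": collision
```

In a full implementation the `False` branch would feed one of the resolution strategies named above: blocking on the lock, deadlock detection and recovery, or error reporting with transaction roll back.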
During a prepare phase, when a datastore level transaction is prepared, its workspace's scratch segment may be written to a disk VRT file first and then to an LRT file.
During a commit phase, a successfully written transaction may be committed. When such a transaction is committed, any of (1) the active segment tree may be updated with the information in the workspace's scratch segment, (2) associated bookkeeping may be updated, and (3) all acquired locks may be released.
When an unsuccessful transaction is aborted and rolled back, any of (1) associated bookkeeping may be updated, (2) the LRT and VRT file pointers may be reset to the transaction start location, (3) all acquired locks may be released, (4) the workspace's scratch segment may be discarded, and (5) transaction error reporting may be performed. In order to reset the LRT and VRT file pointers to the transaction start location, e.g., the file lengths may be set to the transaction start location.
Transactions may be written to on-disk representation. Transactions written to disk may be delimited on disk to enable error detection and correction. Transaction delineation may be performed both within and between datastores. For example, group delimiters may identify transactions within datastore files. An append-only transaction log, e.g., referencing the transaction's groups within each datastore, may identify transactions between datastores. A datastore's LRT file may delimit groups using, e.g., a group start flag and a group end flag.
Index=>tuple of affected keys
Using this notation, LRT B has three group operations: 0=>(50, 70), 2=>(41, 42, 43), and 5=>(80).
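The group delimiting scheme can be exercised with a short sketch that scans hypothetical LRT entries for group start and group end flags; the entry format and flag values are assumptions for illustration, and the input reproduces the LRT B example above.

```python
GROUP_START, GROUP_END = 0x1, 0x2

def group_operations(entries):
    """entries: list of (index, key, flags).
    Returns a mapping of group start index => tuple of affected keys."""
    groups, current, start = {}, None, None
    for index, key, flags in entries:
        if flags & GROUP_START:
            current, start = [], index     # open a new group
        current.append(key)
        if flags & GROUP_END:
            groups[start] = tuple(current) # close and record the group
            current = None
    return groups

# Hypothetical LRT B contents matching the example in the text.
lrt_b = [
    (0, 50, GROUP_START), (1, 70, GROUP_END),
    (2, 41, GROUP_START), (3, 42, 0), (4, 43, GROUP_END),
    (5, 80, GROUP_START | GROUP_END),      # single-key group
]
ops = group_operations(lrt_b)
```

Scanning for matched start/end flag pairs in this way is also what makes torn trailing groups detectable during error recovery.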
A transaction log may comprise, e.g., entries identifying each of the components of the transaction.
Flags may indicate, among other information, any of a begin prepare transaction, an end prepare transaction, a commit transaction, an abort transaction, and no outstanding transactions.
When a begin transaction is set, e.g., a UUID may be the transaction's ID and the size of the transaction may be specified, as illustrated in
When a committed transaction flag is set, UUID may be the committed transaction's UUID and the position may indicate a position of the begin transaction record within the transaction log.
When an aborted transaction flag is set, the UUID may be the aborted transaction's UUID and the position may indicate a position of the begin transaction record within the transaction log. This may be the same scheme, e.g., as a scheme applied when a transaction is committed.
The no outstanding transactions flag may be set, e.g., during commit or abort when there are no outstanding transactions left to commit or abort. This may act as a checkpoint flag, enabling error recovery to quickly converge when this flag is set. For example, error recovery may stop searching for transaction pairings once this flag is encountered.
A time stamp may record the time in ticks, or the wall clock time when the operation occurred. Among others, ticks may be recorded via a Lamport timestamp. Wall clock time may indicate, e.g., milliseconds since the epoch.
Errors may occur in any of the files of the datastore. A common error may comprise an incomplete write. This error damages the last record in a file. When this occurs, affected transactions may be detected and rolled back. For example, such affected transactions may comprise transactions within a single datastore or transactions spanning multiple datastores. Error detection and correction within a datastore may provide the last valid group operation position within its LRT file. Given this LRT position, any transaction within the transaction log after this position may be rolled back, e.g., as the data for the transaction may have been lost. If the data for the transaction spans multiple datastores, the transaction may be rolled back across datastores. In this aspect, the transaction log may indicate the datastores to be rolled back. For example, the transaction log may indicate the datastores to be rolled back by file UUID and position.
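The rollback decision described above can be sketched as follows, assuming the transaction log records each transaction's position within the LRT file (the names and structures here are illustrative, not part of the disclosure):

```python
def find_rollbacks(log_entries, last_valid_lrt_pos):
    """Given log_entries as a list of (txn_id, lrt_position) pairs in log
    order, return the IDs of transactions whose data lies after the last
    valid group operation position -- i.e., data that may have been lost
    to an incomplete write and must be rolled back."""
    return [txn_id for txn_id, pos in log_entries
            if pos > last_valid_lrt_pos]

entries = [("t1", 0), ("t2", 44), ("t3", 96)]
# Suppose error detection reports the last valid group ends at position 44.
doomed = find_rollbacks(entries, last_valid_lrt_pos=44)
```

For transactions spanning multiple datastores, the same check would run per datastore, with the transaction log's file UUIDs identifying which datastores to roll back.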
A transaction in progress may have, e.g., named save points. Save points may enable a transaction to roll back to a previous save point without aborting the entire transaction. Additionally, save points can be released and their changes can be aggregated to an enclosing save point or to a transaction context.
Nested transactions may have, e.g., implicit save points. When a nested transaction is rolled back, the operations and state of the nested transaction may be rolled back. For example, this may not roll back the entire enclosing transaction. A rollback all operation may enable the rollback of all transactions comprised within the nested transaction.
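Save point behavior of this kind can be sketched over a hypothetical key/value transaction state; under this model a nested transaction would simply create an implicit save point on entry:

```python
class Transaction:
    """Sketch of named save points. Each save point opens a new frame that
    collects changes made after it; the key/value model is illustrative."""

    def __init__(self):
        self.frames = [{}]     # base transaction context
        self.names = [None]

    def set(self, key, value):
        self.frames[-1][key] = value

    def savepoint(self, name):
        self.frames.append({})
        self.names.append(name)

    def rollback_to(self, name):
        # Discard all changes made since the named save point,
        # without aborting the enclosing transaction.
        while self.names[-1] != name:
            self.frames.pop()
            self.names.pop()
        self.frames[-1].clear()

    def release(self, name):
        # Release the save point, aggregating its changes into the
        # enclosing save point or transaction context.
        assert self.names[-1] == name
        changes = self.frames.pop()
        self.names.pop()
        self.frames[-1].update(changes)

    def state(self):
        merged = {}
        for frame in self.frames:
            merged.update(frame)
        return merged

t = Transaction()
t.set("a", 1)
t.savepoint("sp1")
t.set("a", 2)
t.set("b", 3)
t.rollback_to("sp1")   # undo changes since sp1: "a" is 1 again, "b" is gone
t.set("b", 9)
t.release("sp1")       # fold b=9 into the enclosing context
```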
Streamlined transactions may have any of the following features: (1) single-threaded, (2) zero-copy, (3) single-buffered, and (4) minimal per-transaction latency.
When a transaction is committed and synchronous durability is desired, the commit operation may be configured to not return until after the transaction's state is written to persistent storage. When transactions are streamlined, this implies that a Sync may be performed after every transaction write. This approach may have a large performance impact.
Asynchronous IO may provide better performance when transactions are streamlined. When this mode is used, transaction writes may not force synchronization with the file system.
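The durability trade-off above can be sketched with POSIX-style file IO: a synchronous commit pays for an fsync before returning, while an asynchronous commit returns as soon as the record reaches the operating system's write buffers:

```python
import os
import tempfile

def commit(fd: int, record: bytes, synchronous: bool) -> None:
    """Streamlined-commit sketch: append the transaction's end state.
    With synchronous durability every commit pays for an fsync; without
    it, the call returns once the data reaches OS write buffers."""
    os.write(fd, record)
    if synchronous:
        os.fsync(fd)  # force the write to persistent storage before returning

fd, path = tempfile.mkstemp()
commit(fd, b"txn-1-end-state", synchronous=True)
os.close(fd)
```

The fsync per commit is exactly the cost that makes the streamlined synchronous mode expensive, motivating the pipelined approach below.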
Pipelined transactions may be, e.g., multi-threaded and double-buffered, providing maximal throughput while adding latency to overlapping commits when synchronous IO is used. When a transaction is committed and synchronous durability is desired, the commit operation may be configured to not return until after the transaction's state is written to persistent storage. This may require, e.g., a Sync operation to force information out of memory buffers and on to persistent storage.
One approach may involve a Sync operation immediately after each commit operation. However, this approach might not scale well and may reduce system throughput. Thus, another approach may comprise transaction pipelining. This approach may be applied to transactions that overlap in time. Commits may be serialized, but may be configured to not return until there is a Sync operation. At that time, all pending commits may return. Using this approach, the cost of the Sync operation may be amortized over many transactions. Thus, individual transaction commits may not return, e.g., until a transaction state is written to persistent storage. Such transaction pipelining may comprise either synchronous IO or asynchronous IO.
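Such pipelining can be sketched as a group commit, where overlapping commits accumulate and a single Sync covers them all. The single-threaded model below is a simplified illustration of the amortization, not of the concurrency:

```python
class GroupCommit:
    """Sketch of transaction pipelining: commits are serialized but do not
    complete durably until the next Sync, which covers every pending commit."""

    def __init__(self):
        self.pending = []   # commits waiting for a Sync
        self.durable = []   # commits known to be on persistent storage
        self.syncs = 0      # how many Sync operations were paid for

    def commit(self, txn_id):
        # The commit is appended but not yet durable.
        self.pending.append(txn_id)

    def sync(self):
        # One Sync makes every pending commit durable at once,
        # amortizing its cost over all of them.
        self.syncs += 1
        self.durable += self.pending
        self.pending.clear()

log = GroupCommit()
for txn in ("t1", "t2", "t3"):
    log.commit(txn)   # three overlapping commits
log.sync()            # a single Sync covers all three
```

Three transactions become durable for the price of one Sync; under the naive approach each commit would have paid for its own.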
In an alternate aspect, asynchronous IO may enable a transaction to be buffered at both the application and operating system layers. Each commit may return, e.g., as soon as the transaction's data is written to write buffers.
While aspects of this invention have been described in conjunction with the example aspects of implementations outlined above, various alternatives, modifications, variations, improvements, and/or substantial equivalents, whether known or that are or may be presently unforeseen, may become apparent to those having at least ordinary skill in the art. Accordingly, the example illustrations, as set forth above, are intended to be illustrative, not limiting. Various changes may be made without departing from the spirit and scope hereof. Therefore, aspects of the invention are intended to embrace all known or later-developed alternatives, modifications, variations, improvements, and/or substantial equivalents.
The present application for patent claims priority to Provisional Application No. 61/638,886 entitled “METHOD AND SYSTEM FOR TRANSACTION REPRESENTATION IN APPEND-ONLY DATASTORES” filed Apr. 26, 2012, the entire contents of which are hereby expressly incorporated by reference herein. The present application for patent is related to the following co-pending U.S. patent applications: U.S. patent application Ser. No. 13/781,339, entitled “METHOD AND SYSTEM FOR APPEND-ONLY STORAGE AND RETRIEVAL OF INFORMATION” filed Feb. 28, 2013, which claims priority to Provisional Application No. 61/604,311 entitled “METHOD AND SYSTEM FOR APPEND-ONLY STORAGE AND RETRIEVAL OF INFORMATION” filed Feb. 28, 2012, the entire contents of both of which are expressly incorporated by reference herein; and Provisional Application No. 61/613,830 entitled “METHOD AND SYSTEM FOR INDEXING IN DATASTORES” filed Mar. 21, 2012, the entire contents of which are expressly incorporated by reference herein.
Number | Date | Country
---|---|---
61638886 | Apr 2012 | US