The present disclosure generally relates to databases and, specifically, techniques for the implementation of protective validation techniques of a predictive concurrency control protocol to maintain the serializability of concurrent operations in such databases.
In databases, concurrency control protocols ensure that correct results for concurrent operations are generated as quickly as possible. A concurrency control protocol provides rules and methods, typically applied by the database mechanisms, to maintain the consistency of transactions operating concurrently and, thus, the consistency and correctness of the whole database. Introducing concurrency control into a database applies operation constraints, which typically result in some performance reduction. Operation consistency and correctness should be achieved as efficiently as possible without reducing the database's performance. However, a concurrency control protocol can require significant additional complexity and overhead in a concurrent algorithm compared to a simpler sequential algorithm.
A concurrency control protocol can be implemented in database management systems, transactional objects, and distributed applications. Such a protocol is designed to ensure that database transactions may be performed concurrently without violating the data integrity of the respective databases. Thus, concurrency control is an essential element for correctness in any database system where two database transactions or more, executed with time overlap, can access the same data, e.g., in virtually any general-purpose database system. There are different approaches to implementing a concurrency control protocol (or mechanism) in databases. The main approaches may be categorized as optimistic approaches and pessimistic approaches.
In some optimistic approaches, a check for whether a transaction meets the isolation and other integrity rules (e.g., serializability) is typically performed when the transaction ends, without blocking any of the transaction's operations. Other optimistic approaches check whether a transaction meets the isolation and other integrity rules (e.g., serializability), without blocking any of the transaction's operations. When the isolation of the transaction is violated, the transaction is aborted. An aborted transaction may be immediately restarted and re-executed, which incurs an overhead. As such, if too many transactions are aborted, the optimistic approach may be disadvantageous. In a pessimistic approach, an operation of a transaction is blocked when such an operation may cause a violation of consistency rules. In such cases, the operation is blocked until the possibility of violation of the transaction clears. The disadvantage of blocking operations involves performance reduction.
Different approaches for concurrency control in databases provide different levels of performance. The selection of the best-performing approach may be based on the type of transactions, the required performance, the type of databases, and the applications accessing the database. However, the selection and knowledge about trade-offs are not always available, and thus the implemented concurrency control approach may not be selected to provide the highest performance.
Further, some databases are designed where Atomicity, Consistency, Isolation, Durability (ACID) requirements are relaxed. In such databases, as multiple transactions can execute concurrently and independently of each other, such transactions may overlap in their access to data. This could result in various inconsistencies. One method to ensure isolation between transactions and serialization in execution is by means of a well-designed concurrency control protocol.
Furthermore, existing concurrency control protocols are not efficient for transactions that include one or more predicates. Specifically, such protocols require placing locks or pausing the execution of transactions regardless of the states of the transactions' predicates. In databases, a predicate is a conditional (i.e., Boolean) expression that returns TRUE or FALSE. Predicates are commonly used in statements sent to databases, and are often an inherent part of the database statement syntax or language. For example, a common usage of predicates would be to conditionally modify a data-cell(s) based on a condition that is based on data-cell(s). Another use of predicates in a relational database is when selecting one or more rows in a table. The selected rows are those for which the predicate evaluation, based on the contents of the row, returns TRUE. These selected rows can then be further acted upon.
It would, therefore, be advantageous to provide an improved concurrency control protocol for optimizing the performance of databases when executing transactions with predicates.
A summary of several example embodiments of the disclosure follows. This summary is provided for the convenience of the reader to provide a basic understanding of such embodiments and does not wholly define the breadth of the disclosure. This summary is not an extensive overview of all contemplated embodiments, and is intended to neither identify key or critical elements of all embodiments nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more embodiments in a simplified form as a prelude to the more detailed description that is presented later. For convenience, the term “some embodiments” or “certain embodiments” may be used herein to refer to a single embodiment or multiple embodiments of the disclosure.
A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
In one general aspect, a method may include receiving a statement that is a part of a transaction; determining if the received statement includes at least one predicate; when the received statement includes at least one predicate, performing, for each of the at least one predicate: identifying at least one table-node participating in the execution of a task associated with the received statement; instantiating, by each of the at least one table-node, a conditional read vector (RV)-entry for the at least one predicate; executing the task included in the received statement, where the task includes evaluation of the predicate; and upon receiving a commit statement that is part of the transaction, returning to a client an acknowledgment that the transaction is committed, where the acknowledgment is returned upon validation of the transaction, and where during validation of the transaction, reading-transactions conflicting with the transaction are detected using at least conditional RV-entries instantiated by the at least one table-node. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
In one general aspect, a system may include one or more processors configured to: receive a statement that is a part of a transaction; determine if the received statement includes at least one predicate; when the received statement includes at least one predicate, perform, for each of the at least one predicate: identifying at least one table-node participating in the execution of a task associated with the received statement; instantiating, by each of the at least one table-node, a conditional read vector (RV)-entry for the at least one predicate; execute the task included in the received statement, where the task includes evaluation of the predicate; and upon receiving a commit statement that is part of the transaction, return to a client an acknowledgment that the transaction is committed, where the acknowledgment is returned upon validation of the transaction, and where during validation of the transaction, reading-transactions conflicting with the transaction are detected using at least conditional RV-entries instantiated by the at least one table-node. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
The subject matter disclosed herein is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the disclosed embodiments will be apparent from the following detailed description taken in conjunction with the accompanying drawings.
It is important to note that the embodiments disclosed herein are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed embodiments. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality. In the drawings, like numerals refer to like parts through several views.
Some example embodiments provide a predictive concurrency control protocol (CCP) with protective validation techniques implemented into a database system (or simply a database). According to the disclosed embodiments, consistency of transactions, by means of the disclosed predictive CCP, is achieved through isolating transactions and adopting different approaches during the execution phases of a transaction. In an embodiment, an optimistic approach is implemented during the working-phase of a transaction to allow the operation of multiple transactions to run independently without blocking or locks. For validation of a transaction, a pessimistic approach is taken, where a validating transaction may wait for other transaction(s) to commit, and where, under some circumstances, a transaction that evaluated predicates may not block other validating transaction(s) from committing. This is achieved by predicting the value of such predicates in a transaction being validated. The prediction of values of such predicates for data-cells is achieved, in some embodiments, through an epsilon checking procedure, discussed in detail below. According to the disclosed embodiments, various protective declarations are used to block other transactions from modifying the contents of the data-cells when it is determined by the epsilon checking procedure that a transaction may commit early. Additional disclosed embodiments include techniques for efficiently handling conflicts and deadlocks between transactions. As a result of these embodiments, significantly fewer transactions are aborted in comparison to a known implementation of an optimistic concurrency control protocol, thereby improving the overall performance of the databases. Further, significantly more transactions can be executed and committed in parallel than with a known implementation of an optimistic concurrency control protocol. 
Thus, the disclosed embodiments allow for higher parallelism in the transaction working-phase, execution, and validation-phases.
As such, the disclosed techniques allow for the fast execution of transactions and the processing of more transactions at a given time period. Therefore, the disclosed embodiments provide a technical improvement over current database systems that, in most cases, fail to serve applications that require fast and parallel execution of transactions for retrieval and modification of datasets. The disclosed embodiments can be implemented in database systems as well as in data management systems, such as an object storage system, a key-value storage system, a file-system, and the like.
Each client 110 is configured to access the database 120 through the execution of transactions. A client 110 may include any computing device executing applications, services, processes, and so on. A client 110 can run a virtual instance (e.g., a virtual machine, a software container, and the like).
In some configurations, clients 110 may be software entities that interact with the database 120. Clients 110 are typically located in compute nodes that are separate from the database 120 and communicate with the database 120 via an interconnect or over a network. In some configurations, an instance of a client 110 can reside in a node that is part of the database 120.
The database 120 may be designed according to a shared-nothing or a shared-everything architecture. The transactions to the database 120 are processed without locks placed on data entries in the database 120. This allows for fast retrieval and modification of data sets.
A transaction is issued by a client 110, processed by the database 120, and the results are returned to the client 110. A transaction typically includes the execution of various data-related operations over the database system 120. These operations are often originated by clients 110. The execution of such operations may be short or lengthier. In many cases, operations are independent and unaware of each other's progress.
A transaction can be viewed as an algorithmic program logic that potentially involves reading and writing various data-cells. A transaction, for example, may read some data-cells through one data operation, and then, based on the values read, can decide to modify other data-cells. That is, a transaction is not just an “I/O operation” but is more of a “true” computer program. A data cell is one cell of data. Data cells may be organized and stored in various formats and ways. Data cells, defined below, may be contained in files or other containers, and can represent different types (integer, string, and so on).
An execution of a transaction may be shared between a client and the database 120. For instance, in an SQL-based relational database, a client 110 interacts with the database using SQL statements. A client 110 can begin a transaction by submitting a SQL statement. That SQL statement is executed by the database 120. Depending on the exact SQL statement, the database 120 performs various read and/or write operations as well as invokes algorithmic program logic typically to determine which (and whether) data-cells are read and/or written. Once that SQL statement completes, the transaction is generally still in progress. The client 110 receives the response for that SQL statement and potentially executes some algorithmic program logic (inside the client node) that may be based on the results of the previous SQL commands, and as a result of that additional program logic, may submit an additional SQL statement and so on and so forth. At a certain point, and once the client 110 receives an SQL statement response, the client can instruct the database 120 to commit the transaction.
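As a non-limiting illustration of the shared execution described above, the following sketch uses Python's standard sqlite3 module as a stand-in for the database 120. The employee schema, the sample rows, and the function names promote_if_eligible and make_demo_db are assumptions introduced only for this example. The client begins a transaction, submits one statement, runs its own program logic on the response, conditionally submits a second statement, and then instructs the database to commit.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical employee table."""
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # transaction boundaries below are controlled by explicit BEGIN/COMMIT.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE employee ("
        "name TEXT PRIMARY KEY, profession TEXT, "
        "start_year INTEGER, salary INTEGER)"
    )
    conn.execute(
        "INSERT INTO employee VALUES ('john', 'software_engineer', 2008, 100000)"
    )
    conn.execute(
        "INSERT INTO employee VALUES ('betty', 'software_engineer', 2015, 90000)"
    )
    return conn


def promote_if_eligible(conn, name):
    """One transaction made of several statements with client logic in between."""
    cur = conn.cursor()
    cur.execute("BEGIN")  # the client starts the transaction
    # First statement: executed by the database, response returned to the client.
    cur.execute(
        "SELECT profession, start_year, salary FROM employee WHERE name = ?",
        (name,),
    )
    row = cur.fetchone()
    promoted = False
    # Client-side program logic based on the previous response.
    if row is not None and row[0] == "software_engineer" and row[1] < 2010:
        # Second statement, submitted only because of the values just read.
        cur.execute(
            "UPDATE employee SET profession = ?, salary = ? WHERE name = ?",
            ("senior_software_engineer", row[2] * 110 // 100, name),
        )
        promoted = True
    cur.execute("COMMIT")  # the client instructs the database to commit
    return promoted
```

Between the BEGIN and the COMMIT the transaction remains in progress, even though each individual statement has already completed.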
It should be noted that a client 110 can submit a transaction as a whole to the database 120, and/or submit multiple statements for the same transaction together, and/or submit a statement to the database 120 with an indication for the database to commit after the database 120 completes the execution of that statement.
It should be further noted that transactions may be abortable by the database 120 and/or a client 110. Often, aborting a transaction clears any of the transaction's activities.
For the sake of simplicity and ease of description, the following description refers to a transaction initiated and committed by a client, where statements of the transaction are performed by the database 120. A transaction may include one or more statements. A statement may include, for example, an SQL statement. One of the statements may include a request to commit the transaction. In order to execute such a statement, the database may break the statement execution into one or more tasks, where each such task is running on a node. With this modeling, a task does not execute on more than a single node, but multiple tasks of the same statement can execute on the same node if needed. A task is an algorithmic process that may require the execution of read operation(s) and/or write operation(s) on data cells.
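The statement-to-task modeling above may be sketched, by way of example only, as follows. The Task class, the split_into_tasks function, and the row_location mapping are hypothetical names introduced for this illustration. The sketch makes the simplifying assumption of one task per participating node, although the modeling also permits several tasks of the same statement on one node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    statement_id: str
    node: str        # a task executes on exactly one node
    row_ids: tuple   # rows this task reads and/or writes


def split_into_tasks(statement_id, row_ids, row_location):
    """Group the rows touched by a statement by the node that stores them.

    row_location is an assumed mapping from row ID to node name.
    """
    rows_by_node = {}
    for row_id in row_ids:
        rows_by_node.setdefault(row_location[row_id], []).append(row_id)
    # One task per node in this sketch; sorted for a stable order.
    return [
        Task(statement_id, node, tuple(rows))
        for node, rows in sorted(rows_by_node.items())
    ]
```

For instance, a statement touching rows 1, 2, and 3, where rows 1 and 3 reside on one node and row 2 on another, yields two tasks, neither of which spans more than a single node.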
As defined herein without any limitation, a “writing-transaction” refers to a transaction that writes data-cells. A writing-transaction may also read data-cells. Note that any write-only transaction is also a writing-transaction, but the opposite is not correct. A “reading-transaction” refers to a transaction that reads data-cells. A reading-transaction can also write data-cells. It should be noted that any read-only transaction is also a reading-transaction, but the opposite is not correct. A validating-transaction is a transaction being validated.
As part of its execution, a statement may evaluate one or more predicates. A predicate is a conditional (i.e., Boolean) expression that returns TRUE or FALSE. Predicates are commonly used in statements sent to databases and are often an inherent part of the database statement syntax or language. For example, a common usage of predicates would be to conditionally modify a data-cell(s) based on a condition (predicate) that is based on data-cell(s).
As an example, consider the following data-cells: john_hair_color, john_profession, john_salary, john_start_date; and the following statement:
The predicate is the IF expression and can return TRUE if john is both a software engineer AND started to work earlier than 2010, or FALSE, otherwise. The conditional actions are setting john_profession to a senior software engineer and raising his salary by 10%.
A statement evaluating predicates may consider the value of “Predicate Data-Cells” which are data-cells that were used to calculate the predicate. In the above example, those are john_profession and john_start_date. Another way to term this would be that the predicate is evaluating a single Data-Cell Set, where that data-cell set is (john_profession, john_start_date).
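By way of a non-limiting sketch, the example above may be expressed as follows. The code is a reconstruction from the prose description only; the sample cell values, the encoding of john_start_date as a year, and the function names predicate and run_statement are assumptions made for this illustration.

```python
# Hypothetical contents of the four data-cells from the example.
cells = {
    "john_hair_color": "brown",
    "john_profession": "software_engineer",
    "john_salary": 100000,
    "john_start_date": 2008,  # assumed to be stored as a year
}

# The data-cell set the predicate is evaluated on.
PREDICATE_DATA_CELLS = ("john_profession", "john_start_date")


def predicate(cells):
    """Return TRUE only if john is a software engineer who started before 2010."""
    return (
        cells["john_profession"] == "software_engineer"
        and cells["john_start_date"] < 2010
    )


def run_statement(cells):
    """Apply the conditional actions when the predicate evaluates to TRUE."""
    if predicate(cells):
        cells["john_profession"] = "senior_software_engineer"
        cells["john_salary"] = cells["john_salary"] * 110 // 100  # raise by 10%
        return True
    return False
```

Note that john_hair_color and john_salary are not predicate data-cells: the predicate never reads them, although john_salary is written by the conditional action.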
In databases, a statement can be executed on a single, specific row, where that statement involves a predicate (or multiple predicates), where each predicate evaluates a single data-cell set that is often associated with that row.
In addition, in relational databases, as well as in some non-relational databases, it is also possible to perform a statement on a set of rows where the specific identity of the rows is not explicitly known. Instead, the rows are selected according to various criteria and are often selected by a predicate.
For example, in a relational database with an employee table (a row represents each employee), the following SQL statement is performed: “For all the employees that have a profession of software_engineer and started to work in the company earlier than 2010, modify their profession to senior_software_engineer and raise their salary by 10%”. It should be noted that the SQL statements provided herein are not in their proper SQL syntax.
In that case, the scope of the statement is the entire table, and so is the scope of the predicate. While the predicate data-cells are actually the entire profession and start_date columns (i.e., all the corresponding cells for all the rows in the table), the predicate operates, each time, on a separate data-cell set. Such a data-cell set would be, for example, the cells: John's profession and John's start_date. The predicate will also operate on Betty's profession and Betty's start_date (yet another relevant data-cell set). However, inherently, according to the statement semantics, the predicate will not operate on John's profession together with Betty's start_date.
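The per-row evaluation described above may be illustrated, without limitation, by the following sketch. The table layout (a dictionary of rows keyed by employee name) and the function name promote_all are assumptions for this example. The predicate is evaluated once per row, each time on that row's own data-cell set, and never on a mix of cells from different rows.

```python
def promote_all(employee_table):
    """Apply the table-wide statement; return the IDs of the rows modified."""
    updated = []
    for row_id, row in employee_table.items():
        # The data-cell set for this evaluation comes from a single row.
        profession, start_date = row["profession"], row["start_date"]
        if profession == "software_engineer" and start_date < 2010:
            row["profession"] = "senior_software_engineer"
            row["salary"] = row["salary"] * 110 // 100  # raise by 10%
            updated.append(row_id)
    return updated
```

In this sketch, John's start_date is only ever combined with John's profession, mirroring the statement semantics described above.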
A transaction may be executed over the database 120 in three phases: working, validation, and commit. In some configurations, a transaction may be executed over the database 120 in two phases: working and commit. The embodiments carried by the disclosed predictive concurrency control protocol in each phase are discussed in great detail below.
In an embodiment, the database 120 is a distributed database and may be realized as a relational database system (RDBMS) or a non-relational database. As will be demonstrated in
The disclosed embodiments will be discussed with reference to a distributed database configuration. However, the disclosed embodiments may also be applicable to a non-distributed configuration of a database.
In one embodiment, the nodes 210, and hence the database 120 are designed with a shared-nothing architecture. In such an architecture, nodes 210 are independent and self-sufficient as they have their own disk space and memory. As such, in the database 120, the data is split into smaller sets distributed across the nodes 210. In another embodiment, the nodes 210, and hence the database 120 are designed with a shared-everything architecture where the storage is shared among all nodes 210.
The data managed by the database can be viewed as a set of data-cells. While the most natural form of those data-cells would be items, such as what relational databases refer to as “column cells”, those data-cells can actually be any type of data, data-object, file, and the like.
Databases often organize a higher level of a data object referred to as data-row (or simply row). A data-row may include a collection of specific data-cells. For example, in relational databases, a set of rows form a database table. The data-cells contained by a specific row are often related to one “entity” that the row describes. In relational databases, the concept of a data-row is inherent to the data-model (i.e., one of the foundations of the relational data-model is processing “data tuples” that are effectively data-rows). Often, data-cells can be added or removed only as part of their data-row. In other words, a data-row can be added (or removed), thus adding more (or removing existing) data-cells to the database.
Typically, all the data-cells of a specific row reside in close proximity (e.g., consecutively) on the storage device, as this can ensure that multiple cells of the same row (or all the cells of the row) can be read from the disk more cheaply (e.g., with a single small disk I/O) than if those cells were each stored elsewhere on the disk (e.g., with n disk I/Os to n different disk locations in order to retrieve n cells of the same row). Further, the metadata for managing the data-cell information may also be organized at a coarser resolution, as this may result in meaningfully less and smaller overall metadata.
In some embodiments, a specific data-row can be viewed as if it exists and just contains a single specific data-cell. In one configuration, and without limiting the scope of the disclosed embodiments, a single cell, and a single row may reside in a specific storage device of a node 210. However, it should be noted that a row can be divided across multiple nodes. It should be further noted that the disclosed embodiments can be adapted to operate in databases where data cells are stored and arranged in different structures. In some embodiments, where a row is divided across multiple nodes, the “sub row” that is stored under a single node and/or storage device could be treated as a data-row.
In another embodiment, and without limiting the scope of the disclosed embodiments, the database may also store various pieces of data, in addition to the data-cells and data-rows, including, but not limited to, any and all metadata, various data structures, configuration information, a combination thereof, and the like (hereinafter “metadata”). Additionally, in an embodiment, and without limiting the scope of the disclosed embodiments, the database may also store index information that may be used, for example, for faster searching of data-rows.
In one configuration, the database 120 may maintain at least one index-node and at least one table-node. An index-node is a node (e.g., node 210-1) in which an index, or a part of an index, is stored. A table-node is a node (e.g., node 210-n) in which at least one data-row of a data table is stored. An index-node may also be a table-node, e.g., an index of a data table and data-rows of a data table may reside on the same node. Additionally, the table-nodes and index-nodes may contain data-rows of other tables or may contain other indexes, associated with the same table or with different tables. In some configurations, a node may include an index and data (data-rows).
For simplicity and clarity purposes, index-nodes and table-nodes may be referred to with unique names based on the context of the various disclosed embodiments, e.g., profession index-nodes are all the index-nodes that store a profession index (e.g., of an employee table), and employee-table table-nodes are all the table-nodes that store rows of an employee-table. In the various disclosed embodiments, using indexes allows for faster processing of transactions, e.g., for faster searching of data-rows.
An index is defined as a data structure that maps index values to data-rows of a certain data table. In an embodiment, an index maps an index value to a row ID or equivalent row pointer. In the various embodiments disclosed herein, it is assumed that an index value maps to a row ID unless stated otherwise. In one configuration, an index may correspond to a data column in a data table, e.g., profession, salary, hair color, etc. For example, in a profession index of an employee table, the profession index maps each profession that currently exists for at least one of the existing employees in the employee table, to the row (employee), or rows (employees) that have that profession. For example, a profession index value may be a software engineer. In this case, the profession index may be used for searching all employee table rows having a given profession value, or set of profession values. The index value of software engineer maps to all the row IDs in the employee table where the profession is a software engineer. When a data-cell is written, the index that corresponds to the column of the written data-cell, if such an index exists, is updated accordingly.
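As a non-limiting illustration, the mapping from index values to row IDs, and the update of the index when an indexed data-cell is written, may be sketched as follows. A plain in-memory dictionary is used here purely for brevity and is not a persistent structure; the ColumnIndex class and the write_profession function are hypothetical names introduced for this example.

```python
from collections import defaultdict


class ColumnIndex:
    """Maps each index value to the set of row IDs that hold that value."""

    def __init__(self):
        self._rows_by_value = defaultdict(set)

    def insert(self, value, row_id):
        self._rows_by_value[value].add(row_id)

    def remove(self, value, row_id):
        rows = self._rows_by_value.get(value)
        if rows is None:
            return
        rows.discard(row_id)
        if not rows:
            # Keep only values that exist for at least one row.
            del self._rows_by_value[value]

    def lookup(self, value):
        return set(self._rows_by_value.get(value, ()))

    def values(self):
        return set(self._rows_by_value)


def write_profession(table, index, row_id, new_value):
    """Write the profession data-cell and update the profession index accordingly."""
    old_value = table[row_id].get("profession")
    if old_value is not None:
        index.remove(old_value, row_id)
    table[row_id]["profession"] = new_value
    index.insert(new_value, row_id)
```

A lookup of the value software engineer then returns every row ID whose profession cell currently holds that value, and a value that no longer exists for any row disappears from the index.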
In an embodiment, an index may be a unique index. A unique index ensures that the indexed column contains only distinct values, e.g., no two rows can have the same value in that column. In another embodiment, an index may be a non-unique index. A non-unique index allows for duplicate values in the indexed column.
In one configuration, indexes are implemented using persistent B+trees. In another configuration, indexes are implemented using persistent hash-tables. According to the various disclosed embodiments, indexes are assumed to be implemented as persistent B+trees unless stated otherwise.
In a distributed database, an index may be distributed or non-distributed. A non-distributed index means that the index is stored entirely on only one index-node, and covers all the rows that are stored in the pertinent table, where the table's rows may be stored in that index-node, in another node, or distributed across multiple nodes. A distributed index means that the index spans multiple index-nodes, that is, its content is divided and stored on multiple index-nodes. For example, the profession index may be implemented as a distributed B+Tree, distributed across index-nodes N30, N31 and N32, where index information for some professions is stored in node N30, some in node N31 and some in node N32. Such a B+Tree would normally be ordered by profession (e.g., ordered by the profession ID which is, for example, a numerical value). Different indexes (whether distributed indexes or non-distributed indexes) that index columns of the same table are separate entities and can be distributed on different index-nodes. Although an index and a data-row of a data table may reside on the same node, an index value in that index for a particular row ID is not necessarily stored on the same node as the particular row ID is stored.
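The distributed profession index of the example above may be sketched, without limitation, as a range partitioning of numerical profession IDs across index-nodes N30, N31, and N32. The boundary values, the RangePartitionedIndex class, and its method names are assumptions introduced for this illustration; per-node dictionaries stand in for the portions of the distributed B+Tree.

```python
import bisect


class RangePartitionedIndex:
    """Routes each profession ID to the index-node holding its key range."""

    def __init__(self, upper_bounds, nodes):
        # upper_bounds[i] is the exclusive upper bound of the IDs stored on
        # nodes[i]; the last node stores every ID at or above the last bound.
        if len(nodes) != len(upper_bounds) + 1:
            raise ValueError("need exactly one more node than bounds")
        self._bounds = list(upper_bounds)
        self._nodes = list(nodes)
        self._entries = {node: {} for node in nodes}

    def node_for(self, profession_id):
        return self._nodes[bisect.bisect_right(self._bounds, profession_id)]

    def insert(self, profession_id, row_id):
        part = self._entries[self.node_for(profession_id)]
        part.setdefault(profession_id, set()).add(row_id)

    def lookup(self, profession_id):
        part = self._entries[self.node_for(profession_id)]
        return set(part.get(profession_id, ()))

    def entries_on(self, node):
        return self._entries[node]
```

With assumed bounds of 100 and 200, IDs below 100 are indexed on N30, IDs from 100 up to but not including 200 on N31, and the remainder on N32, regardless of which table-node stores the indexed row itself.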
In some embodiments, an operation of a task may access a single data cell in a single node 210. Furthermore, multiple operations (of the same or different transactions) may access the same data cells simultaneously or substantially at the same time. There is no synchronization when such operations, tasks, or statements of a transaction or transactions are performed. In a typical computing environment, hundreds of concurrent transactions can access the database 120. As such, maintaining and controlling the consistency of transactions and their operations is a critical issue to resolve in databases.
In an embodiment, each node 210 includes an agent 215 configured to manage access to data stored on the respective node. It should be noted that an agent 215 can operate in an index-node or a table-node. In an index-node, agent 215 may perform operations related to inserting, deleting, or updating an index-entry of an index residing in the respective node. The agent 215 may be realized in hardware, software, firmware, or combination thereof. Software shall be construed broadly to mean any type of instructions, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Instructions may include code (e.g., in source code format, binary code format, executable code format, or any other suitable format of code).
The agent 215 is configured to manage the contents of data cells and operations within a node. For example, when a write operation requires access to data cell(s), the agent 215 may be configured to point to the requested cell(s). In an embodiment, each transaction is managed by a transaction manager 217. A transaction manager 217 is an instance executed over one of the nodes 210 and configured to orchestrate the execution of a transaction across one or more nodes 210. The transaction manager 217 may be instantiated on any node 210. In the example shown in
It should be noted that a transaction manager 217 can reside in any node, including a node that is an index-node or a node that is a table-node. It should be further noted that the transaction managers 217 and agents 215 are logical entities that may reside on different nodes, and that allow the execution of transactions to be managed across multiple nodes.
In an embodiment, and for brevity and clarity, the following notation may be used: ‘TRx-AGnn’ is the transaction agent that is instantiated on node ‘nn’, in the context of operation(s) that the specific transaction agent performs on behalf of transaction ‘TRx’. For example, TR1-AG30 may refer to the transaction agent instantiated on node 30, in the context of operations it performs for transaction TR1. This disclosure may refer to such an entity as ‘the transaction agent of TR1 in node N30’. It should be noted that TR1-AG30 and TR2-AG30 may be referred to as ‘the transaction agent of TR1 in node N30’ and ‘the transaction agent of TR2 in node N30’, respectively, where both refer to the same transaction agent 215 instantiated on node N30 (but in the context of the specific operations it performed on behalf of the said transaction). It should also be noted that the terms ‘agent’ and ‘transaction agent’ are used herein interchangeably.
The disclosed embodiments are related to changes and operations on table-nodes. Therefore, nodes discussed hereinafter are “table-nodes”.
The transaction manager 217 can be realized as a software object, a script, and the like executed over the hardware of a respective node 210. It should be noted that multiple transaction managers may be executed on one node or multiple nodes, where each transaction manager handles a single transaction.
The operation of a transaction manager 217 carries through three phases: working, validation, and commit. In the working-phase, one or more statements of a transaction are processed. In the validation-phase, all data cells that have been written through the transaction are validated. In the commit-phase, the entire transaction is committed.
The disclosed embodiments provide a predictive CCP with protective validation techniques implemented into a distributed database, including a technique denoted as the ‘complete-instantiation-technique’. The predictive CCP allows, in some cases, the early commit of validating transactions. That is, in some cases, a validating writing-transaction (hereinafter “validating-transaction TR1”) may progress to commitment even when it modified a data-cell, or multiple data-cells that were read by a reading-transaction (hereinafter “reading-transaction TR101”) that has not yet committed.
The disclosed predictive CCP is an expansion of a CCP discussed in the above-referenced Ser. No. 18/341,279 application. In general, optimistic CCP approaches are non-blocking, but tend to abort transactions upon the detection of conflicts, and usually require the detection of read/write, write/write, and write/read conflicts. As opposed to conventional optimistic CCP approaches, the disclosed embodiments are more tolerant, as the predictive CCP requires only the detection of read/write conflicts. Further, the predictive CCP allows, under some cases, ignoring read/write conflicts that cannot be ignored by conventional optimistic CCP approaches.
Furthermore, according to the disclosed predictive CCP, even if the read/write conflict cannot be ignored, transactions that participate in such a conflict will generally not abort. Instead, in the disclosed protocol, dependencies among such transactions will alter the order of commitments. Any such blocking during validation-phase is performed only after a validating transaction has already completed its working-phase and thereby released the resources that were required for its execution. In that respect, such a blocking would use meaningfully fewer resources than a blocking by a conventional CCP. Furthermore, in database environments, the realization of these dependencies is generally simple and consumes minimal resources.
It should be noted that as would also apply to conventional pessimistic and optimistic CCPs, the predictive CCP is not immune from inter-transaction deadlocks. In the case where an inter-transaction deadlock is detected, one transaction out of the deadlock cycle would be aborted. Techniques for handling deadlocks, including deadlock detection and deadlock prevention techniques, are beyond the scope of the present disclosure.
It should be further noted that the disclosed predictive CCP allows for the performance of a higher degree of parallelism in transaction execution relative to pessimistic solutions while maintaining the same state of the database at the end of processing such transactions as if the transactions were executed in serial. This allows for the fast execution of transactions and the processing of more transactions at a given time period. Therefore, the disclosed embodiments provide a technical improvement over current database systems that, in most cases, fail to serve applications that require fast and parallel execution of transactions for retrieval and modification of datasets. The disclosed predictive CCP can be implemented in database systems as well as in data management systems, such as an object storage system, a key-value storage system, a file-system, and the like.
As briefly mentioned above, in the predictive CCP, in some cases, a validating-transaction TR1 (i.e., a transaction that is in a validation-phase) that has modified a data-cell (or a set of data-cells) previously read by an existing reading-transaction TR101 may be enabled to commit even prior to the completion of reading-transaction TR101, while maintaining serializability and other expected consistency properties. This enablement improves the concurrency of the transaction execution. In contrast, it should be noted that in some CCPs disclosed in the related art, validating-transaction TR1 would always be dependent on reading-transaction TR101's completion and would not be able to commit prior to the completion of reading-transaction TR101.
It should be noted that the above-mentioned cases that allow such an earlier commitment have to do with cases where reading-transaction TR101 evaluated a predicate as part of its execution. As mentioned above, a predicate, as discussed in the related art, may be defined as a part of a transaction statement within a database that describes a condition upon which an action may commence. As a non-limiting example, a transaction enacted on a single row in a database may be colloquially described as the following directive: “If John's profession is a software engineer and John's start date is before Jan. 1, 2010, then increase John's salary by ten percent and update John's profession to senior software engineer.” For such a transaction, the predicates are the variables included in the “if” clause, namely “John's profession” and “John's start date”. In contrast, the actions are the steps taken in the “then” clause, namely the increase in John's salary and the update to John's profession.
It should be noted that, as previously discussed, predicates can also be used as part of a statement that selects one or multiple rows that satisfy a predicate. For example, in such a statement where the predicate data-cells are “profession” and “start date”, the predicate data-cell set may comprise “[Jane's profession, Jane's start_date]”, “[John's profession, John's start_date]”, and so on.
In an embodiment, reading-transaction TR101 may have read a relevant data-cell set as part of a predicate evaluation, where, after the predicate returned TRUE or FALSE, the actual concrete contents of the read data-cells that were used for the predicate evaluation are not further used by reading-transaction TR101. In such cases, if a validating-transaction TR1 modifies one or more of those predicate data-cells in a way that will not affect the result of the predicate, then, with some further conditions fulfilled, validating-transaction TR1 may consider itself not dependent on that specific reading-transaction TR101's predicate evaluation (and its associated reads). In this specific example, if no other dependencies of validating-transaction TR1 on reading-transaction TR101 are detected, validating-transaction TR1 may commit before reading-transaction TR101's commitment, i.e., validating-transaction TR1 is not dependent on reading-transaction TR101.
The disclosed embodiments ensure that the predictive CCP implemented into a distributed CCP maintains the principle that the main validation work done by a transaction agent (of, say, a validating writing-transaction TR1) is as local-to-the-node as possible. This is achieved using a conditional read vector entry (RV-entry), discussed in detail below.
It should be noted that the improvement in commitment efficiency described above can be meaningfully beneficial. For example, in relational databases as well as other databases, there are direct ways to access specific cells of specific rows (e.g., by specifying a row ID, a primary index, etc.). However, there are (for example) SQL statements with a broader scope where such a statement acts upon a set of row(s) that are selected by evaluating a predicate. The table rows that satisfy the predicate are the ones that are affected by the statement. The predicate evaluation is either done by a full (data) scan, by index searches or by a combination of index searches and data scans.
From a general serializable CCP perspective (i.e., without the mechanisms described by this disclosure), such a predicate-based search (e.g., performed by a reading-transaction TR101) is generally analogous to reading all the predicate data-cells of all the rows in the table (e.g., of the entire columns related to the predicate), even if only some or even very few of the rows answer the predicate and are actually used by reading-transaction TR101. That may meaningfully limit the concurrency in transaction execution, as it may create many conflicts with other transactions. For example, a validating-transaction TR1 that modified pertinent data-cells in a couple of rows that were not selected by reading-transaction TR101 may, in many cases, be blocked due to reading-transaction TR101, despite the fact reading-transaction TR101 did not select these couple of rows. Therefore, the disclosed embodiments provide mechanisms that minimize such dependencies whenever possible.
At S310, at least one statement that is part of a transaction initiated by a client is received. A transaction may include a collection of statements, each of which may include a collection of tasks. A task may require the execution of read operation(s), write operation(s), or both. A task may be a program or logic typically executed by an agent. A read operation requires reading data from a data cell, while a write operation requires writing data to a data cell. A statement may include a commit statement, thereby committing the transaction.
At S320, it is checked if a received statement is a commit statement, and if so, execution continues with S340; otherwise, execution continues with S330.
At S330, nodes (210,
In an embodiment, the statement involves a predicate evaluation, where the database performs the predicate evaluation by accessing all the table-nodes containing rows of the associated table. For example, there is an employee table that is stored across table-nodes N20, N21, N22, N23, and N24. That is, N20-N24 are the employee-table-nodes. A statement of reading-transaction TR101 evaluates a predicate that selects all employees with red/orange hair-color that are of height>180 cm. In that example, the database evaluates the predicate by accessing all the employee-table-nodes and, in each such an employee-table-node, by scanning the employee-table, evaluating the predicate by reading the pertinent cells in the data-rows.
In another embodiment, the statement involves a predicate evaluation, where the database may perform the predicate evaluation without necessarily needing to read rows in all the associated table's table-nodes. For example, incorporating an index search for the predicate evaluation may result in such a scenario. The example uses the above example scenario of reading-transaction TR101, but with the following change: The database evaluates the predicate by incorporating a hair-color index-search for all employees with red/orange hair-color, resulting in a row-ID list of those rows. Then, the predicate evaluation continues by reading the resulting rows, to further narrow down the rows to be selected by the predicate, by checking the corresponding height column and selecting those with height>180 cm. In this example, the row-ID list produced by the index-search covers only rows of nodes N20 and N21 because the other employee-table-nodes (N22, N23, and N24) do not contain employees with red/orange hair-color. The table-nodes (of the pertinent table associated with the predicate) that contain rows of the resulting row-ID list may be denoted as ‘covered-nodes’, whereas the other table-nodes (of the pertinent table associated with the predicate) that do not contain rows of the resulting row-ID list may be denoted as ‘uncovered-nodes’. S330 further includes identifying all the employee-table-nodes, including the covered-nodes as well as the uncovered-nodes. In the last-mentioned example, that means that S330 identifies N20 and N21 (covered-nodes) as well as N22, N23, and N24 (uncovered-nodes). A node (N) may be any node discussed with reference to
It should be noted that a statement may include a plurality of predicate evaluations, that may be of the same or different tables. The identification of all those table-nodes, as described hereinabove, is performed at S330.
At S335, the tasks are sent to the agents residing in the determined (identified) nodes. Such nodes process the tasks that are part of a received statement. In an embodiment, a list of the determined agents participating in the working phase of the statement is maintained. Further, for each such agent, it is determined if the agent performs at least one write operation to a data-cell that is part of a table's data-row, during the entire execution of a transaction. The execution of such operations by an agent (215) during a working phase is discussed below. It should be noted that S330 and S335 may be performed iteratively as part of the execution of one task, when it is determined that another task is required. It should be further noted that S330 and S335 may be performed in parallel or, at certain times, in a different order.
In an embodiment, according to the complete-instantiation-technique, each identified table-node (e.g., N20, N21, N22, N23, and N24, in the above example scenario) for a predicate evaluation instantiates a conditional RV-entry. That is, a conditional RV-entry representing the pertinent predicate evaluation by the reading-transaction TR101 is instantiated on each of the identified table-nodes (covered-nodes as well as uncovered-nodes) associated with the corresponding table. Such agents for covered-nodes (for example TR101-AG20 and TR101-AG21 in the scenario) may each be instantiating that conditional RV-entry as part of their procedure for reading data-cells, as discussed in detail hereinbelow. Additionally, agents for uncovered-nodes (for example TR101-AG22, TR101-AG23, and TR101-AG24, in the above example scenario) are each called to instantiate that conditional RV-entry despite the fact these agents do not perform data-cell reads for the purpose of the corresponding predicate evaluation. Conditional RV-entries are discussed in detail hereinbelow.
It should be noted that adding a conditional RV-entry to reading-transaction TR101's RV in each of the identified table-nodes allows a validating-transaction (e.g., TR1) to detect a conditional RV-entry of a conflicting reading-transaction. A conditional RV-entry designates the entire predicate evaluation, including information describing the predicate that is evaluated. A single conditional RV-entry may represent a predicate evaluation of a single data-cell set or of multiple data-cell sets. A predicate evaluation of multiple data-cell sets may occur, for example, if the scope of the predicate contains multiple rows or the entire set of rows of a table. RVs, conditional RV-entries, and non-conditional RV-entries are discussed in detail hereinbelow.
The following example includes an index search (e.g., by a statement of reading-transaction TR101 that evaluates a predicate that selects all employees with red/orange hair-color that are of height>180 cm) where N20 and N21 are covered-nodes and N22, N23, and N24 are uncovered-nodes, and Jane is an employee whose row is stored in employee-table-node N22, where Jane's hair-color is brown and her height is 190 cm. Concurrently executing writing-transaction TR1 modifies Jane's hair-color to red. If the corresponding TR101's conditional RV-entry for the above-mentioned predicate evaluation would be instantiated only in the covered-nodes (N20 and N21), TR1-AG22 cannot detect this conditional RV-entry as part of its validation because such an RV-entry wouldn't be instantiated in N22. Thus, adding a conditional RV-entry to reading-transaction TR101's RV in each of the table-nodes N20 through N24 (in both the pertinent covered-nodes and uncovered-nodes) allows TR1-AG22 to detect a pertinent conditional RV-entry in N22 that indicates a conditional conflict between the validating-transaction TR1 and the reading-transaction TR101. Conditional conflicts are discussed in more detail hereinbelow.
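As a non-limiting illustration, the above scenario may be sketched as follows. The sketch compares instantiating the conditional RV-entry on covered-nodes only with the complete-instantiation-technique. The names TableNode, instantiate_rv_entries, and readers_seen_on_n22, as well as the dictionary layout of the RVs, are hypothetical and are used here for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class TableNode:
    """Hypothetical table-node holding the read vectors (RVs) of its agents."""
    name: str
    # transaction id -> list of conditional RV-entries (predicate labels)
    read_vectors: dict = field(default_factory=dict)

    def add_conditional_rv_entry(self, txn, predicate):
        self.read_vectors.setdefault(txn, []).append(predicate)

    def conditional_readers(self, validating_txn):
        """Other transactions holding a conditional RV-entry on this node."""
        return sorted(txn for txn, entries in self.read_vectors.items()
                      if txn != validating_txn and entries)


def instantiate_rv_entries(nodes, covered, txn, predicate, complete):
    """Add the RV-entry on all identified nodes, or on covered-nodes only."""
    for node in nodes.values():
        if complete or node.name in covered:
            node.add_conditional_rv_entry(txn, predicate)


def readers_seen_on_n22(complete):
    nodes = {n: TableNode(n) for n in ("N20", "N21", "N22", "N23", "N24")}
    instantiate_rv_entries(nodes, {"N20", "N21"}, "TR101",
                           "red/orange hair and height>180", complete)
    # TR1-AG22 validates its write to Jane's hair-color on N22.
    return nodes["N22"].conditional_readers("TR1")


print(readers_seen_on_n22(complete=False))  # []
print(readers_seen_on_n22(complete=True))   # ['TR101']
```

With covered-node instantiation only, TR1-AG22 finds no RV-entry of TR101 on N22 and the conflict goes unnoticed. With complete instantiation the entry is present on N22 and can be detected locally.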
At S337, at the end of the execution of all tasks associated with the received statement, a response is sent back to the client with the results of the processing of the statement. Then, execution returns to S310.
Execution reaches S340 when a commit statement is received from the client. At this stage, a validation request is sent to every agent that performed a write operation to data-cells of table's data-rows, for and during the execution of the transaction that received the commit statement. It should be noted that if no write operation has been performed, there is no validation request, and execution continues with S350. In the CCP, each agent performing a validation will take a commit pause at the end of the validation process. The commit pause is taken to enable the atomicity of the distributed transaction commitment, by preventing race conditions between the committing transaction that completed its validation and other transactions that may then attempt to read data-cells that were modified by the committing transaction. An embodiment for validating a transaction according to the predictive CCP is discussed with reference to
At S350, upon receiving validation confirmation messages from agents that performed write operations on behalf of the transaction, committed messages are sent to all agents regardless of if such agents performed write operations or not. A committed message indicates to the agent to commit the operations performed, perform various cleanups, and to release the commit pause taken during the validation phase. At S360, an acknowledgment is sent to the client that the transaction is committed. It should be noted that S350 and S360 can be performed in parallel or in a different order.
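As a non-limiting illustration, the flow of S310 through S360 may be sketched as follows. This is a minimal, single-process sketch assuming synchronous calls in place of messages. The classes Agent and TransactionManager, the statement dictionaries, and the event labels are hypothetical and do not represent the actual interfaces of the disclosed system.

```python
class Agent:
    """Hypothetical agent that records the messages it receives."""

    def __init__(self, node):
        self.node = node
        self.wrote = False
        self.events = []

    def run_task(self, op):
        self.wrote = self.wrote or op == "write"
        self.events.append(op)

    def validate(self):
        # Validation ends with the agent taking a commit pause.
        self.events.append("validate+commit-pause")
        return True

    def committed(self):
        # Commit the operations, clean up, and release the commit pause.
        self.events.append("committed")


class TransactionManager:
    def __init__(self, agents_by_node):
        self.agents_by_node = agents_by_node
        self.participants = {}  # agents that took part in the working phase

    def handle(self, statement):
        if statement["kind"] != "commit":                    # S320
            for node, op in statement["tasks"]:              # S330, S335
                agent = self.agents_by_node[node]
                self.participants[node] = agent
                agent.run_task(op)
            return "statement-result"                        # S337
        writers = [a for a in self.participants.values() if a.wrote]
        if not all(a.validate() for a in writers):           # S340
            raise RuntimeError("validation failed")
        for agent in self.participants.values():             # S350
            agent.committed()
        return "transaction-committed"                       # S360


agents = {n: Agent(n) for n in ("N20", "N21")}
tm = TransactionManager(agents)
tm.handle({"kind": "statement", "tasks": [("N20", "write"), ("N21", "read")]})
print(tm.handle({"kind": "commit"}))   # transaction-committed
print(agents["N20"].events)  # ['write', 'validate+commit-pause', 'committed']
print(agents["N21"].events)  # ['read', 'committed']
```

In this sketch, only the agent that wrote receives a validation request, while the committed message reaches every participating agent, whether or not it performed a write operation.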
At S410, a validation request is received. In an embodiment, such a request is received by agents that performed write operations on behalf of validating-transaction TR1. If the validating-agent TR1-AG20 did not perform a write operation (for example, if TR1-AG20 only read data-cells and did not write data-cells), then, upon receiving a committed message from TR1's transaction manager, execution of TR1 will reach S440.
At S420, a validation process is performed. In some embodiments, the validation process is performed for all write operations executed by the validating-agent TR1-AG20 on behalf of validating-transaction TR1. The various embodiments of S420 are further discussed in
At S430, a validation completion message is issued. In an embodiment, such a message is sent to the transaction manager and issued by the validating-agent TR1-AG20. As noted above, when such messages are issued, commit pauses are taken to allow the transaction to commit.
At S440, upon receiving a committed message, a committed-message handling procedure is performed. In one embodiment, during this procedure, commit pauses that the validating-agent TR1-AG20 placed and all dependencies associated with the validating-agent TR1-AG20 are released. Furthermore, when a transaction that wrote to data-cells commits, the contents of the data-cells become “committed” to override data currently stored in data-cells. The various embodiments of S440 are further discussed in
The process S420 further inspects whether the validating-transaction TR1 can be enabled for an early commit over a reading-transaction (e.g., TR101) that evaluated a predicate(s) that uses data-cells that validating-agent TR1-AG20 modified. Additionally, the process S420, according to some embodiments, includes protective validation techniques described in detail hereinafter.
A validation-phase is a process within a transaction, for example, validating-transaction TR1, whereby validating-transaction TR1 checks whether at least one reading-transaction has read any of the data-cells that were modified by validating-transaction TR1. A validation-phase may occur after a working-phase and may result in a block of validating-transaction TR1 until the commitment of a reading-transaction (e.g., TR101) on which validating-transaction TR1 is determined to be dependent. According to the disclosed embodiments, the validation-phase may not always determine that validating-transaction TR1 depends on reading-transaction TR101 that has read data-cells that validating-transaction TR1 modified, and hence, in some cases, may allow validating-transaction TR1 to commit before reading-transaction TR101 has committed.
Typically, the process S420 is performed when a commit statement is received from the client. At this stage, as described hereinabove, a validation request is sent to all of TR1's agents that have performed a write operation to a data-cell of a table's data-row.
It should be noted that according to the disclosed embodiments, during a working-phase of a transaction (TR5), a read-vector (RV) and a write-vector (WV) may be created in each of TR5's agents. The RVs and WVs are updated and scanned during the working-phases and validation-phases to avoid situations of data conflicts. It should be noted that scanning the vectors is only one technique that can be used herein; other techniques, for example, lookup tables, may be used instead.
For example, during a working-phase of a transaction TR5 that has an agent in table-node N20 (TR5-AG20), if TR5-AG20 reads a data-cell that is not for the purpose of a predicate evaluation (non-conditional read), TR5-AG20 may then add an RV-entry (non-conditional RV-entry) to its RV. This entry designates the data-cell being read. TR5-AG20 may then read the most up-to-date committed cell contents. The process for reading data cells according to the disclosed embodiments is discussed in
In one embodiment, during a working-phase of TR5, if TR5-AG20 evaluates a predicate, the agent TR5-AG20 may add an RV-entry (conditional RV-entry) to its RV. As noted above, such an entry designates the entire predicate evaluation, including information describing the predicate that is evaluated. A single conditional RV-entry may represent a predicate evaluation of a single data-cell set or of multiple data-cell sets. A predicate evaluation of multiple data-cell sets may occur, for example, if the scope of the predicate contains multiple rows or the entire set of rows of a table.
Then, according to an embodiment, TR5-AG20 may perform the predicate evaluation of one or more data-cell sets by reading their most up-to-date committed cell contents. Such data-cell read(s) may be denoted as a “conditional read”. During a working-phase of TR5, when TR5-AG20 writes a data-cell, TR5-AG20 may add a WV-entry to its WV, designating the data-cells being written. Additionally, TR5-AG20 writes the data-cell contents in an “uncommitted manner” such that they are “private” and hence inaccessible for reading by any other transaction agent in node N20. Such a data-cell write may not override or change any elements of the currently committed data-cell contents.
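As a non-limiting illustration, the working-phase bookkeeping described above may be sketched as follows. The class TransactionAgent, its attribute names, and the tuple layout of the RV-entries are hypothetical. The salary cells and the predicate label are example data only.

```python
class TransactionAgent:
    """Hypothetical agent of one transaction on one node (e.g., TR5-AG20)."""

    def __init__(self, committed_cells):
        self.committed = committed_cells   # shared committed contents of the node
        self.rv = []                       # read-vector entries
        self.wv = []                       # write-vector entries
        self.private = {}                  # uncommitted writes, invisible to others

    def read(self, cell):
        self.rv.append(("non-conditional", cell))   # entry added before the read
        return self.committed[cell]

    def evaluate_predicate(self, label, predicate, rows):
        self.rv.append(("conditional", label))      # one entry for the evaluation
        return [r for r in rows if predicate(self.committed, r)]

    def write(self, cell, value):
        self.wv.append(cell)
        self.private[cell] = value                  # committed contents untouched


node_n20 = {("Jane", "salary"): 100, ("John", "salary"): 90}
tr5, tr6 = TransactionAgent(node_n20), TransactionAgent(node_n20)
tr5.write(("Jane", "salary"), 110)
high_paid = tr6.evaluate_predicate(
    "salary>95", lambda cells, row: cells[(row, "salary")] > 95, ["Jane", "John"])
print(tr6.read(("Jane", "salary")))   # 100, TR5's write is still private
print(high_paid)                      # ['Jane']
print(tr5.wv, tr6.rv)
```

The sketch keeps uncommitted values in a per-agent dictionary, which is one simple way to make them private. The disclosure does not prescribe this particular layout.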
At S510, the current non-conditional conflicts between the validating-agent TR1-AG20 and their related conflicting transactions are identified. In general, a conflict may be indicated by the presence of cells that were modified by the validating-agent TR1-AG20 and were read by another existing reading-transaction agent on that node (N20), TR101-AG20. It should be noted that a conflict may be denoted as existing between a writing-transaction and a reading-transaction, and a conflict can also be denoted as existing between a writing-transaction agent and a reading-transaction agent, or any combination thereof (e.g., between a transaction and an agent, etc.). All those ways of notation are equivalent and may be used interchangeably. A non-conditional conflict is defined as a conflict pertaining to a read operation by the reading-transaction agent TR101-AG20 that was not performed as part of a predicate evaluation. Such a reading-transaction may be denoted as a “conflicting transaction”. In an embodiment, all the current non-conditional conflicts between the validating-agent TR1-AG20 and other transaction-agents in N20 are identified. In one example embodiment, this identification includes iteratively scanning the WV of the validating-agent TR1-AG20 for data cells that TR1-AG20 wrote to. Further, for each such data cell, all active reading-transaction agents in N20 (except for validating-transaction TR1 itself) that read from the cell are identified. This can be performed by scanning the reading transaction agents' RVs.
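As a non-limiting illustration, the scan described for S510 may be sketched as follows. The function non_conditional_conflicts, the cell labels, and the mapping from each transaction to the set of cells in its non-conditional RV-entries are hypothetical. As noted above, scanning is only one possible technique.

```python
def non_conditional_conflicts(validating_txn, write_vector, read_vectors):
    """Return (cell, reader) pairs where an active reader read a written cell.

    read_vectors maps each active transaction on the node to the set of cells
    recorded by its non-conditional RV-entries (a hypothetical layout).
    """
    conflicts = []
    for cell in write_vector:                       # scan the validator's WV
        for reader, cells_read in sorted(read_vectors.items()):
            if reader != validating_txn and cell in cells_read:
                conflicts.append((cell, reader))
    return conflicts


wv_tr1 = ["c1", "c2"]
rvs_on_n20 = {
    "TR1": {"c1"},          # the validator itself is skipped
    "TR101": {"c2", "c9"},
    "TR102": {"c1", "c2"},
}
print(non_conditional_conflicts("TR1", wv_tr1, rvs_on_n20))
# [('c1', 'TR102'), ('c2', 'TR101'), ('c2', 'TR102')]
```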
At S520, each identified non-conditional conflict is processed. In an embodiment, S520 includes marking the validating-transaction TR1 (and/or its agent TR1-AG20) as dependent on each of the identified conflicting reading-transactions. The dependencies can be maintained in a data structure, such as a graph, a table, and the like, and may be stored on node N20. It should be noted that if no non-conditional conflicts are identified, S520 is skipped. S520 is described in more detail with respect to
At S530, the current conditional conflicts between the validating-agent TR1-AG20 and reading-transactions are identified. In an embodiment, the related conflicting transactions are identified. A conditional conflict is defined as a conflict pertaining to a read by a reading-transaction agent TR101-AG20 that was performed as part of and for the purpose of a predicate evaluation of a specific data-cell set. That is, the conditional conflict occurs if TR1-AG20 wrote to that data-cell set. In an embodiment, all the current conditional conflicts between TR1-AG20 and other reading-transactions are identified. A read that was performed as part of and for the purpose of a predicate evaluation may be denoted as a conditional read and may include the creation of a conditional RV-entry. Such a reading-transaction may also be denoted as a “conflicting transaction”. In an embodiment, a conditional RV-entry represents the entire predicate evaluation.
It should be noted that a conditional conflict is in a data-cell set granularity. For example, a reading-transaction agent TR101-AG20 evaluates a predicate PR1010 for all the rows in a table that are stored on N20. The predicate PR1010 is used to select the rows of people with red hair-color and a software-engineer profession. In this example, the validating-agent TR1-AG20 modified Jane's hair-color and also modified George's hair-color.
In that example, there are two conditional conflicts between TR1-AG20 and TR101-AG20 both for predicate PR1010. One conditional conflict is for the data-cell set [Jane's hair-color, Jane's profession], and the other conditional conflict is for the data-cell set [George's hair-color, George's profession].
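As a non-limiting illustration, the data-cell set granularity of conditional conflicts may be sketched as follows. The function conditional_conflicts and the (row, column) cell representation are hypothetical. The row for John and the salary write are added example data that do not appear in the scenario above.

```python
def conditional_conflicts(predicate_columns, rows_in_scope, written_cells):
    """Return one data-cell set per row in which the validator wrote a predicate cell."""
    conflicts = []
    for row in rows_in_scope:
        if any((row, col) in written_cells for col in predicate_columns):
            conflicts.append([(row, col) for col in predicate_columns])
    return conflicts


pr1010_columns = ["hair-color", "profession"]
rows_on_n20 = ["Jane", "John", "George"]
written_by_tr1 = {("Jane", "hair-color"), ("George", "hair-color"),
                  ("John", "salary")}     # salary is not a predicate cell
for dcs in conditional_conflicts(pr1010_columns, rows_on_n20, written_by_tr1):
    print(dcs)
```

The sketch yields two data-cell sets, one for Jane and one for George. The write to John's salary yields none, since salary is not among the predicate data-cells of PR1010.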
At S540, each identified conditional conflict is processed. In an embodiment, each identified conditional conflict may be classified as being of a particular state. A state characterizes a particular relationship between the evaluations of a predicate before and after the commitment of a validating transaction TR1. The process of determining the state of the conditional conflict is discussed further below. As discussed below, the state may include move-in, move-out, stay-in, or stay-out. In an embodiment, the determination of each of the four states requires the execution of the epsilon checking procedure as discussed below.
In general, a move-in state describes the following situation: R5 is the row containing the data-cell set related to the pertinent conditional conflict between reading-transaction TR101 and the validating writing-transaction TR1 related to predicate PR1010 evaluated by TR101; and TR1 is the only currently active transaction that modifies any of the data-cells related to that data-cell set.
In the move-in state, without the modifications TR1 applies, the evaluation of PR1010 will not select row R5. In the move-in state, with the modifications TR1 applies, the evaluation of PR1010 will select row R5. Therefore, to satisfy various transactional consistency expectations, an early commit of TR1 may require “moving R5 into” the set of rows selected by TR101. Since TR101's execution may not be able “to see” TR1's modifications, an early commitment of TR1 may violate the transactional consistency expectations and hence may not be allowed.
Similarly, in general, a move-out state describes a situation where, without TR1's modifications, TR101 will select R5, whereas if TR1's modifications were included, TR101 would not select R5. Therefore, similarly, TR1's early commitment may not be allowed.
In general, a stay-in state describes the situation where, under similar conditions as described above, TR101 would select R5, with or without including TR1's modifications. Therefore, it can be viewed as if, upon the early commit of TR1, R5 “stays in” the set of rows selected by TR101.
Similarly, a stay-out state describes the situation where TR101 would not select R5, with or without including TR1's modifications. Therefore, this state can be viewed as if, upon the early commit of TR1, R5 “stays out” of the set of rows selected by TR101. Under some conditions, and from the perspective of this specific conditional conflict, TR1 may be allowed to commit earlier than TR101 for both stay-in and stay-out cases. This scenario is disclosed in detail in the above-referenced Ser. No. 18/944,462 application.
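As a non-limiting illustration, the four states may be derived from two predicate results for row R5, one without and one with TR1's modifications. The enumeration ConflictState and the functions classify and may_allow_early_commit are hypothetical names. The further conditions required for an early commit are not modeled.

```python
from enum import Enum


class ConflictState(Enum):
    MOVE_IN = "move-in"
    MOVE_OUT = "move-out"
    STAY_IN = "stay-in"
    STAY_OUT = "stay-out"


def classify(selected_without_tr1, selected_with_tr1):
    """Map the two predicate results for row R5 to one of the four states."""
    if selected_without_tr1 and selected_with_tr1:
        return ConflictState.STAY_IN
    if not selected_without_tr1 and not selected_with_tr1:
        return ConflictState.STAY_OUT
    if selected_with_tr1:
        return ConflictState.MOVE_IN
    return ConflictState.MOVE_OUT


def may_allow_early_commit(state):
    """Only stay-in and stay-out leave TR101's selection of R5 unchanged."""
    return state in (ConflictState.STAY_IN, ConflictState.STAY_OUT)


print(classify(False, True).value)                    # move-in
print(may_allow_early_commit(classify(True, True)))   # True
```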
In an embodiment, this evaluation may utilize an epsilon checking procedure (based on the epsilon principle explained herein). Given a validating-agent TR1-AG20 that has a conditional conflict with a reading-transaction agent TR101-AG20, the epsilon checking procedure determines whether a state of a conditional conflict is a stay-in, stay-out, move-in, or move-out state. The epsilon checking procedure relates to two methods of characterizing the moments immediately before and immediately after a transaction commitment. The function ε−(TR1) may be denoted to describe the moment immediately prior to the commitment of transaction TR1, while the function ε+(TR1) may be denoted to describe the moment immediately following the commitment of validating-transaction TR1.
In an embodiment, by the epsilon checking procedure, an evaluation of a predicate of a transaction may be denoted in relation to a specific timepoint for a specific row. For example, for a predicate PR1010, a function PR1010(x, ε+(TR1)) will return the evaluation of PR1010 for the pertinent data-cell set of row ‘x’ at the moment immediately following the commitment of a validating-transaction TR1. According to an example embodiment, a validating-transaction TR1 is initiated after a reading-transaction TR101 is initiated, but before reading-transaction TR101 is committed. TR101 involves the evaluation of a predicate PR1010, and validating-transaction TR1 is validating. In this example, the epsilon principle allows for TR1 to commit before the commitment of TR101 if PR1010(x, ε+(TR1))=PR1010(x, ε−(TR1)). Therefore, if the evaluation of PR1010 at row ‘x’ returns the same value immediately prior to validating-transaction TR1's commitment as immediately following validating-transaction TR1's commitment, validating-transaction TR1 may be allowed to commit before the commitment of reading-transaction TR101. This result is denoted as the “epsilon principle is satisfied”. It should be noted that in a plurality of embodiments, there may be more than one predicate that would need to satisfy the epsilon principle in order to allow for a validating-transaction TR1 to commit early.
It should be noted that for the PR1010(x, ε−(TR1)) function calculation, the values to be evaluated by the predicate function are those that are currently committed. For the PR1010(x, ε+(TR1)) function calculation, the values to be evaluated by the predicate, for data cells that were not modified by TR1, are those that are currently committed. In addition, for data cells that were modified by TR1 the values to be evaluated by the predicate are those written by TR1. This scenario is disclosed in detail in the above-referenced Ser. No. 18/944,462 application.
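As a non-limiting illustration, the two evaluations may be sketched as follows, where the ε+(TR1) evaluation overlays the values written by TR1 onto the currently committed values. The function names and the row dictionaries are hypothetical. The committed professions and the new hair-color values are example data chosen for the sketch.

```python
def pr1010(row):
    """Example predicate: red hair-color and a software-engineer profession."""
    return row["hair-color"] == "red" and row["profession"] == "software engineer"


def eval_before(predicate, committed_row):
    # PR(x, eps-(TR1)): only currently committed values are used.
    return predicate(committed_row)


def eval_after(predicate, committed_row, tr1_writes):
    # PR(x, eps+(TR1)): TR1's written values replace the committed ones;
    # unmodified cells keep their committed values.
    return predicate({**committed_row, **tr1_writes})


def epsilon_principle_satisfied(predicate, committed_row, tr1_writes):
    return (eval_before(predicate, committed_row)
            == eval_after(predicate, committed_row, tr1_writes))


jane = {"hair-color": "brown", "profession": "accountant"}
george = {"hair-color": "brown", "profession": "software engineer"}
print(epsilon_principle_satisfied(pr1010, jane, {"hair-color": "red"}))    # True
print(epsilon_principle_satisfied(pr1010, george, {"hair-color": "red"}))  # False
```

With this example data, Jane's row is not selected either before or after TR1's modification, so the principle is satisfied. George's row would become selected, so the principle is not satisfied.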
For brevity and clarity, the combination of a predicate (e.g., PR1010) and a data-cell set (e.g., a specific row R5) it evaluates will be notated as “[PR1010*R5]”. It should be noted that a conditional conflict between validating writing-transaction TR1 and reading-transaction TR101 on behalf of predicate PR1010 and row 5 may be denoted as a conditional conflict related to [PR1010*R5]. Additionally, in an embodiment, the predicate evaluates multiple rows of a table. As each data-cell set represents data-cells that belong to the same row, the identity of a row (e.g., row R5) and the corresponding data-cell set evaluated by the predicate (e.g., DCS5) will be used interchangeably to denote the associated data-cell set.
In some embodiments, if the epsilon principle is satisfied, an active protective declaration will be issued by the validating-agent TR1-AG20, sourced by the reading-transaction agent TR101-AG20, for a specific predicate of a data-cell set where the epsilon principle is satisfied (e.g., for [PR1010*DCS1]). An active protective declaration is a mechanism that prevents other writing transactions that modified the contents of that data-cell set from committing. That is, as long as that active protective declaration exists, no other writing-transaction agents (e.g., TR2-AG20), if any, that modified any of DCS1's cells, will be able to progress to commitment and/or to evaluate the epsilon principle for [PR1010*DCS1]. Such a declaration functions to ensure the contents of a data cell in which the epsilon principle is satisfied are not modified by another transaction such that the epsilon principle is not satisfied. Generally, a protective declaration is associated with a conditional conflict between a validating writing-transaction TR1 and a reading-transaction TR101. Such a protective declaration is denoted as “issued” by the validating writing-transaction TR1 (or denoted as “issued” by the validating-agent TR1-AG20, where those two ways of denoting may be used interchangeably) and is denoted as “sourced” by the reading-transaction TR101 (or denoted as “sourced” by the reading-transaction agent TR101-AG20, where those two ways of denoting may be used interchangeably). A protective declaration may have inactive states wherein the protective declaration exists for a specific predicate of a data-cell set but does not have the protective effect as described above. Such a state may be changed from inactive to active or active to inactive in various embodiments. It should be noted that if no conditional conflicts are identified, S540 is skipped. The operation of S540 is described in more detail with respect to
At S550, it is determined whether the validating-agent TR1-AG20 can progress to commitment. This may include checking various conditions. That is, the validating-agent TR1-AG20 is instructed to wait until validation conditions are satisfied. The validation conditions include: (1) validating-agent TR1-AG20 is neither conditionally nor non-conditionally dependent on any reading-transaction (conditional and non-conditional dependencies are described in further detail hereinafter); (2) all non-conditional conflicts were processed at S520, and all conditional conflicts were processed at S540; (3) the validating-agent TR1-AG20 does not have any evaluation-pending validations or evaluation-deferred validations; and (4) any active protective declarations issued by validating-agent TR1-AG20 are not part of a challenging process (discussed in detail later). If the validation conditions are met, execution continues with S560.
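The four validation conditions checked at S550 can be summarized in a short Python sketch. The class `ValidatingAgentState`, its field names, and the function `can_progress_to_commitment` are hypothetical bookkeeping introduced for illustration only; an actual embodiment may track these conditions differently.

```python
from dataclasses import dataclass, field


@dataclass
class ValidatingAgentState:
    # Hypothetical bookkeeping for one validating-agent (e.g., TR1-AG20).
    # Reading-transactions this agent depends on, conditionally or not.
    dependencies: set = field(default_factory=set)
    # Conflicts not yet processed at S520 (non-conditional) or S540 (conditional).
    unprocessed_conflicts: int = 0
    # Evaluation-pending or evaluation-deferred validations still outstanding.
    pending_or_deferred: int = 0
    # Active protective declarations currently part of a challenging process.
    challenged_declarations: int = 0


def can_progress_to_commitment(state):
    """Return True only when all four validation conditions hold."""
    return (
        not state.dependencies                    # condition (1)
        and state.unprocessed_conflicts == 0      # condition (2)
        and state.pending_or_deferred == 0        # condition (3)
        and state.challenged_declarations == 0    # condition (4)
    )


ready = ValidatingAgentState()
blocked = ValidatingAgentState(dependencies={"TR101"})
```

In this sketch, an agent with any remaining dependency, unprocessed conflict, pending validation, or challenged declaration keeps waiting rather than continuing to S560.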
Execution reaches S560 when the validating-agent TR1-AG20 can progress to commitment. To this end, S560 includes placing a commit pause on data cells modified by the validating-transaction TR1. Placing the commit pause allows the transaction to progress to the commit-phase. It should be noted that, in an embodiment, standard means of avoiding race-conditions are assumed. In an example embodiment, the placement of commit pause at S560 is achieved atomically with the conclusion that all validation conditions are met at S550.
It should be noted that in an embodiment, the progress to commitment of TR1-AG20 (as well as the commitment process for the validating-transaction TR1) may be paused for as long as the reading transactions it depends on do not complete their execution, that is, until those reading transactions commit or abort. For example, the determination that a conditional conflict is in a move-in state will lead to a dependency of the validating-agent TR1-AG20 on the corresponding reading-transaction. This pause is initiated in order to preserve concurrency control and prevent the committed values of the validating-transaction TR1 from compromising data integrity.
Although
At S610, it is determined whether the validating-agent TR1-AG20 is already non-conditionally dependent on the reading-transaction agent TR101-AG20. If so, execution ends. Otherwise, execution continues with S620.
At S620, validating-agent TR1-AG20 is marked as non-conditionally dependent on the reading-transaction agent TR101-AG20, reflected in, for example, a transaction commitment dependency graph.
At S630, any protective declarations that were issued by the validating-agent TR1-AG20 that were sourced by reading-transaction agent TR101-AG20 are removed. It should be noted that such a protective declaration can exist if, for example, the validating-agent TR1-AG20 previously validated a conditional conflicting RV-entry of the reading-transaction agent TR101-AG20 related to another data-cell that validating-agent TR1-AG20 modified (or even related to the same data-cell that validating-agent TR1 currently validates), in the context of a predicate TR101-AG20 evaluates. It should be noted that as the validating-agent TR1-AG20 is marked as non-conditionally dependent on reading-transaction agent TR101-AG20, the predictive CCP does not pursue any attempt for committing validating-transaction TR1 earlier than reading-transaction TR101. Therefore, means such as those protective declarations (issued by validating-agent TR1-AG20 and sourced by reading-transaction agent TR101-AG20) are no longer relevant and, hence, are removed.
In one embodiment, if such a removed protective declaration is an active protective declaration, this act may have a “chain reaction”, as there may be another validating-agent TR2-AG20 (or even multiple validating-agents in node N20) that is “interested” in performing an epsilon checking procedure for the corresponding predicate and data-cell set. Validating-agent TR2-AG20 cannot perform that checking procedure as it was blocked by the above active protective declaration issued by validating-agent TR1-AG20. As a result of the removal, validating-agent TR2-AG20 will no longer be blocked, and can now perform the epsilon checking procedure. In an embodiment, if there are multiple “interested” validating-agents in node N20, the epsilon checking procedures can be performed only one at a time.
In the case that a previous conditional-conflict validation of the validating-agent TR1-AG20 for a conditional-conflict with reading-transaction agent TR101-AG20 (for another modified data-cell or for the currently validated data-cell) is blocked by an active protective declaration made by another validating-agent in node N20 (e.g., TR2-AG20), then validating-agent TR1-AG20 may continue such a blocked validation once the protective declaration is no longer active. However, since validating-agent TR1-AG20 is now non-conditionally dependent on reading-transaction agent TR101-AG20, there is no need to further pursue any attempt for committing earlier than reading-transaction TR101. Therefore, the blocked validation of TR1-AG20 is no longer relevant and can be canceled. Similarly, if a previous such conditional-conflict validation is deferred, that deferred validation is no longer relevant and can be cancelled.
It should be noted that process S520 is iteratively performed for each non-conditional conflict identified at S510 (
As already mentioned, in an embodiment, the data-cell set represents a specific row evaluated by the predicate.
At S710, it is determined whether the validating-agent TR1-AG20 is already non-conditionally dependent on the reading-transaction agent TR101-AG20. If so, execution ends. Otherwise, execution continues with S720.
At S720, for the identified conditional conflict, it is determined whether there is already an active protective declaration, issued by another validating-agent (e.g., TR2-AG20), sourced by TR101-AG20, for [PR1010*DCS1] that prevents the validating-agent TR1-AG20 from performing the epsilon checking procedure. If S720 results in a Yes answer, execution proceeds to S730; otherwise, execution continues with S740.
At S730, an inactive evaluation-pending protective declaration is issued by validating-agent TR1-AG20 and sourced by reading-transaction agent TR101-AG20. The protective declaration is issued for the pertinent predicate and data-cell set [PR1010*DCS1]. An inactive evaluation-pending protective declaration is a protective declaration as defined above but, unlike active protective declarations, its existence does not block other validating-agents. It is used to mark that TR1-AG20 may still need to perform the pertinent epsilon check for [PR1010*DCS1].
It should be noted that, according to an embodiment, if the currently active protective declaration issued by another validating-agent on node N20 (e.g., TR2-AG20) that blocked the epsilon checking of TR1-AG20 is later inactivated and/or removed, then the validating-agents on node N20 whose epsilon checking procedures are pending (represented by their inactive evaluation-pending protective declarations) may proceed to perform their epsilon checking procedures. In the case that there are multiple validating-agents that are waiting to perform the epsilon checking procedure for that specific predicate and data-cell set, those epsilon checking procedures are not performed simultaneously, but one transaction after the other. When TR1-AG20, having the inactive evaluation-pending protective declaration on [PR1010*DCS1], is enabled for an epsilon checking procedure as described above, its corresponding inactive evaluation-pending protective declaration may be removed. As discussed later in more detail, if the epsilon principle is satisfied for that epsilon check, TR1-AG20 issues an active protective declaration, sourced by TR101-AG20, on [PR1010*DCS1]. In such a case, if there are additional validating-agents waiting to perform the epsilon checking procedure for [PR1010*DCS1], such validating-agents will be blocked again, this time by the active protective declaration that was just issued by TR1-AG20.
After execution at S730 completes, the process discussed in
At S740, an epsilon checking procedure is performed on the specific conditional conflict with reading-transaction agent TR101-AG20. That is, given a validating-agent TR1-AG20 that has a conditional conflict with a reading-transaction agent TR101-AG20 for [PR1010*DCS1], the epsilon checking procedure determines the state of the conditional conflict, e.g., stay-in, stay-out, move-in, or move-out state. The epsilon checking procedure operates as explained above with respect to
It should be noted that, according to some embodiments, the epsilon checking procedure is not performed or is deferred according to various considerations including but not limited to the following: whether all or some of the data is not in cache memory, the length of time of execution of a transaction, the computational demand of evaluating a predicate, etc. Additionally, the decision not to perform an epsilon checking procedure for [PR1010*DCS1] by TR1-AG20 may be reversible or irreversible, according to various embodiments. In an embodiment, if the decision is irreversible, the specific conditional conflict is transformed to be a non-conditional conflict. The steps described in
In an embodiment, the decision not to perform an epsilon check may be taken in a reversible manner. That is, the epsilon check is deferred, such that an opposite decision to perform the specific epsilon check could be taken later (in an example embodiment, the decision can be re-evaluated in half a second). An inactive evaluation-deferred protective declaration is issued by validating-agent TR1-AG20, sourced by reading-transaction agent TR101-AG20, for [PR1010*DCS1]. Then, the process discussed in
At S750, it is determined if the epsilon principle is satisfied. In an embodiment, the epsilon principle is satisfied when the epsilon checking procedure classifies the conditional conflict's state as either stay-in or stay-out and is not satisfied when the conflict's state is move-in or move-out. If S750 results in a YES answer, execution continues with S760; otherwise, execution continues with S770.
At S760, an active protective declaration, sourced by reading-transaction agent TR101-AG20, for the predicate and data-cell set [PR1010*DCS1] is issued by TR1-AG20. This active protective declaration ensures that the data-cell set for which the epsilon checking procedure resulted in the epsilon principle being satisfied is not modified until validating-transaction TR1's commitment. In this case, validating-agent TR1-AG20 is denoted as “conditionally independent” of reading-transaction agent TR101-AG20 for the specific predicate for the specific data-cell set [PR1010*DCS1]. It should be noted that when and if the active protective declaration becomes inactive and/or is removed prior to validating-transaction TR1's (and its agent TR1-AG20) commitment, the results of the epsilon checking procedure may become invalid at the moment of such an event. It should be noted that, in an embodiment, standard means of avoiding race-conditions are assumed. In an example embodiment, the epsilon checking procedure and the issuance of the active protective declaration are achieved atomically with each other. It should be noted that a conditional independency may be denoted as existing between a writing-transaction and a reading-transaction, and it can also be denoted as existing between a writing-transaction agent and a reading-transaction agent, or any combination thereof (e.g., between a transaction and an agent, etc.). All those ways of notation are equivalent and may be used interchangeably.
At S770, an inactive conditionally-dependent protective declaration for a predicate and data-cell set [PR1010*DCS1] is issued. In some embodiments, the inactive conditionally-dependent declaration is issued by validating-agent TR1-AG20 and sourced by reading-transaction agent TR101-AG20. This occurs because the epsilon principle is not satisfied for that predicate. Validating-agent TR1-AG20 is said to be “conditionally dependent” on reading-transaction agent TR101-AG20. The issuance of an inactive conditionally-dependent protective declaration allows validating-agent TR1-AG20 to perform the epsilon checking procedure again for that predicate and data-cell set [PR1010*DCS1] in the case where another validating-agent (e.g., TR2-AG20) modifies one (or more) of the pertinent data-cells and commits. The re-evaluation as part of the epsilon checking procedure may result in satisfaction of the epsilon principle, and validating-agent TR1-AG20 may thus be able to commit earlier than reading-transaction TR101. It should be noted that a conditional dependency may be denoted as existing between a writing-transaction and a reading-transaction, and it can also be denoted as existing between a writing-transaction agent and a reading-transaction agent, or any combination thereof (e.g., between a transaction and an agent, etc.). All those ways of notation are equivalent and may be used interchangeably.
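The decision flow of S710 through S770 can be condensed into the following Python sketch. The function name, its parameters, and the returned strings are hypothetical; in particular, `conflict_state` stands in for the result of the epsilon checking procedure of S740, which is not modeled here.

```python
def process_conditional_conflict(already_nonconditionally_dependent,
                                 blocked_by_other_active_declaration,
                                 conflict_state):
    """Return the outcome for one conditional conflict on [PR1010*DCS1].

    conflict_state is assumed to be one of "stay-in", "stay-out",
    "move-in", or "move-out", as produced by the epsilon check at S740.
    """
    if already_nonconditionally_dependent:
        # S710: nothing further to validate for this conflict.
        return "none"
    if blocked_by_other_active_declaration:
        # S720 -> S730: the epsilon check cannot run yet.
        return "inactive evaluation-pending"
    if conflict_state in ("stay-in", "stay-out"):
        # S750 -> S760: epsilon principle satisfied.
        return "active"
    # S750 -> S770: epsilon principle not satisfied.
    return "inactive conditionally-dependent"


outcomes = [
    process_conditional_conflict(True, False, "stay-in"),
    process_conditional_conflict(False, True, "stay-in"),
    process_conditional_conflict(False, False, "stay-out"),
    process_conditional_conflict(False, False, "move-in"),
]
```

The sketch omits the deferred and irreversible variants described for S740, which would add further branches before the epsilon check is evaluated.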
It should be noted that, in an embodiment, if the epsilon checking procedure results in an unsatisfied epsilon principle, then, instead of making validating-agent TR1-AG20 conditionally dependent on reading-transaction agent TR101-AG20, the specific conditional conflict may be transformed to be a non-conditional conflict. The process described in
At S810, all commit pauses placed by agent TR1-AG20 are released. Furthermore, when an agent that wrote to data cells commits (i.e., its transaction committed and it received the committed message), the contents of the data cells become “committed” to override data currently stored in data cells. In an embodiment, the data-cells may become “committed” prior to the release of the commit pauses. It should be noted that the commit pauses that are released are only the commit pauses that the agent TR1-AG20 placed, even if there are multiple commit pauses taken on the same data-cell. For example, if a commit pause was placed by agent TR1-AG20 on data cell DC1 and another validating-agent TR2-AG20 placed a commit pause on DC1, the commit pause of agent TR1-AG20 on DC1 will be released, but the commit pause of validating-agent TR2-AG20 on DC1 will not be released at that time.
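The per-agent release described at S810 can be sketched in Python as follows. The mapping `commit_pauses` and the helper names are hypothetical and serve only to illustrate that a release affects the releasing agent's own pauses and leaves other agents' pauses on the same data cell in place.

```python
from collections import defaultdict

# Hypothetical registry: data cell -> set of agents holding a commit pause.
commit_pauses = defaultdict(set)


def place_commit_pause(cell, agent):
    commit_pauses[cell].add(agent)


def release_commit_pauses(agent):
    """Release only the commit pauses that `agent` placed."""
    for cell in list(commit_pauses):
        commit_pauses[cell].discard(agent)
        if not commit_pauses[cell]:
            # No agent holds a pause on this cell any longer.
            del commit_pauses[cell]


place_commit_pause("DC1", "TR1-AG20")
place_commit_pause("DC1", "TR2-AG20")
place_commit_pause("DC2", "TR1-AG20")
release_commit_pauses("TR1-AG20")
```

After the release, DC1 is still paused by TR2-AG20, while DC2 carries no pause at all.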
At S820, non-conditional dependencies are removed. In an embodiment, the non-conditional dependencies that are removed are only the non-conditional dependencies of other validating-agents on agent TR1-AG20. For example, if validating-agent TR2-AG20 is non-conditionally dependent on agent TR1-AG20, the non-conditional dependency will be removed at this step as designated, for example, in a transaction commitment dependency graph.
At S830, all the active protective declarations that agent TR1-AG20 issued are removed. In some embodiments, all such active protective declarations are removed by, for example, a transaction agent. As a result of the removal of the active protective declarations that agent TR1-AG20 issued, in further embodiments, validating-agents that were previously blocked from validating a specific data-cell set (as it was protected by an active protective declaration that TR1-AG20 issued), including those having an inactive evaluation-pending protective declaration or an inactive conditionally-dependent protective declaration, may progress with their respective validations.
In some embodiments, these validations do not take place simultaneously but execute one after the other. According to such embodiments, the order of execution of the previously blocked validating-agents mentioned above is determined according to a variety of strategies. In other embodiments, these validations may be executed simultaneously.
At S840, all protective declarations (whether active or inactive) that agent TR1-AG20 sourced are removed. In some embodiments, this removal is performed by a transaction agent. This step deals with the case where validating-agent TR2-AG20 is a transaction that issued a protective declaration (e.g., PD2) for a predicate and a data-cell set [PR1*DCS1] that agent TR1-AG20 sourced, and PD2 still exists.
In one embodiment, validating-agent TR2-AG20 issued PD2 as an active protective declaration. PD2 is clearly not required anymore, as TR2-AG20 need not do any validation against TR1-AG20 as TR1 already committed. Once PD2 is removed, there may be other validating-agents (e.g., TR3-AG20, TR4-AG20) that are waiting to progress to validation on [PR1*DCS1] as described hereinbefore. Such other validations may not be required and may be canceled during the execution of S840. In some embodiments, no other protective declarations for [PR1*DCS1] are issued because they are removed at this step.
In another embodiment, validating-agent TR2-AG20 issued PD2 as an inactive conditionally-dependent protective declaration, wherein TR2-AG20 is conditionally dependent on agent TR1-AG20, in the context of [PR1*DCS1]. PD2 is sourced by TR1-AG20. Since TR1 (and TR1-AG20) has committed, validating-agent TR2-AG20 is no longer dependent on TR1-AG20. Therefore, PD2 can be removed. As a result, validating-agent TR2-AG20 may progress to commitment, subject to further conditions, such as the ones described in step S550.
In another embodiment, validating-transaction TR2 issued PD2 as an evaluation-pending or evaluation-deferred protective declaration. Since TR1 has committed, validating-transaction TR2 is no longer dependent on TR1 and hence need not perform any related validation in the context of [PR1*DCS1]. Therefore, PD2 can be removed. As a result, validating-transaction TR2 may progress to commitment, subject to further conditions, such as the ones described in step S550.
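The removal performed at S840 can be illustrated with a minimal Python sketch. The `ProtectiveDeclaration` record, its field names, the second declaration `pd3`, and the function `remove_sourced_by` are hypothetical and introduced only to show that every declaration sourced by the committed agent is dropped regardless of its kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectiveDeclaration:
    issuer: str   # validating-agent that issued the declaration
    source: str   # agent that sourced the declaration
    target: str   # predicate and data-cell set, e.g. "[PR1*DCS1]"
    kind: str     # "active", "conditionally-dependent",
                  # "evaluation-pending", or "evaluation-deferred"


def remove_sourced_by(declarations, committed_agent):
    """Split declarations into (kept, removed) by their sourcing agent."""
    kept = [d for d in declarations if d.source != committed_agent]
    removed = [d for d in declarations if d.source == committed_agent]
    return kept, removed


pd2 = ProtectiveDeclaration("TR2-AG20", "TR1-AG20", "[PR1*DCS1]", "active")
pd3 = ProtectiveDeclaration("TR3-AG20", "TR101-AG20", "[PR1010*DCS1]",
                            "conditionally-dependent")
kept, removed = remove_sourced_by([pd2, pd3], "TR1-AG20")
```

Here PD2 is removed because TR1-AG20 sourced it, while the hypothetical declaration sourced by TR101-AG20 is left untouched.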
As discussed above, as part of a transaction statement execution, one or more nodes from which data is to be read are identified. In one embodiment, the statement execution involves a predicate evaluation, for example, selecting rows of an employee-table. As discussed hereinabove, in a distributed database, that predicate evaluation may involve reading data-rows stored on a plurality of the employee-table-nodes. Depending on the strategy decided by the database query planner and/or optimizer for the specific predicate evaluation, and depending on other factors, such as whether indexes can be used as part of the predicate evaluation, the predicate evaluation may require reading data-rows from all the employee-table-nodes (thus, all the employee-table-nodes are covered-nodes) or not from all the employee-table-nodes (thus, some of the employee-table-nodes are uncovered-nodes). As discussed hereinabove, as part of the complete-instantiation-technique, S330 includes the identification (determination) of all the employee-table-nodes, including the covered-nodes as well as the uncovered-nodes.
In an embodiment, according to the complete-instantiation-technique, each identified table-node for a predicate evaluation is called to instantiate a conditional RV-entry. That is, a conditional RV-entry provided for representing the pertinent predicate evaluation by the reading-transaction TR101 is instantiated on each of the identified table-nodes (covered-nodes as well as uncovered-nodes) associated with the corresponding table. Such agents for covered-nodes may each be instantiating that conditional RV-entry as part of their procedure for reading data-cells, as discussed in detail hereinbelow. Additionally, agents for uncovered-nodes are each called to instantiate that conditional RV-entry despite the fact these agents do not perform data-cell reads for the purpose of the corresponding predicate evaluation. Process 900 describes the corresponding steps taken by a pertinent agent in a covered-node as well as in an uncovered-node. In another embodiment, the data stored is to be read not as part of a predicate evaluation.
At S910, a non-conditional and/or conditional RV-entry is added to TR101-AG20's RV. In an embodiment, the addition of such RV-entries is performed by TR101-AG20. In one embodiment, a non-conditional RV-entry is added for intended data-cell reads that are not part of a predicate evaluation. As described hereinbefore, this non-conditional RV-entry designates the data-cells that are to be read. In another embodiment, a conditional RV-entry is added as part of a predicate evaluation, whether N20 is a covered-node or an uncovered-node for the predicate evaluation. It should be noted that, as discussed hereinabove, if N20 is a covered-node, then data-cells are about to be read as part of a predicate evaluation, and if N20 is an uncovered-node, such data-cells are not to be read. As described hereinbefore, this conditional RV-entry (whether in a covered-node or in an uncovered-node) designates the entire predicate evaluation, including information describing the predicate that is evaluated. A single conditional RV-entry may represent a predicate evaluation of a single data-cell set or of multiple data-cell sets. That is, a conditional RV-entry may describe a set of cells that belong to the same row and/or a set of cells that belong to the same columns of a set of rows. A predicate evaluation of multiple data-cell sets may occur, for example, if the scope of the predicate contains multiple rows or the entire set of rows of a table.
In an embodiment, adding a conditional RV-entry to reading-transaction TR101's RV in each respective table-node, including to uncovered-nodes, allows an agent of validating-transaction TR1, in any of the respective table-nodes to identify conditional conflicts, even if the corresponding TR1's agent resides in an uncovered-node. An example demonstrating conditional RV-entries is provided below.
At S920, all existing writing-transaction agents in node N20 that already completed their working-phases and that modified data-cell(s) that are represented by the RV-entry added at S910 are identified (hereafter denoted as “conflicting transaction”). For example, such identification may include scanning the WV-entries of writing-transaction agents that conflict with the RV-entry to identify writing-transaction agents that may be concluded to be conflicting with TR101-AG20. In an embodiment, this scanning of the WV-entries of transactions is performed by TR101-AG20. It should be noted that this step is performed also in the case where N20 is an uncovered-node (in regard to and in the case of a corresponding predicate evaluation).
At S930, for each such identified conflicting validating-transaction agent (e.g., validating-agent TR1-AG20), validating-agent TR1-AG20 is notified about the addition of the non-conditional and/or conditional RV-entry that effectively represents a conflict between the validating-agent TR1-AG20 and the reading-transaction agent TR101-AG20. In an embodiment, reading-transaction agent TR101-AG20 notifies validating-agent TR1-AG20 so that validating-agent TR1-AG20 can perform the validation. In some embodiments, reading-transaction agent TR101-AG20 actually performs validation acts for validating-agent TR1-AG20. By the time reading-transaction agent TR101-AG20 adds the RV-entry and/or performs its related read(s), validating-agent TR1-AG20 may already be in its validation-phase. This disclosed embodiment ensures that validating-agent TR1-AG20 does not ignore the validation for that conflict, which serves to ensure the correctness of the predictive CCP. It should be noted that this step is performed also in the case where N20 is an uncovered-node (in regard to and in the case of a corresponding predicate evaluation).
In an embodiment, if TR101-AG20 notifies TR1-AG20 about the addition of a non-conditional RV-entry, the process disclosed above and in
In an embodiment, if TR101-AG20 notifies TR1-AG20 about the addition of a conditional RV-entry, the process disclosed above and in
In an embodiment, if TR1-AG20 holds a commit pause on the data-cells that TR1-AG20 modified and are related to the data-cells that are represented by TR101-AG20's notification, then TR101-AG20's notification to TR1-AG20 may be delayed until the related commit pause(s) is released. In an embodiment, the process will not progress to S940 until that notification takes place.
In some embodiments, if TR1-AG20's validation-phase has not yet processed the pertinent conflict, then TR1-AG20 may ignore the notification from TR101-AG20, and the validation may be performed later.
At S940, the data-cell(s) are read. In some embodiments, the reading-transaction agent TR101-AG20 performs the read. In an embodiment, in the case of a conditional RV-entry that is added in an uncovered-node, actual reads may not be needed.
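The sequence S910 through S940 can be sketched in Python as shown below. The function `read_with_rv_entry`, its parameters, and the dictionary model of WV-entries and stored data are hypothetical simplifications; the delayed notification for data cells under a commit pause is not modeled.

```python
def read_with_rv_entry(rv, rv_entry_cells, writers_wv, data, notify):
    """Add an RV-entry, notify conflicting writers, then read.

    rv             -- list of RV-entries of the reading-transaction agent
    rv_entry_cells -- data cells represented by the new RV-entry
    writers_wv     -- writing agent -> set of cells in its WV-entries; assumed
                      to hold only agents that completed their working-phase
    data           -- cell -> stored value on this node
    notify         -- callback invoked once per conflicting writing agent
    """
    cells = set(rv_entry_cells)
    rv.append(cells)                                    # S910
    conflicting = [agent for agent, written in writers_wv.items()
                   if written & cells]                  # S920
    for agent in conflicting:                           # S930
        notify(agent)
    # S940: read the cells present on this node; none for an uncovered-node.
    return {c: data[c] for c in rv_entry_cells if c in data}


notified = []
rv = []
values = read_with_rv_entry(
    rv,
    ["DC1", "DC2"],
    {"TR1-AG20": {"DC1"}, "TR2-AG20": {"DC9"}},
    {"DC1": "red", "DC2": "carpenter"},
    notified.append,
)
```

In this sketch the RV-entry is recorded before any read takes place, so a writer that is already validating is still told about the conflict.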
It should be noted that in order for the early commitment of the validating-transaction to satisfy concurrency control requirements, the evaluation of the predicate of a reading-transaction TR101 should follow a set of conditions. The set of conditions may be denoted as the single predicate evaluation consistency principle. It should be further noted that the above disclosed embodiments with respect to the process 900 of
In an embodiment, reading-transaction TR101-AG20 can perform predicate-evaluation in compliance with the predicate evaluation consistency principle by performing the predicate evaluations for each of the data-cell sets (e.g., for each of the rows) after the corresponding conditional RV-entry was added, where the reading of each data-cell set should be done atomically. It should be noted that different data-cell sets need not be read atomically with each other.
It should be noted that there may be cases where reading multiple data-cells of the same data-cell set (e.g., DCS1), for the sake of a predicate evaluation, cannot be achieved atomically, e.g., because those read operations cannot be done close enough to each other in time, thus making it too expensive or impossible to ensure atomicity of these operations. In an embodiment where reading a single data-cell is easily achieved in an atomic manner, the above-mentioned difficulty is mainly relevant for a multi-cell predicate, that is, a predicate with a data-cell set containing two or more data-cells. The following paragraph illustrates an example of this problem, and the subsequent paragraphs illustrate various embodiments that enable atomicity of transactions and ensure the single predicate evaluation consistency principle is adhered to.
In an example embodiment, in a relational database, a predicate uses a data-cell set of two data-cells. For example, the predicate searches for all employees in an employee table with a red hair-color and a profession of a carpenter. These two columns (hair-color and profession) are indexed. In the case that the expected population (i.e., selected rows) of the predicate is a small fraction of the table that they are in, a query planning optimizer may choose to calculate that predicate by performing two index searches and then intersect the two resulting row IDs (assuming the index value is the row ID). Through this approach, two data cells of a data-cell set DCS1 (that is, for example, one of the rows that predicate evaluates) are read completely separately as two separate search operations in two separate indexes that are each ordered differently. Therefore, the two data cells will not be read at more-or-less the same time, and therefore it will be hard and/or expensive to read them both atomically.
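The two-index strategy described above can be illustrated with a short Python sketch. The index contents and the function `rows_matching` are made-up illustrative data and names, not taken from the disclosure; each index maps a column value to a set of row IDs.

```python
# Hypothetical single-column indexes: column value -> set of row IDs.
hair_index = {"red": {3, 7, 9}, "brown": {1, 2}}
profession_index = {"carpenter": {2, 7, 9, 11}, "plumber": {3}}


def rows_matching(hair, profession):
    """Intersect the row IDs returned by two separate index searches."""
    by_hair = hair_index.get(hair, set())                     # first search
    by_profession = profession_index.get(profession, set())   # second search
    return by_hair & by_profession


matches = rows_matching("red", "carpenter")
```

Because the two lookups are independent operations, a writer could modify a row's hair-color or profession between them, which is why reading both cells of the same data-cell set atomically is difficult under this plan.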
In one embodiment of the present disclosure, Point-in-Time (PiT) Imaging is used to address this problem. A PiT is an image of contents of a data-cell at a specific time (which is after the conditional RV-entry is added). A task in a database performs a PiT read when, as long as the specific PiT is created and exists, it reads a row-version for that point in time even though the database's execution progresses. In an embodiment, a PiT may be created for reading-transaction TR101 after the RV-entry is created. In an embodiment the RV-entries for the predicate evaluation are completely instantiated across all the pertinent table-nodes before the index search takes place, and the index search is based on the PiT contents. The results of the PiT reads are used for the predicate evaluation. In some embodiments, a single PiT is used for all the data-cell sets evaluated by the predicate. In other embodiments, multiple PiTs are used, as long as all the data-cells of the same data-cell set are read from the very same PiT. In some embodiments, a PiT is used only for some of the data-cell sets, where the predicate evaluation for other data-cell sets is achieved differently.
In another embodiment, an ad hoc PiT is used. According to this embodiment, PiTs are created for the columns (and/or the indexes) that are part of a predicate evaluation, just before the predicate evaluation begins. Whenever a modification is made to the contents of a data-cell that is involved in the predicate evaluation, the older content is copied aside and used for the predicate evaluation. Once the predicate evaluation is over, the “previous versions” kept aside are erased. Such an embodiment may be more efficient than a more generic PiT usage, and therefore may be adequate also for databases that already support generic PiT usage.
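The copy-aside behavior of an ad hoc PiT can be sketched in Python as follows. The class `AdHocPiT` and its method names are hypothetical, and the sketch assumes every cell written during the evaluation already exists in the store.

```python
class AdHocPiT:
    """Keeps the older content of cells modified during one evaluation."""

    def __init__(self, store):
        self.store = store    # live data-cell contents
        self.saved = {}       # older contents copied aside
        self.active = False

    def begin(self):
        """Called just before the predicate evaluation begins."""
        self.active = True

    def write(self, cell, value):
        # Copy the older content aside on the first modification only,
        # so the image stays fixed at the moment begin() was called.
        if self.active and cell not in self.saved:
            self.saved[cell] = self.store[cell]
        self.store[cell] = value

    def read_for_predicate(self, cell):
        """Return the content as of begin(), ignoring later writes."""
        return self.saved.get(cell, self.store[cell])

    def end(self):
        """Erase the previous versions once the evaluation is over."""
        self.saved.clear()
        self.active = False


store = {"hair": "orange", "profession": "carpenter"}
pit = AdHocPiT(store)
pit.begin()
pit.write("hair", "red")
pit.write("hair", "blue")
seen = pit.read_for_predicate("hair")
pit.end()
```

In this sketch the predicate evaluation sees the content that existed when the evaluation began, even though the live cell was modified twice in the meantime.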
In yet another embodiment, predicate decomposition is used. According to this embodiment, a multi-cell predicate is decomposed into separate predicates (e.g., predicate 1 and predicate 2), and a separate RV-entry is added to the RV of reading-transaction TR101's agents for each of those predicates. As such, the predicate evaluation consistency principle will be effective on each of those predicates, separately.
The following is another example that illustrates an embodiment of the complete-instantiation technique.
An employee table is stored on employee table-nodes N20, N21, N22, N23, and N24. For the predicate (PR1010) of all employees in the employee table that have red/orange hair-color and have a profession of software engineer/software architect, a transaction TR101, managed by a transaction manager TR101-TM on node N101, doubles the salary of these employees. TR101's execution is achieved by a full-scan (this example does not use indexes). Relevant to this example are Jane, who, prior to TR101's execution, is a software engineer with orange hair-color whose data-row resides in N21, and George, who, prior to TR101's execution, is a software engineer with brown hair-color whose data-row resides in N22.
At time t=100, TR101 starts its execution. At t=105, all of TR101's agents are running in their respective nodes N20 through N24. At this time, all of TR101's agents (i.e., TR101-AG20, TR101-AG21, TR101-AG22, TR101-AG23, and TR101-AG24) have added conditional RV-entries pertinent to the predicate evaluation in their respective nodes N20 through N24.
TR1 is a writing-transaction, managed by transaction manager TR1-TM on N1, that modifies Jane's and George's hair-color to red. In this scenario, TR1 and TR101 are the only transactions in the system. As a result, TR1 will not be able to early-commit over TR101, as the modifications of TR1 create a move-in scenario for George. This is further illustrated hereinbelow.
At t=120, TR1 starts its execution. TR1 creates two transaction agents: TR1-AG21 on N21 and TR1-AG22 on N22. At t=125, TR1-AG21 and TR1-AG22 respectively modify Jane's and George's hair-color to red in an uncommitted manner such that the new Jane's and George's data-cell (or data-row) versions are uncommitted. TR1-AG21 and TR1-AG22 notify TR1-TM of their modification completion. At this stage, TR1 effectively completes its working-phase.
At t=127 TR1-TM instructs both TR1-AG21 and TR1-AG22 to start their respective validations. At t=130, TR1-AG21 validates. TR1-AG21's validation loop identifies a conflict between TR1's WV-entry that corresponds to Jane's hair-color data-cell and TR101's RV-entry (added by TR101-AG21) that corresponds to Jane's hair-color data-cell. As a result, TR1-AG21 performs the epsilon checking procedure and finds that the epsilon principle is satisfied (as this is a stay-in scenario for Jane because her data-cell contents were modified from orange hair-color and software engineer to red hair-color and software engineer). As such, TR1-AG21 issues a corresponding active protective declaration. At this time, TR1-AG21 takes a commit pause and notifies TR1-TM that it is ready to commit.
At t=132, TR1-AG22 validates. TR1-AG22's validation loop identifies a conditional conflict between TR1's WV-entry that corresponds to George's hair-color data-cell modification and TR101-AG22's RV-entry that corresponds to predicate PR1010's evaluation on node N22. The conditional conflict is on [PR1010*Rgeorge]. As a result, TR1-AG22 performs the epsilon checking procedure and finds that the epsilon principle is not satisfied (as this is a move-in scenario for George because his data-cell contents were modified from brown hair-color and software engineer to red hair-color and software engineer). As such, TR1-AG22 issues a corresponding inactive conditionally-dependent protective declaration. At this time, TR1-AG22 conditionally depends on TR101. Because of this dependency, TR1-AG22 will not progress to commitment, and, therefore, TR1-TM will not progress to commitment.
Once TR101 completes and commits, TR1-AG22's protective-declaration will be removed (as the protective declaration is no longer relevant), and TR1-AG22 will be able to take the commit pause and notify TR1-TM. As a result, TR1-TM will progress to commitment. In this example scenario, TR1 did not early commit over TR101, due to the move-in scenario for George.
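The stay-in and move-in outcomes in this example can be reproduced with a short Python sketch. It is a simplified illustration only: the function names `pr1010`, `classify`, and `epsilon_satisfied` are hypothetical, rows are modeled as plain dictionaries, and treating a move-out as not satisfying the epsilon principle is an assumption, since this example only covers stay-in and move-in.

```python
def pr1010(row):
    # Predicate PR1010: red/orange hair-color and software engineer/architect.
    return (row["hair"] in {"red", "orange"}
            and row["profession"] in {"software engineer", "software architect"})


def classify(predicate, before, after):
    # Compare predicate membership just before and just after the writer's commit.
    was_in, is_in = predicate(before), predicate(after)
    if was_in and is_in:
        return "stay-in"
    if not was_in and not is_in:
        return "stay-out"
    return "move-in" if is_in else "move-out"


def epsilon_satisfied(predicate, before, after):
    # Assumption: only unchanged membership (stay-in / stay-out) satisfies the principle.
    return classify(predicate, before, after) in ("stay-in", "stay-out")


jane_before = {"hair": "orange", "profession": "software engineer"}
george_before = {"hair": "brown", "profession": "software engineer"}
tr1_write = {"hair": "red"}  # TR1 sets both hair-colors to red

jane_case = classify(pr1010, jane_before, {**jane_before, **tr1_write})
george_case = classify(pr1010, george_before, {**george_before, **tr1_write})
tr1_ag21_ok = epsilon_satisfied(pr1010, jane_before, {**jane_before, **tr1_write})
tr1_ag22_ok = epsilon_satisfied(pr1010, george_before, {**george_before, **tr1_write})
```

Under these assumptions, TR1-AG21's check passes while TR1-AG22's check fails, which matches the active declaration issued on N21 and the inactive conditionally-dependent declaration issued on N22.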
The following example is another scenario, based on the initial conditions of the above example up to and including activities at t=132, that illustrates an embodiment of the complete-instantiation technique. This example involves three transactions: TR101, TR1 (as in the previous example), and TR2.
TR2 is a writing transaction, managed by transaction manager TR2-TM on N2, that modifies George's profession to dentist. At t=140, TR2 starts its execution. TR2 creates a transaction agent, TR2-AG22, on N22. At t=145, TR2-AG22 modifies George's profession to dentist in an uncommitted manner such that the modified-by-TR2 George's data-cell (or data-row) version is uncommitted. TR2-AG22 notifies TR2-TM of its modification completion. At this stage, TR2 effectively completes its working-phase.
At t=147, TR2-TM instructs TR2-AG22 to start its validation. At t=150, TR2-AG22 validates. TR2-AG22's validation loop identifies a conditional conflict between TR2's WV-entry that corresponds to George's profession data-cell modification and TR101-AG22's RV-entry that corresponds to predicate PR1010's evaluation on node N22. The conditional conflict is on [PR1010*Rgeorge]. As a result, TR2-AG22 performs the epsilon checking procedure and finds that the epsilon principle is satisfied (as this is a stay-out scenario for George because his data-cell contents were modified from brown hair-color and software engineer to brown hair-color and dentist). It should be noted that TR2 does not use the “red” hair-color that was written by TR1 because TR1 has not yet committed. Instead, TR2 performs the epsilon checking procedure based on the currently committed contents of that data-cell. As such, TR2-AG22 issues an active protective declaration. At this time, TR2-AG22 takes a commit pause and notifies TR2-TM that it is ready to commit.
At t=155, TR2-TM commits TR2. TR2-TM notifies TR2-AG22 to take the relevant actions upon commitment, e.g., release of the commit pause. Additionally, TR2-AG22 removes its active protective declaration, which enables a re-validation of any inactive conditionally-dependent protective declaration issued by TR1-AG22 at t=132.
At t=160, TR1-AG22 re-performs the epsilon checking procedure. At this time, TR1-AG22 finds that the epsilon principle is satisfied (as this is now a stay-out scenario for George because his data-cell contents were modified from brown hair-color and dentist (dentist, due to TR2's committed modification) to red hair-color and dentist). As such, TR1-AG22 turns its protective declaration to an active state, takes the commit pause, and notifies TR1-TM that it is ready to commit. At t=165, TR1-TM early-commits over TR101.
This example illustrates how TR2's activity enabled TR1 to be able to early-commit over TR101, whereas TR1, prior to TR2's activity (as described in the first example), could not early-commit over TR101.
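The timeline of this second scenario can be traced in a few lines of Python. The sketch is illustrative and rests on assumptions: `epsilon_check` is a hypothetical helper that compares the predicate on the currently committed row against the same row with the validator's own uncommitted writes applied, and commitment is modeled as a dictionary update.

```python
def pr1010(row):
    # Predicate PR1010: red/orange hair-color and software engineer/architect.
    return (row["hair"] in {"red", "orange"}
            and row["profession"] in {"software engineer", "software architect"})


def epsilon_check(committed_row, own_writes):
    # Satisfied when predicate membership is the same before and after the commit.
    before = dict(committed_row)
    after = {**committed_row, **own_writes}
    return pr1010(before) == pr1010(after)


george = {"hair": "brown", "profession": "software engineer"}  # committed contents
tr1_writes = {"hair": "red"}            # uncommitted, invisible to TR2
tr2_writes = {"profession": "dentist"}  # uncommitted

tr1_at_132 = epsilon_check(george, tr1_writes)  # move-in for George
tr2_at_150 = epsilon_check(george, tr2_writes)  # stay-out: TR2 ignores TR1's "red"
george.update(tr2_writes)                       # t=155: TR2 commits
tr1_at_160 = epsilon_check(george, tr1_writes)  # re-check against new committed row
```

The same write by TR1 fails the check at t=132 and passes it at t=160 only because the committed contents it is compared against changed in between.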
Furthermore, the following paragraphs illustrate scenarios wherein a “challenging” process, according to various embodiments, is used.
In some embodiments, the following scenario is addressed: a validating-transaction agent of TR1 (validating-agent TR1-AG20) issues an active protective declaration for a predicate of a data-cell set [PR1010*DCS1] sourced by a reading-transaction agent TR101-AG20; validating-agent TR1-AG20 becomes dependent on another reading-transaction agent TR102-AG20 (or, in some other cases, conditionally-dependent on reading-transaction agent TR101-AG20 on behalf of a different conflict); and there are reasons to believe that this dependency will remain for a sufficiently long time before validating-agent TR1-AG20 could progress to commitment. In this scenario, as well as others, another validating-agent TR2-AG20 may be prevented from validating because of TR1-AG20's active protective declaration on [PR1010*DCS1]. Under some circumstances, this may block TR2-AG20 from immediately early committing and thereby hinder the overall concurrency of the predictive CCP.
In one embodiment, validating-agent TR2-AG20 may challenge that active protective declaration. In some embodiments, this challenging process involves the following steps.
First, a request to challenge the active protective declaration is sent to validating-agent TR1-AG20. That is, a request to get a chance to “take over” [PR1010*DCS1], in case the epsilon checking procedure of TR2 will satisfy the epsilon principle. If the epsilon principle is not satisfied, TR1-AG20 will continue to own the active protective declaration on [PR1010*DCS1]. In an embodiment, this request is sent by validating-agent TR2-AG20. It is determined, depending on various factors discussed in more detail below, whether TR1-AG20 will “agree” to accept that challenge. It should be noted that if TR1-AG20 has already progressed to commitment, then it cannot accept the challenge.
If validating-agent TR1-AG20 accepts the challenge, then validating-agent TR2-AG20 performs the epsilon checking procedure for the predicate of the data-cell set [PR1010*DCS1] that has the active protective declaration. In one embodiment, this is performed while the predicate of the data-cell set is still under the active protective declaration issued by TR1-AG20. During this step, validating-agent TR1-AG20 is prevented from progressing to commitment.
In one embodiment, if TR2-AG20's epsilon checking procedure calculates that the epsilon principle is satisfied, then validating-agent TR2-AG20 issues its own active protective declaration for that predicate of the data-cell set. As such, validating-agent TR1-AG20 changes the state of its protective declaration to an inactive evaluation-pending protective declaration.
In another embodiment, if validating-agent TR2-AG20's epsilon checking procedure calculates that the epsilon principle is not satisfied, then the active protective declaration for the predicate of the data-cell set issued by validating-agent TR1-AG20 remains, and validating-agent TR1-AG20 is not prevented (by that challenge processing) from progressing to commitment. Also, in a further embodiment, validating-agent TR2-AG20 issues a protective declaration for the predicate of the data-cell set with an inactive state, e.g., inactive evaluation-pending protective declaration.
If validating-agent TR1-AG20 does not agree to accept the challenge, then the active protective declaration for the predicate of the data-cell set [PR1010*DCS1] that validating-agent TR1-AG20 issued remains, and validating-agent TR1-AG20 is not prevented (by that challenge processing) from progressing to commitment. Also, in a further embodiment, TR2-AG20 issues a protective declaration for the predicate of the data-cell set [PR1010*DCS1] with an inactive state e.g., inactive evaluation-pending protective declaration.
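The outcomes of the challenging process described above can be summarized in a small Python sketch. It is a simplified model under stated assumptions: the names `State`, `Declaration`, and `challenge` are hypothetical, the holder's decision is passed in as a boolean, the challenger's epsilon checking procedure is passed in as a callable, and the challenger always issues an inactive declaration when it does not take over (which the text describes only as a further embodiment).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Tuple


class State(Enum):
    ACTIVE = "active"
    INACTIVE_EVALUATION_PENDING = "inactive evaluation-pending"


@dataclass
class Declaration:
    owner: str
    target: str
    state: State


def challenge(holder: Declaration, challenger: str, holder_accepts: bool,
              epsilon_check: Callable[[], bool]) -> Tuple[Declaration, Declaration]:
    """Return (holder's declaration, challenger's declaration) after a challenge."""
    # The epsilon check runs only if the holder accepted the challenge.
    if holder_accepts and epsilon_check():
        # Challenger takes over; the holder's declaration becomes inactive.
        holder.state = State.INACTIVE_EVALUATION_PENDING
        return holder, Declaration(challenger, holder.target, State.ACTIVE)
    # Challenge declined or epsilon principle not satisfied: the holder keeps it.
    return holder, Declaration(challenger, holder.target,
                               State.INACTIVE_EVALUATION_PENDING)


calls = []

def passing_check():
    calls.append("checked")
    return True

held = Declaration("TR1-AG20", "[PR1010*DCS1]", State.ACTIVE)
h1, c1 = challenge(held, "TR2-AG20", holder_accepts=False, epsilon_check=passing_check)
declined = (h1.state, c1.state, list(calls))

h2, c2 = challenge(held, "TR2-AG20", holder_accepts=True, epsilon_check=passing_check)
accepted = (h2.state, c2.state, list(calls))
```

The sketch omits the rule that TR1-AG20 is prevented from progressing to commitment while the challenger's check is in progress; a real implementation would need to enforce that around the `epsilon_check()` call.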
According to the disclosed embodiments, validating-agent TR1-AG20's determination of whether to accept the challenge on the active protective declaration that validating-agent TR1-AG20 issued depends on a variety of conditions, including, but not limited to, the following: (1) how long validating-agent TR1-AG20 has held the active protective declaration, (2) whether validating-agent TR1-AG20 has completed its validation-phase, and (3) the nature of the transactions that validating-agent TR1-AG20 is dependent on (if any). Additionally, such a determination may depend on a variety of factors, including, but not limited to, how to avoid or minimize negative dynamics. One such dynamic is a high-frequency oscillation scenario in which validating-agent TR2-AG20 challenges TR1-AG20's active protective declaration and, as a result, issues an active protective declaration, only to inactivate it a microsecond later when it is challenged by TR1-AG20, which, as a result, activates its protective declaration again.
According to the disclosed embodiments, validating-agent TR2-AG20's determination of whether to challenge validating-agent TR1-AG20's active protective declaration and when to do so depends on a variety of factors. In some embodiments, validating-agent TR2-AG20 challenges validating-agent TR1-AG20's active protective declaration only after it has completely performed its validation-phase and is not dependent on any other transactions. However, in other embodiments, validating-agent TR2-AG20 may challenge validating-agent TR1-AG20's active protective declaration earlier, later, or not at all.
It should be noted that certain transactions involving referenced self-writes may cause inconsistencies when the procedures described above are applied and thus may require modifications to the procedures. According to the present disclosure, a referenced self-write is a writing operation performed by a reading-transaction agent, for example, TR101's agent on table-node N20, TR101-AG20, to a data-cell that TR101-AG20 will later on read as part of a predicate evaluation. A referenced self-write may interfere with the procedures described above by remaining undetected and unread by a validating transaction agent, for example, TR1's agent on node N20, TR1-AG20, possibly resulting in an incorrect determination that TR1 is able to commit early.
The inconsistencies that may be created by the presence of a referenced self-write may be demonstrated by the following example. According to this example, two transactions in a database are executed: TR101 and TR1. The database includes a table with rows representing employees and each row contains a plurality of characteristics for its respective employee, including hair color, salary, and profession. Within this example database, there exists an employee, “Jane,” who has blond hair and is a dentist. Jane's data-row resides on employee-table-node N20. TR101 is a reading-transaction that, by using its agent in node N20, TR101-AG20, first modifies Jane's hair color to “red”. Then, TR101-AG20 scans all the employees, and for each employee, TR101-AG20 reads the cells corresponding to the employee's hair color and profession, and then modifies the cell for that employee's salary if the employee has red or orange hair and is a software engineer, conditions which may be denoted as predicate PR1010. TR1 is a transaction that, by using its agent in node N20, TR1-AG20, modifies Jane's profession to “software engineer”. As TR101-AG20 executes, it will first modify “Jane's” hair color from blond to red. This modification is the referenced self-write.
It should be noted that, as part of the predictive CCP that guarantees serializability and other expected consistency properties, such a modification (by TR101-AG20) is done in an uncommitted manner and hence is not generally visible to other transactions. However, for the sake of serializability and other expected consistency properties, such a modification should be visible to the reading-transaction agent TR101-AG20 itself, if it later on reads the pertinent modified data-cell. That is, if TR101-AG20 later on reads Jane's hair color, the content it should read would be “red”. TR101-AG20 will then read on the currently committed cells with the addition of any self-written modifications. TR101-AG20 will thus read on the cells as if employee “Jane” has red hair and is a dentist. Note that prior to scanning the employees and evaluating the predicate PR1010 on each, TR101-AG20 adds a conditional RV-entry describing PR1010's evaluation.
According to the procedures described above, the cells that are read by TR101-AG20 will be inconsistent with the cells assumed to be read by TR101-AG20 from the logic of the validating-agent TR1-AG20, as TR1-AG20 would not recognize the modification of “Jane's” hair color to red by TR101-AG20. Instead, TR1-AG20 would assume that “Jane's” hair color remains blond. As TR1-AG20 executes concurrently with TR101-AG20, it will modify “Jane's” profession to “software engineer” and add the modified row to its WV. The epsilon checking procedure will then be applied to the conditional conflict detected between TR1-AG20 and TR101-AG20. The predicate of TR101, PR1010, will be evaluated for both the ε− and ε+ conditions as applied to TR1. In other words, PR1010(Jane, ε−(TR1)) and PR1010(Jane, ε+(TR1)) will be evaluated. With regards to PR1010(Jane, ε−(TR1)), it will be determined that from the perspective of TR1-AG20, at the moment before the commitment of TR1, the predicate evaluation will return a value of “FALSE” since “Jane's” hair color is assumed to be blond. With regards to PR1010(Jane, ε+(TR1)), it will be determined that from the perspective of TR1-AG20, at the moment after the commitment of TR1, the predicate evaluation will return a value of “FALSE” since “Jane's” hair color is assumed to remain blond. TR1-AG20, as well as TR1 will thus proceed to early commitment over TR101, as the state of the conflict is classified as a stay-out state. However, this would be inconsistent with the perspective of TR101-AG20 that “Jane's” hair color is red at the time of the evaluation of PR1010. In this example, the inconsistency may result in an unrecognized moving-in state created by the committed modification by TR1-AG20 and the referenced self-write by TR101-AG20. The inconsistency may further result in a violation of serializability and other consistency expectations.
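The mismatch between the two perspectives can be made concrete with a short Python sketch. It is illustrative only: rows are plain dictionaries, the variable names are hypothetical, and the reader's view is modeled simply as the committed contents overlaid with its own self-write.

```python
def pr1010(row):
    # Predicate PR1010 as used in this example: red/orange hair and software engineer.
    return row["hair"] in {"red", "orange"} and row["profession"] == "software engineer"


committed_jane = {"hair": "blond", "profession": "dentist"}
tr101_self_write = {"hair": "red"}               # referenced self-write (uncommitted)
tr1_write = {"profession": "software engineer"}  # TR1-AG20's modification

# Unamended check by TR1-AG20: the self-write is invisible to it.
naive_before = pr1010(committed_jane)
naive_after = pr1010({**committed_jane, **tr1_write})

# TR101-AG20's own view: committed contents plus its self-written modification.
reader_before = pr1010({**committed_jane, **tr101_self_write})
reader_after = pr1010({**committed_jane, **tr1_write, **tr101_self_write})
```

Here `naive_before` and `naive_after` are both false (an apparent stay-out), while `reader_before` is false and `reader_after` is true, which is the unrecognized move-in.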
Therefore, the inconsistencies created by the presence of referenced self-writes necessitate a modification of the aforementioned procedures. A possible modification targeting the epsilon checking procedure can be described as follows. Taking the example described above, in order to align the cell contents read by TR101-AG20 with the cell contents read by TR1-AG20, the epsilon checking procedure applied to TR1-AG20 should take into account any modifications made by TR101-AG20 prior to its evaluation of PR1010. Additionally, in some embodiments, there would not be any modification of data-cells from the moment that a conditional RV-entry for PR1010 is created until the moment that PR1010 is evaluated.
According to the present disclosure, the solution to the inconsistencies created by referenced self-writes may involve the creation of a trail of ordered data-access operations. Such operations may include the creation of both RV-entries and WV-entries. The trail of operations may be facilitated by an Intra-Transaction Data-Access Trail Order-ID (ITDAT Order-ID), which may increase monotonically according to a real-clock timestamp, a logical timestamp (such as a counter), and the like. In an embodiment, when created, WV-entries and RV-entries are each assigned an ITDAT Order-ID such that those ITDAT Order-ID values are unique and monotonically increasing. In an embodiment, such a trail is maintained separately for each transaction agent. That is, for example, if TR101 has two agents, TR101-AG20 and TR101-AG21, each of those agents will manage its own independent trail.
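A per-agent trail of this kind can be sketched in Python as follows. This is a minimal sketch assuming a logical counter as the source of ITDAT Order-IDs; the class name `AgentTrail`, the `record` method, and the tuple layout of an entry are hypothetical.

```python
import itertools


class AgentTrail:
    """Hypothetical per-agent trail of RV/WV entries ordered by ITDAT Order-ID."""

    def __init__(self, agent):
        self.agent = agent
        self._counter = itertools.count(1)  # logical timestamp, unique and increasing
        self.entries = []                   # (order_id, kind, cell, value)

    def record(self, kind, cell, value=None):
        # Assign the next Order-ID at entry creation time.
        order_id = next(self._counter)
        self.entries.append((order_id, kind, cell, value))
        return order_id


ag20 = AgentTrail("TR101-AG20")
ag21 = AgentTrail("TR101-AG21")  # each agent manages its own independent trail

wv_id = ag20.record("WV", ("jane", "hair"), "red")  # referenced self-write
rv_id = ag20.record("RV", "PR1010")                 # conditional RV-entry
other_id = ag21.record("RV", "PR1010")
```

Because the counters are per agent, Order-IDs are comparable only within one agent's trail, which is all that the ordering of self-writes relative to a conditional RV-entry requires.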
As applied to the example described above, the creation of an RV-entry by TR101-AG20 and the creation of a WV-entry by TR101-AG20 would each be assigned an ITDAT Order-ID. It should be noted that the creation of a WV-entry by TR1-AG20 would also result in assigning an ITDAT Order-ID, although that fact will not be used by the following description. An amended epsilon checking procedure may be applied to TR1-AG20 and TR101-AG20 as follows. According to the amended epsilon checking procedure (e.g., by TR1-AG20), the evaluation of a predicate PR1010 for ε−(TR1) would read on the currently committed contents of the relevant data-cells, modified by any write operations by TR101-AG20 for the data-cells, up to the moment of TR101-AG20's conditional RV-entry, where the write operations modify the data-cells in the order denoted by their respective ITDAT Order-IDs. Similarly, the evaluation for ε+(TR1) would read on the currently committed contents of the relevant data-cells, modified by any write operations by TR1-AG20, and further modified by any write operations by TR101-AG20 for the data-cells, up to the moment of TR101-AG20's pertinent conditional RV-entry, where the write operations by TR101-AG20 modify the data-cells in the order denoted by their respective ITDAT Order-IDs. Utilizing this amended epsilon checking procedure, the presence of referenced self-writes may not result in an inconsistency of cell-content reads as between TR101-AG20 and TR1-AG20, and principles of serializability and consistency expectations may be preserved.
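The amended procedure can be sketched in Python for the Jane example. The sketch is a simplified illustration under assumptions: the function name `amended_epsilon_check` is hypothetical, the reader's trail is a list of `(order_id, kind, cell, value)` tuples, and rows are plain dictionaries keyed by cell name.

```python
def pr1010(row):
    # Predicate PR1010 as used in the Jane example.
    return row["hair"] in {"red", "orange"} and row["profession"] == "software engineer"


def amended_epsilon_check(predicate, committed, validator_writes, reader_trail, rv_order_id):
    """Return the predicate values for (epsilon-minus, epsilon-plus)."""
    # Reader self-writes created before the conditional RV-entry, in Order-ID order.
    self_writes = sorted(
        (entry for entry in reader_trail
         if entry[1] == "WV" and entry[0] < rv_order_id),
        key=lambda entry: entry[0],
    )

    def overlay(base):
        row = dict(base)
        for _, _, cell, value in self_writes:
            row[cell] = value
        return row

    before = overlay(committed)                          # committed + reader self-writes
    after = overlay({**committed, **validator_writes})   # plus validator's own writes
    return predicate(before), predicate(after)


committed_jane = {"hair": "blond", "profession": "dentist"}
tr101_ag20_trail = [(1, "WV", "hair", "red"), (2, "RV", "PR1010", None)]
tr1_writes = {"profession": "software engineer"}

eps_minus, eps_plus = amended_epsilon_check(
    pr1010, committed_jane, tr1_writes, tr101_ag20_trail, rv_order_id=2)
moves_in = (not eps_minus) and eps_plus
```

With the self-write overlaid, the check reports a move-in for Jane, so under these assumptions TR1 would not be cleared to early-commit over TR101.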
It should be noted that the amended epsilon checking procedure using ITDAT Order-IDs is merely one method for identifying and accommodating referenced self-writes. Any alternative method for accommodating referenced self-writes may be compatible with the present disclosure. For example, any alternative method that uses an ordering of RV-entries and WV-entries that does not utilize ITDAT Order-IDs may be compatible with the present disclosure.
The predictive CCP disclosed herein supports database operations performed on data rows. Such operations include inserting a row, deleting a row, and modifying a row. These operations are performed while maintaining the serializability and concurrent execution of transactions.
The processing circuitry 1010 may be realized as one or more hardware logic components and circuits. For example, and without limitation, illustrative types of hardware logic components that can be used include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), graphics processing units (GPUs), tensor processing units (TPUs), general-purpose microprocessors, microcontrollers, digital signal processors (DSPs), and the like, or any other hardware logic components that can perform calculations or other manipulations of information.
The memory 1020 may be volatile (e.g., random access memory, etc.), non-volatile (e.g., read-only memory, flash memory, etc.), or a combination thereof.
In one configuration, software for implementing one or more embodiments disclosed herein may be stored in storage 1030. In another configuration, the memory 1020 is configured to store such software. Software shall be construed broadly to mean any type of instructions, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Instructions may include code (e.g., in source code format, binary code format, executable code format, or any other suitable format of code). The instructions, when executed by the processing circuitry 1010, cause the processing circuitry 1010 to perform the various processes described herein.
The storage 1030 may be magnetic storage, optical storage, and the like, and may be realized, for example, as flash memory or other memory technology, compact disk-read only memory (CD-ROM), Digital Versatile Disks (DVDs), or any other medium which can be used to store the desired information.
The network interface 1040 allows node 210 in the system database 120 to communicate with, for example, client devices, external or internal networks, and the like.
It should be understood that the embodiments described herein are not limited to the specific architecture illustrated in
It is important to note that the embodiments disclosed herein are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed embodiments. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality. In the drawings, like numerals refer to like parts through several views.
The various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof. Moreover, the software may be implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such a computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the disclosed embodiment and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosed embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
It should be understood that any reference to an element herein using a designation such as “first,” “second,” and so forth does not generally limit the quantity or order of those elements. Rather, these designations are generally used herein as a convenient method of distinguishing between two or more elements or instances of an element. Thus, a reference to first and second elements does not mean that only two elements may be employed there or that the first element must precede the second element in some manner. Also, unless stated otherwise, a set of elements comprises one or more elements.
As used herein, the phrase “at least one of” followed by a listing of items means that any of the listed items can be utilized individually, or any combination of two or more of the listed items can be utilized. For example, if a system is described as including “at least one of A, B, and C,” the system can include A alone; B alone; C alone; 2A; 2B; 2C; 3A; A and B in combination; B and C in combination; A and C in combination; A, B, and C in combination; 2A and C in combination; A, 3B, and 2C in combination; and the like.
This application claims the benefit of U.S. Provisional Application No. 63/600,145 filed on Nov. 17, 2023, the contents of which are hereby incorporated by reference. The subject matter of the present application relates to U.S. patent application Ser. No. 18/341,279 filed Jun. 26, 2023. The contents of the Ser. No. 18/341,279 application are hereby incorporated by reference. The subject matter of the present application also relates to U.S. patent application Ser. No. 18/944,462 filed Nov. 12, 2024. The contents of the Ser. No. 18/944,462 application are hereby incorporated by reference. The subject matter of the present application also relates to U.S. patent application Ser. No. 18/945,062 filed Nov. 12, 2024. The contents of the Ser. No. 18/945,062 application are hereby incorporated by reference.
| Number | Date | Country |
|---|---|---|
| 63600145 | Nov 2023 | US |