The present invention relates to memory tracking for an elastic distributed graph-processing system.
A graph database is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A graph relates data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Graph databases hold the relationships between data as a priority. The underlying storage mechanism of graph databases can vary. Relationships are a first-class citizen in a graph database and can be labeled, directed, or given properties. Some implementations use a relational engine and store the graph data in a table.
Many applications of graph database processing involve processing increasingly large graphs that do not fit in a single machine's memory. Distributed graph processing engines partition the graph among multiple machines and execute graph processing operations in the multiple machines, potentially in parallel, with communication of intermediate results between machines. Distributed graph processing engines can be implemented in cloud environments to provide dynamic scalability as graph sizes increase.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Further, it should not be assumed that any of the approaches described in this section are well-understood, routine, or conventional merely by virtue of their inclusion in this section.
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
In a distributed graph processing system, graph data is partitioned and loaded into the main memory of multiple machines in a distributed manner. Various user operations, such as graph queries or graph algorithms, can then be efficiently executed over this data. The illustrative embodiments introduce a resource manager design for an elastic/scale-out multi-user distributed graph processing system, where several users simultaneously execute operations while the system automatically scales out when under memory pressure. The resource manager is responsible for the centralized tracking of memory utilization across those machines. Graph data can be inherently skewed (e.g., social network graphs that follow a power-law degree distribution), and graph operations can produce skewed amounts of output data, even under uniform graph data partitioning. The illustrative embodiments provide a resource manager design that builds on memory pre-reservations and reservations to capture this complex nature of graph operations. Graph operations reserve memory, sometimes based on a heuristic estimate of their memory requirements, while allocating memory solely from their reservations. Memory reservations are characterized as fixed or flexible, such that operations that rely on the graph data distribution over machines can accurately express their requirements in the face of elasticity. Altogether, the resource manager in the illustrative embodiments can bind with any external interface, such as the control plane of a cloud environment, to allow for elasticity (adding machines to or removing machines from the cluster). The resource manager of the illustrative embodiments guarantees that all operations will eventually complete if there is enough memory for each individual operation and can leverage evictions of data from memory, as well as graph rebalancing, to make enough memory available for ongoing operations to execute.
The illustrative embodiments introduce a resource manager component for distributed graph processing that supports elasticity across the diverse set of graph operations using a reservation-first approach. For operations such as loading a graph, flexible reservations give the resource manager the flexibility to use all available memory of the distributed system. Therefore, the partitioned graph data can be assigned to machines in a manner that allows for the highest degree of parallelism and for the best performance. However, graph operations, such as graph pattern matching, can nevertheless produce skewed amounts of output data that cannot be predicted beforehand, even when the graph data has been uniformly distributed. Fixed reservations inform the resource manager that a given operation will invariantly require a certain amount of memory on a specific machine, as dictated by the graph's characteristics and partitioning. Additionally, pre-reservations provide a fail-fast mechanism for identifying whether an operation can start/resume, whether it must remain on hold, or whether it must be cancelled because its requirements are deemed impossible to satisfy. Reservations also allow the resource manager to avoid live-locks in the presence of concurrency in a simple yet effective way.
The resource manager of the illustrative embodiments relies on only a few simple concepts (i.e., pre-reservations, flexible reservations, fixed reservations), yet it can express the very diverse set of operations available in distributed graph processing. Additionally, the resource manager requires coordination only at reservation granularity; therefore, most allocations are local and incur no overhead. Finally, the resource manager uses very simple policies to avoid live-locks.
Adversarial (or unlucky) execution schedules might create jobs that over-request resources (e.g., a job asks for 8 GB, is put on hold, and the resource manager resumes the job with 16 GB when the job actually needs only 8.1 GB). Additionally, the system can theoretically become stuck in a pessimistic execution mode due to consecutive on-hold operations. Nevertheless, in practice, these sub-optimal schedules should rarely occur, and employing mechanisms, such as heuristically limiting the number of times a job can be put on hold, can help mitigate these issues.
Graph processing is a challenging workload for database management systems (DBMSs), exposing a number of key user operations on graph data, such as graph algorithms (e.g., PageRank) and graph queries. For example, the operation, “find the friends of my friends with whom I have the most common friends,” can be expressed as a graph query in Query 1, in Property Graph Query Language (PGQL), as follows:
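One plausible PGQL formulation is sketched below; the Person label, the friendOf edge label, and the name property are illustrative assumptions rather than the exact query text of the embodiments:

    SELECT p3.name, COUNT(p2) AS common_friends
    FROM MATCH (p1:Person)-[:friendOf]->(p2:Person)-[:friendOf]->(p3:Person)
    WHERE p1.name = 'Me' AND id(p1) <> id(p3)
    GROUP BY p3
    ORDER BY common_friends DESC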
On one hand, graph algorithms typically iterate repeatedly over the graphs and often embed data structure abstractions, such as hash maps and priority queues, in their implementations. On the other hand, graph queries match patterns on the graph and, further, as seen in the above example, perform traditional relational operations, such as GROUP BY and ORDER BY. A “complete” graph processing system includes various data structures: graph, graph delta (a new snapshot of the graph after some updates, such as vertex insertions, are performed), hash maps, priority queues, and tables/frames (a tabular relational structure that holds the results of the query patterns and enables relational processing, such as GROUP BY).
In a distributed graph processing system, all these data structures are partitioned across the machines of the cluster. In an embodiment, the DBMS comprises a resource manager, which handles database processes for the database system. The resource manager may be a background daemon, a database component, a software module, or some combination thereof. The resource manager may monitor database instance(s) and track processor and I/O resources across database processes. In an embodiment, the resource manager is a process scheduler that interrupts, de-schedules, schedules, or otherwise controls when database processes may run. In an elastic distributed graph system, i.e., a system where machines can be added on demand to increase the total memory and compute capacity, or removed to decrease it, the resource manager component tracks and enforces how memory is used within and across machines. Memory is tracked across the various jobs that implement user commands, and across in-memory objects, such as loaded graphs. Typically, these systems also provide multi-user support, resulting in many parallel jobs executing at the same time, contending for resources. Altogether, for a multi-user elastic graph processing system (i.e., the core of any graph service in the Cloud) to operate properly, such a resource manager in an embodiment does the following:
To highlight the complexity of requirement (6) above further, consider loading a graph G from files into distributed memory and executing on G Query 2, which is similar to Query 1 above, as follows:
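One plausible PGQL shape for Query 2, reusing the illustrative Person and friendOf labels from Query 1 (the exact query text is an assumption; the country filters match those discussed below):

    SELECT p3.name, COUNT(p2) AS common_friends
    FROM MATCH (p1:Person)-[:friendOf]->(p2:Person)-[:friendOf]->(p3:Person)
    WHERE p1.country = 'Italy' AND p3.country = 'France'
    GROUP BY p3
    ORDER BY common_friends DESC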
Loading the graph in memory can use accurate heuristic techniques (such as data sampling) to estimate the size of the graph before the actual loading into memory. Given the estimation, the resource manager has the freedom to decide how the memory should be reserved and allocated on the machines of the distributed system to better serve the needs, balance, and/or performance of the system—i.e., graph loading is rather flexible in terms of memory placement.
However, the graph query executes on top of the already placed graph; hence, the produced intermediate and final results are deterministically placed based on the graph partitioning and placement (e.g., the results of this query will be stored on the machines that hold the last matched element—the ‘p3’ vertices). In addition, it is practically impossible to have an accurate prediction of how much memory the query will consume before execution. In this example, the placement of persons with ‘p1.country=“Italy” AND p3.country=“France”’ will dictate where the data will go. However, even if these persons are all placed on a single machine out of a 16-machine cluster, for example, the resource manager could decide to execute GROUP BY on all 16 machines if the single machine is under memory pressure. The GROUP BY operation typically relies on hash functions to place data on machines; therefore, almost any implementation can support this (or the tabular data could be reshuffled before executing GROUP BY in an alternative implementation).
For any part of a query's execution (matching, ordering, grouping), the system may run out of memory. Therefore, the resource manager tracks and properly “negotiates” with the control plane of the cluster if memory can be added to the system and how much, in the form of additional machines.
Reservation-Driven Memory Tracking Model with Centralized Memory Utilization Tracking
The illustrative embodiments introduce a resource manager design for distributed graphs that relies on a reservation-driven memory tracking model. This model enables the resource manager to have centralized, global, accurate, and up-to-date memory usage information. It also keeps a low overhead for tracking individual allocations and checking them against user-defined resource limits. In the rest of this disclosure, any user command, such as a graph query, is executed by a job in the distributed graph engine. Jobs request memory from the resource manager. In one embodiment, key aspects of the design are as follows:
In one embodiment, memory reservations are tracked for each job running in the cluster as follows:
In various embodiments, the resource manager can be implemented either as a centralized or a decentralized component. In the centralized implementation, a leader machine receives and processes all reservation requests from jobs executing in the cluster. In the decentralized implementation, every machine implements the resource manager logic. In one embodiment, the resource manager comprises a plurality of resource manager instances running on respective machines in the cluster, and each resource manager instance propagates changes to the amount of available cluster memory and the amount of available machine memory in each machine in the cluster to other resource manager instances using a replication protocol. An example of the replication protocol is a consensus protocol, such as Raft, which handles agreement. In one alternative embodiment, the resource manager core can be built as a completely external component to the graph runtime system, communicating, e.g., via REST application programming interfaces (APIs).
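As an illustrative, non-limiting sketch of the decentralized variant, each resource manager instance may apply memory-accounting deltas only after the consensus protocol has ordered them; all class and method names below are hypothetical:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: a delta to one machine's reserved-memory accounting.
    final class MemoryDelta {
        final String machine;
        final long reservedDeltaBytes; // positive: memory reserved; negative: memory released

        MemoryDelta(String machine, long reservedDeltaBytes) {
            this.machine = machine;
            this.reservedDeltaBytes = reservedDeltaBytes;
        }
    }

    final class ResourceManagerInstance {
        private final Map<String, Long> availablePerMachine = new ConcurrentHashMap<>();

        void seed(String machine, long physicalBytes) {
            availablePerMachine.put(machine, physicalBytes);
        }

        // Invoked only after the consensus protocol (e.g., Raft) commits the delta,
        // so every instance applies the same deltas in the same order.
        void apply(MemoryDelta d) {
            availablePerMachine.merge(d.machine, -d.reservedDeltaBytes, Long::sum);
        }

        long availableOn(String machine) {
            return availablePerMachine.getOrDefault(machine, 0L);
        }
    }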
If the resource manager detects that a job cannot be executed due to lack of memory, it puts the job on hold and will take one or more of several potential actions (described later and dependent on the specific graph system) to find memory and try to resume this job. Putting the job on hold and resuming it later can be implemented in two main ways:
The resource manager of the illustrative embodiments works with both approaches. When a job is put on hold, all the job's reservations are released to allow for other useful work to proceed, but the resource manager keeps track of how much memory was used for when the on-hold job is continued.
Different graph operations can have very different memory usage characteristics:

Graph loading: allocations include intermediate data structures while loading and the final graph structure once completed; memory placement is fully flexible for graph vertices, edges, and properties; and the memory can be quite accurately estimated before execution.

Vertex filter (filtering vertices to get a vertex set on top of the graph): allocations include a set containing the vertex identifiers (IDs) to be included in the set (maximum size = number of vertices); memory placement must follow the exact same partitioning as the underlying graph; and maximum memory can be estimated before execution.

PageRank graph algorithm: allocations include a graph property array (of numerical type double) that will hold the PageRank output value per index (most implementations also include a second, temporary array, also of type double); memory placement must follow the exact same partitioning as the underlying graph; and required memory is known before execution.

Query with GROUP BY: allocations are implementation dependent but in principle include memory for holding intermediate results of pattern matching, memory for hash maps for GROUP BY, and the final memory produced; pattern matching produces results based on the graph partitioning, the pattern, and the filters, while memory placement for GROUP BY is typically quite flexible; such a query is not estimation friendly for the pattern matching part, but GROUP BY memory is capped by the size of the intermediate results.
Fulfilling these requirements becomes even more complex if the resource manager accounts for the out-of-memory and job-placed-on-hold situations that these jobs must support. To capture these requirements in an intuitive way, the illustrative embodiments introduce three types of reservations as follows:
Pre-reservation: a job X requires some amount Y of memory. Machine locality is not required. It is guaranteed that the cluster currently holds at least Y memory. Y memory is required anywhere in the cluster. Pre-reserved memory will subsequently be bound to one or more machines before the job can allocate memory from the pre-reserved memory.
Flexible reservation: a job X requires some amount Y of memory on machine Z. Machine locality is required but may change upon on-hold/restart. It is guaranteed that the cluster currently holds at least Y memory. It is guaranteed that at least Y memory resides on machine Z. There is no guarantee where the memory resides if the job is restarted. Y memory is required on machine Z, but upon the job being put on hold, this requirement can be moved.
Fixed reservation: a job X requires some amount Y of memory on machine Z. Machine locality is required and must stay the same upon on-hold/restart. It is guaranteed that the cluster currently holds at least Y memory. It is guaranteed that at least Y memory resides on machine Z. It is guaranteed that once the job is restarted, Y memory resides on machine Z. Y memory is required on machine Z for this job to succeed.
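These three reservation types might surface in a job-facing request structure along the following lines (a sketch; all names are assumptions rather than part of the embodiments):

    // Hypothetical sketch of a reservation request carrying job X, amount Y,
    // and, for flexible/fixed reservations, machine Z.
    enum ReservationType { PRE_RESERVATION, FLEXIBLE, FIXED }

    final class ReservationRequest {
        final long jobId;           // job X
        final long bytes;           // amount Y
        final ReservationType type;
        final String machine;       // machine Z; null for pre-reservations

        ReservationRequest(long jobId, long bytes, ReservationType type, String machine) {
            if ((type == ReservationType.PRE_RESERVATION) != (machine == null)) {
                throw new IllegalArgumentException(
                        "pre-reservations are cluster-wide; flexible and fixed reservations name a machine");
            }
            this.jobId = jobId;
            this.bytes = bytes;
            this.type = type;
            this.machine = machine;
        }
    }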
As an example, consider a graph query, Query 3, that performs pattern matching, GROUP BY, and ORDER BY, as follows:
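One illustrative shape for such a query, again using the assumed Person and friendOf labels (the query text is a sketch, not the exact Query 3 of the embodiments):

    SELECT p3.country, COUNT(*) AS num_matches
    FROM MATCH (p1:Person)-[:friendOf]->(p2:Person)-[:friendOf]->(p3:Person)
    WHERE p1.country = 'Italy'
    GROUP BY p3.country
    ORDER BY num_matches DESC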
Query 3 would perform the following reservations (for a simplified view of a specific implementation):
When a job tries to reserve more memory than is currently available in the cluster, the job can either be put on hold or be cancelled. Putting a job on hold happens when the resource manager expects that memory might become available (e.g., more machines might join the cluster, some other jobs running will finish and release memory, etc.). All the current reservations of the job are released, all the objects it has created are destroyed, and the job is queued waiting for memory to become available. As mentioned earlier, in alternative implementations, jobs can release their allocations but save their intermediate state to restore later when resumed. The job will be restarted once its reservation requirements are satisfied. Nevertheless, the graph processing system can easily hide this on-hold action from the user such that the user only observes the final result of the command/job being later successfully completed.
Cancelling a job happens when the resource manager hits the maximum amount of memory that can ever be attained by the system (i.e., the control plane informs the system that no more machines can join). All the currently reserved memory of the job is released, all the objects it has created are destroyed, and the job is not requeued. In this case, the user receives an actual out-of-memory error message.
A job can reserve memory using any of the three methods described above, multiple times during its execution, depending on its requirements. A job cannot unreserve memory explicitly. This is to guarantee that the job always makes progress. A job can ask the resource manager to reduce a reservation by trimming the reservation, but this is not guaranteed. The behavior is similar to asking a Java™ runtime to invoke the garbage collector. A job will convert any pre-reserved memory into either a flexible or fixed reservation before performing any allocations from the reservation. This can be done explicitly, by specifying which machine the memory must reside on, or implicitly, by asking the resource manager to distribute the memory as best as it can.
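The conversion of pre-reserved memory before allocation might be tracked per job roughly as follows (a minimal sketch with hypothetical bookkeeping; the implicit variant, where the resource manager chooses the machines, is omitted for brevity):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical per-job bookkeeping: pre-reserved memory must be bound to a
    // machine (flexible or fixed) before any allocation can be served from it.
    final class JobReservations {
        private long preReservedBytes;                                // not yet bound to any machine
        private final Map<String, Long> boundBytes = new HashMap<>(); // per-machine reservations

        JobReservations(long preReservedBytes) {
            this.preReservedBytes = preReservedBytes;
        }

        // Explicit conversion: the job names the machine the memory must reside on.
        void bind(String machine, long bytes) {
            if (bytes > preReservedBytes) {
                throw new IllegalStateException("conversion exceeds the pre-reservation");
            }
            preReservedBytes -= bytes;
            boundBytes.merge(machine, bytes, Long::sum);
        }

        // Allocations are legal only against memory already bound to a machine.
        void allocate(String machine, long bytes) {
            long bound = boundBytes.getOrDefault(machine, 0L);
            if (bytes > bound) {
                throw new IllegalStateException("allocation exceeds the bound reservation");
            }
            boundBytes.put(machine, bound - bytes);
        }
    }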
Because a job will subsequently convert any pre-reserved memory into either a flexible or fixed reservation before performing any allocations from the reservation, operation then proceeds to block 204 to determine whether the same locality is required on resume if the job is put on hold. Similarly, if the job determines that locality does matter (block 201: YES), then the job determines whether the same locality is required on resume (block 204). If the same locality is required on resume (block 204: YES), then the job makes a fixed reservation request (block 205). The fixed reservation request guarantees that the required amount of memory is available on a specified machine and that the required amount of memory will reside on the specified machine if the job is restarted. If the same locality is not required on resume (block 204: NO), then the job makes a flexible reservation (block 206). The flexible reservation request guarantees that the required amount of memory is available on a specified machine but does not guarantee that the required amount of memory will reside on the specified machine if the job is restarted.
During execution of a graph processing operation, such as a query, the job may determine that more memory is required. Thus, after making the fixed reservation in block 205 or the flexible reservation in block 206, operation may proceed to block 201 for the next memory reservation.
In blocks 203, 205, and 206, the amount of memory is reserved with the resource manager, which performs centralized tracking of memory utilization across those machines; however, reserving memory does not allocate memory to the machines. Therefore, after making the fixed reservation in block 205 or making the flexible reservation in block 206, the job performs an allocation of the required amount of memory on the specified machine (block 207). Operation may then proceed to block 201 for the next memory reservation.
When the job completes execution of the graph processing operation, then the job releases reserved memory (block 208), and operation ends (block 209).
If the job determines that memory is not available (block 202: NO), then the job determines whether memory can be added (block 210). The job may determine whether memory can be added based on whether the resource manager returns an out-of-memory error, for example. If the job determines that memory can be added (block 210: YES), then the job is put on hold (block 211). The job may be restarted from the on-hold state. When a job is put on hold, all the job's reservations are released to allow for other useful work to proceed, but the resource manager keeps track of how much memory was used for when the on-hold job is continued. If memory cannot be added (block 210: NO), then operation ends (block 209).
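The walkthrough above may be summarized in a short control-flow sketch; the block numbers follow the figure, and every interface and method name below is hypothetical:

    // Hypothetical sketch of the reservation decision flow described above.
    interface MemoryRequest {
        boolean localityMatters();
        boolean sameLocalityOnResume();
        long bytes();
        String machine();
    }

    interface ResourceManagerApi {
        boolean memoryAvailable(long bytes);                          // block 202
        boolean memoryCanBeAdded();                                   // block 210
        void preReserve(long jobId, long bytes);                      // block 203
        void reserveFixed(long jobId, long bytes, String machine);    // block 205
        void reserveFlexible(long jobId, long bytes, String machine); // block 206
        void allocate(long jobId, long bytes, String machine);        // block 207
        void putOnHold(long jobId);                                   // block 211
    }

    final class ReservationFlow {
        static void reserveStep(long jobId, MemoryRequest req, ResourceManagerApi rm) {
            if (!req.localityMatters()) {                             // block 201: NO
                if (!rm.memoryAvailable(req.bytes())) {               // block 202: NO
                    if (rm.memoryCanBeAdded()) rm.putOnHold(jobId);   // blocks 210/211
                    return;                                           // otherwise: end (block 209)
                }
                rm.preReserve(jobId, req.bytes());                    // block 203
                // Pre-reserved memory is converted below before any allocation (block 204).
            }
            if (req.sameLocalityOnResume()) {                         // block 204: YES
                rm.reserveFixed(jobId, req.bytes(), req.machine());   // block 205
            } else {                                                  // block 204: NO
                rm.reserveFlexible(jobId, req.bytes(), req.machine()); // block 206
            }
            rm.allocate(jobId, req.bytes(), req.machine());           // block 207
        }
    }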
To reserve memory, a job submits a reservation request to the resource manager. The request contains information about the required amount of memory, the type of reservation, and the job that is making the reservation. The resource manager evaluates the feasibility of the request and sends a response to confirm or deny the request. If the request is denied, then the job cannot continue and is put on hold. Although the resource manager knows that Y amount of memory is required on machine Z, when a job is put on hold, the requirement may become stale.
A job might not accurately know its memory requirements. Therefore, in some embodiments, the job might try to allocate more memory than it has reserved (or might not reserve memory explicitly at all). In some embodiments, any allocation greater than the existing memory reservation will implicitly attempt to increase the existing reservation. This may be transparent to the rest of the job implementation.
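A minimal sketch of such an implicit increase, assuming hypothetical per-job accounting and a configurable reservation granularity:

    // Hypothetical sketch: an allocation beyond the reservation grows the reservation first.
    final class JobAllocator {
        private long reservedBytes;
        private long allocatedBytes;
        private final long granularityBytes; // mid/coarse increment, e.g., 256 MB

        JobAllocator(long granularityBytes) {
            this.granularityBytes = granularityBytes;
        }

        void allocate(long bytes) {
            long needed = allocatedBytes + bytes - reservedBytes;
            if (needed > 0) {
                // Round the implicit increase up to the reservation granularity
                // to limit round-trips to the resource manager.
                long increment = ((needed + granularityBytes - 1) / granularityBytes) * granularityBytes;
                requestIncrease(increment); // may put the job on hold if it fails
                reservedBytes += increment;
            }
            allocatedBytes += bytes;        // local allocation, no coordination needed
        }

        private void requestIncrease(long bytes) {
            // Reservation request to the resource manager (omitted in this sketch).
        }
    }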
The resource manager 350 determines whether the reservation is possible (block 302). If the reservation is possible (block 302: YES), then the resource manager returns a response to process 2 indicating that a reservation of 3 GB is successful (block 303). In some embodiments, memory reservations are created and increased in mid/coarse increments to reduce the number of requests sent to the resource manager. This parameter is configurable and transparent to the implementation of any job. As mentioned above, in a cluster of server machines with more than 128 GB of memory each, the reservation granularity could be set between 128 MB and 512 MB of memory. When a job submits a reservation request, it specifies the minimum amount of memory required (2.5 GB in the depicted example).
If the reservation is not possible (block 302: NO), then the resource manager notifies the other processes (process 1 311, . . . , process N 313) that job 1 is put on hold (block 304) and returns a response to process 2 indicating that the reservation fails and that job 1 is put on hold (block 305). Operation repeats for each process attempting to reserve memory until the job is put on hold or the job completes the graph processing operation.
When a job is put on hold and resumed, its existing reservations are converted in the following manner: a pre-reservation is maintained across restarts; flexible reservations are converted to pre-reservations on restart; and fixed reservations are maintained on restart. In some embodiments, the resource manager supports setting limits on memory consumption. The limits are enforced in the following manner:
There are certain memory objects that outlive the jobs that create them, such as graphs, tables/frames, query result sets, algorithm output, etc. The most prominent example is loading a graph from files or from a relational database, which translates to a job that creates the graph data structure and, upon completion, makes the graph available to the user for processing. The resource manager tracks these objects in order to accurately reflect the available memory in the cluster. The resource manager does this in the following manner:
In that sense, the job is like a transaction: any object created from reserved memory may be either destroyed or committed to the resource manager at the end of its execution.
Thereafter, job 1 sends a reservation request to resource manager 350 to request reservation for 0.5 GB of memory, and the resource manager 350 sends a response back to job 1 granting the reservation (block 403). Job 1 then allocates 200 MB of memory for a data object (obj2) (block 404). Job 1 then deletes obj1 (block 405) and commits obj2 to persistent object memory (block 406). Resource manager 350 tracks 200 MB of persistent object memory for obj2. Thus, even after job 1 ends (block 407), the resource manager 350 ensures that the 200 MB for obj2 remains unavailable for subsequent reservations.
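The transaction-like lifecycle of the example might be captured by bookkeeping along these lines (a sketch; method names are assumptions):

    // Hypothetical sketch: committed objects move out of the job's reservation
    // and into persistent object memory, which outlives the job.
    final class ObjectLedger {
        private long reservedBytes;         // job-scoped; released when the job ends
        private long persistentObjectBytes; // outlives the job; tracked by the RM

        void reserve(long bytes) { reservedBytes += bytes; }

        // Destroying an object needs no accounting change: its memory simply
        // remains part of the job's reservation until the job ends.
        void destroy(long objectBytes) { /* no-op in this sketch */ }

        // Committing an object transfers its memory from the job's reservation
        // to persistent object memory, keeping it unavailable after the job ends.
        void commit(long objectBytes) {
            reservedBytes -= objectBytes;
            persistentObjectBytes += objectBytes;
        }

        // At job end, the remaining reservation is released; committed objects remain.
        void endJob() { reservedBytes = 0; }

        long persistentBytes() { return persistentObjectBytes; }
    }

With the numbers of the example, committing obj2 transfers 200 MB into persistent object memory, which is exactly what resource manager 350 continues to track after job 1 ends at block 407.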
In accordance with the illustrative embodiment, the system adheres to the following:
available_cluster_memory = physical_cluster_memory - reserved_cluster_memory - persistent_object_memory

available_machine_memory[machine X] = physical_memory[machine X] - reserved_memory[machine X] - persistent_object_memory[machine X]
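These two invariants translate directly into the accounting the resource manager might maintain (a sketch; the field names simply mirror the formulas above):

    import java.util.Map;

    // Sketch mirroring the two availability formulas above.
    final class MemoryAccounting {
        long physicalClusterMemory;
        long reservedClusterMemory;
        long persistentObjectMemory;
        Map<String, Long> physicalMemory;
        Map<String, Long> reservedMemory;
        Map<String, Long> persistentObjectMemoryPerMachine;

        long availableClusterMemory() {
            return physicalClusterMemory - reservedClusterMemory - persistentObjectMemory;
        }

        long availableMachineMemory(String machineX) {
            return physicalMemory.get(machineX)
                    - reservedMemory.get(machineX)
                    - persistentObjectMemoryPerMachine.get(machineX);
        }
    }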
When a job tries to implicitly increase its reservation (i.e., tries to allocate more memory than it had reserved), it is possible that it is put on hold. If the allocations are small, and there is memory contention, then simply using the existing reservation as a pre-requirement for restarting the job can lead to premature wakeups. For example, a job tries to execute a sequence of 100 allocations, each allocating 500 MB. If the reservation granularity is 1 GB, then, in the extreme case when all reservations fail, the job could be put on hold and resumed up to 50 times. Thus, in accordance with one embodiment, the resource manager increases the reservation requirements for jobs put on hold from implicit reservations using a configurable function. For instance, a simple exponential factor can be configured, with a default value of 2×. This behavior is similar to dynamically sized containers in most modern programming languages, which use this technique to reduce the number of reallocations.
As shown in the depicted example, upon restart and reservation number 1, the reservation increment without the factor is 1 GB and, thus, the amount of memory reserved is 1 GB; however, with the 2× factor, the reservation increment is 2 GB, and the amount of memory reserved is 2 GB. After restart number 2, the amount of memory reserved without the factor is 2 GB, and the amount of memory reserved with the factor is 4 GB. This continues until restart number 6, at which point the amount of memory reserved without the factor is 6 GB, and the amount of memory reserved with the 2× factor is 64 GB. Thus, the required amount of 50 GB is achieved after 6 restarts using the 2× factor, while 50 restarts are required to achieve the required amount of 50 GB without the 2× factor.
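The arithmetic of this example can be verified with a short computation; the constants below come directly from the example (50 GB required, 1 GB base increment, 2× factor):

    // Verifies the worked example: reservations grow as 2, 4, 8, ... GB with the
    // 2x factor, versus 1, 2, 3, ... GB without it.
    public class GrowthCheck {
        public static void main(String[] args) {
            final long GB = 1L << 30;
            final long required = 50 * GB;

            int restartsWithFactor = 0;
            for (long reserved = 0; reserved < required; restartsWithFactor++) {
                reserved = (reserved == 0) ? 2 * GB : reserved * 2;
            }
            int restartsWithoutFactor = 0;
            for (long reserved = 0; reserved < required; restartsWithoutFactor++) {
                reserved += GB;
            }
            System.out.println(restartsWithFactor);    // 6  (64 GB reserved)
            System.out.println(restartsWithoutFactor); // 50 (50 GB reserved)
        }
    }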
In some embodiments, the resource manager uses the configurable function to increase the reservation requirements in response to detecting a job that is put on hold and restarted due to implicit reservations and memory contention.
Preventing Live-Locks when Jobs are Put on Hold
In many systems, user commands are executed with some form of first-come-first-served policy. Some systems further offer command/job priorities, which allow high-priority jobs to bypass lower-priority jobs when the system is overloaded. Additionally, in the presence of contention, not all commands are typically submitted immediately, i.e., jobs are queued. One embodiment provides a simple but effective policy for avoiding live-locks where newer incoming jobs would prohibit on-hold jobs from completing.
Then, at timestamp 2, the available memory is 10 GB (100 GB - 90 GB for job C), the on-hold jobs still include job A attempting to reserve 250 GB, there are no new jobs, and the expected extra memory is 2×100 GB (two machines expected to be added to the cluster with 100 GB each due to elasticity). As seen in this example, job C used the 90 GB, which prevented job A from reserving the requested 250 GB, thus resulting in job A being put on hold again.
At timestamp 3, the available memory is 210 GB (10 GB + 200 GB added due to elasticity), the on-hold jobs still include job A attempting to reserve 250 GB, a new job D attempts to reserve 200 GB, and the expected extra memory is 1×100 GB (one machine expected to be added to the cluster with 100 GB due to elasticity). As seen in this example, two machines are added to the cluster with 100 GB each, but job A still cannot start due to the amount of available memory being insufficient to satisfy the reservation request.
At timestamp 4, the available memory is 10 GB (210 GB - 200 GB for job D), the on-hold jobs still include job A attempting to reserve 250 GB, there are no new jobs, and the expected extra memory is 1×100 GB (one machine expected to be added to the cluster with 100 GB due to elasticity). As seen in this example, job D used the 200 GB, and job A again fails to reserve the requested 250 GB, thus resulting in job A being put on hold again.
At timestamp 5, the available memory is 110 GB (10 GB + 100 GB added due to elasticity), the on-hold jobs still include job A, there are no new jobs, and the expected extra memory is 0 GB. The one machine with 100 GB joined the cluster, but the system has reached the maximum memory consumption and cannot add more machines. Job A is starved of resources, even though jobs C and D were able to start and reserve memory.
In accordance with an illustrative embodiment, the system allows jobs to be submitted as long as the system would normally submit them; however, if the resource manager detects potential unfairness, it automatically reverts to a more pessimistic execution model until the affected jobs have the opportunity to complete. Thus, in one embodiment, adding new machines is correlated with jobs that cannot progress otherwise. The resource manager goes into pessimistic mode when either: the elastic machine additions complete and the “requestor” jobs still cannot execute, or the maximum possible growth of memory cannot go further after the current incoming machines are added. For instance, in the example above, the resource manager would enter pessimistic mode at timestamp 5, when the cluster has reached its maximum memory consumption and job A still cannot reserve the requested 250 GB.
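The two triggers might be expressed as follows (a sketch; the parameters are hypothetical summaries of the cluster state):

    // Hypothetical sketch of the two pessimistic-mode triggers described above.
    final class PessimisticModePolicy {
        static boolean shouldEnterPessimisticMode(boolean elasticAdditionsComplete,
                                                  boolean requestorJobsStillBlocked,
                                                  long currentMemoryBytes,
                                                  long incomingMemoryBytes,
                                                  long maxClusterMemoryBytes) {
            // Trigger 1: the machines requested for blocked jobs arrived,
            // yet those jobs still cannot execute.
            boolean additionsDidNotHelp = elasticAdditionsComplete && requestorJobsStillBlocked;
            // Trigger 2: even after the incoming machines join, the cluster
            // cannot grow any further.
            boolean noFurtherGrowth = currentMemoryBytes + incomingMemoryBytes >= maxClusterMemoryBytes;
            return additionsDidNotHelp || noFurtherGrowth;
        }
    }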
The pessimistic mode, in its simpler form, does not allow any new jobs to execute until the on-hold jobs have had an opportunity. The pessimistic mode can be made more effective by:
The above-mentioned policies can be adjusted, or different policies can be devised to serve the business requirements of the graph processing system deployment. For instance, the most pessimistic policy could switch to single command execution the moment there is a single on-hold job.
A database management system (DBMS) manages a database. A DBMS may comprise one or more database servers. A database comprises database data and a database dictionary that are stored on a persistent memory mechanism, such as a set of hard disks. Database data may be stored in one or more collections of records. The data within each record is organized into one or more attributes. In relational DBMSs, the collections are referred to as tables (or data frames), the records are referred to as rows, and the attributes are referred to as columns. In a document DBMS (“DOCS”), a collection of records is a collection of documents, each of which may be a data object marked up in a hierarchical-markup language, such as a JSON object or XML document. The attributes are referred to as JSON fields or XML elements. A relational DBMS may also store hierarchically marked data objects; however, the hierarchically marked data objects are contained in an attribute of a record, such as a JSON-typed attribute.
Users interact with a database server of a DBMS by submitting to the database server commands that cause the database server to perform operations on data stored in a database. A user may be one or more applications running on a client computer that interacts with a database server. Multiple users may also be referred to herein collectively as a user.
A database command may be in the form of a database statement that conforms to a database language. A database language for expressing the database commands is the Structured Query Language (SQL). There are many different versions of SQL; some versions are standard and some proprietary, and there are a variety of extensions. Data definition language (“DDL”) commands are issued to a database server to create or configure data objects referred to herein as database objects, such as tables, views, or complex data types. SQL/XML is a common extension of SQL used when manipulating XML data in an object-relational database. Another database language for expressing database commands is Spark™ SQL, which uses a syntax based on function or method invocations.
In a DOCS, a database command may be in the form of functions or object method calls that invoke CRUD (Create Read Update Delete) operations. An example of an API for such functions and method calls is MQL (MongoDB™ Query Language). In a DOCS, database objects include a collection of documents, a document, a view, or fields defined by a JSON schema for a collection. A view may be created by invoking a function provided by the DBMS for creating views in a database.
Changes to a database in a DBMS are made using transaction processing. A database transaction is a set of operations that change database data. In a DBMS, a database transaction is initiated in response to a database command requesting a change, such as a DML command requesting an update, insert of a record, or a delete of a record, or a CRUD object method invocation requesting to create, update, or delete a document. DML commands, such as INSERT and UPDATE statements, specify changes to data. A DML statement or command does not refer to a statement or command that merely queries database data. Committing a transaction refers to making the changes for a transaction permanent.
Under transaction processing, all the changes for a transaction are made atomically. When a transaction is committed, either all changes are committed, or the transaction is rolled back. These changes are recorded in change records, which may include redo records and undo records. Redo records may be used to reapply changes made to a data block. Undo records are used to reverse or undo changes made to a data block by a transaction.
Examples of transactional metadata include change records that record changes made by transactions to database data. Another example of transactional metadata is embedded transactional metadata stored within the database data, the embedded transactional metadata describing transactions that changed the database data.
Undo records are used to provide transactional consistency by performing operations referred to herein as consistency operations. Each undo record is associated with a logical time. An example of logical time is a system change number (SCN). An SCN may be maintained using a Lamporting mechanism, for example. For data blocks that are read to compute a database command, a DBMS applies the needed undo records to copies of the data blocks to bring the copies to a state consistent with the snapshot time of the query. The DBMS determines which undo records to apply to a data block based on the respective logical times associated with the undo records.
In a distributed transaction, multiple DBMSs commit a distributed transaction using a two-phase commit approach. Each DBMS executes a local transaction in a branch transaction of the distributed transaction. One DBMS, the coordinating DBMS, is responsible for coordinating the commitment of the transaction on one or more other database systems. The other DBMSs are referred to herein as participating DBMSs.
A two-phase commit involves two phases: the prepare-to-commit phase and the commit phase. In the prepare-to-commit phase, a branch transaction is prepared in each of the participating database systems. When a branch transaction is prepared on a DBMS, the database is in a “prepared state” such that it can guarantee that modifications executed as part of the branch transaction to the database data can be committed. This guarantee may entail storing change records for the branch transaction persistently. A participating DBMS acknowledges when it has completed the prepare-to-commit phase and has entered a prepared state for the respective branch transaction of the participating DBMS.
In the commit phase, the coordinating database system commits the transaction on the coordinating database system and on the participating database systems. Specifically, the coordinating database system sends messages to the participants requesting that the participants commit the modifications specified by the transaction to data on the participating database systems. The participating database systems and the coordinating database system then commit the transaction.
On the other hand, if a participating database system is unable to prepare or the coordinating database system is unable to commit, then at least one of the database systems is unable to make the changes specified by the transaction. In this case, all of the modifications at each of the participants and the coordinating database system are retracted, restoring each database system to its state prior to the changes.
A client may issue a series of requests, such as requests for execution of queries, to a DBMS by establishing a database session. A database session comprises a particular connection established for a client to a database server through which the client may issue a series of requests. A database session process executes within a database session and processes requests issued by the client through the database session. The database session may generate an execution plan for a query issued by the database session client and marshal slave processes for execution of the execution plan.
The database server may maintain session state data about a database session. The session state data reflects the current state of the session and may contain the identity of the user for which the session is established, services used by the user, instances of object types, language and character set data, statistics about resource usage for the session, temporary variable values generated by processes executing software within the session, storage for cursors, variables, and other information.
A database server includes multiple database processes. Database processes run under the control of the database server (i.e., can be created or terminated by the database server) and perform various database server functions. Database processes include processes running within a database session established for a client.
A database process is a unit of execution. A database process can be a computer system process or thread or a user-defined execution context such as a user thread or fiber. Database processes may also include “database server system” processes that provide services and/or perform functions on behalf of the entire database server. Such database server system processes include listeners, garbage collectors, log writers, and recovery processes.
A multi-node database management system is made up of interconnected computing nodes (“nodes”), each running a database server that shares access to the same database. Typically, the nodes are interconnected via a network and share access, in varying degrees, to shared storage, e.g., shared access to a set of disk drives and data blocks stored thereon. The nodes in a multi-node database system may be in the form of a group of computers (e.g., workstations, personal computers) that are interconnected via a network. Alternately, the nodes may be the nodes of a grid, which is composed of nodes in the form of server blades interconnected with other server blades on a rack.
Each node in a multi-node database system hosts a database server. A server, such as a database server, is a combination of integrated software components and an allocation of computational resources, such as memory, a node, and processes on the node for executing the integrated software components on a processor, the combination of the software and computational resources being dedicated to performing a particular function on behalf of one or more clients.
Resources from multiple nodes in a multi-node database system can be allocated to running a particular database server's software. Each combination of the software and allocation of resources from a node is a server that is referred to herein as a “server instance” or “instance.” A database server may comprise multiple database instances, some or all of which are running on separate computers, including separate server blades.
A database dictionary may comprise multiple data structures that store database metadata. A database dictionary may, for example, comprise multiple files and tables. Portions of the data structures may be cached in main memory of a database server.
When a database object is said to be defined by a database dictionary, the database dictionary contains metadata that defines properties of the database object. For example, metadata in a database dictionary defining a database table may specify the attribute names and data types of the attributes, and one or more files or portions thereof that store data for the table. Metadata in the database dictionary defining a procedure may specify a name of the procedure, the procedure's arguments and the return data type, and the data types of the arguments, and may include source code and a compiled version thereof.
A database object may be defined by the database dictionary, but the metadata in the database dictionary itself may only partly specify the properties of the database object. Other properties may be defined by data structures that may not be considered part of the database dictionary. For example, a user-defined function implemented in a JAVA class may be defined in part by the database dictionary by specifying the name of the user-defined function and by specifying a reference to a file containing the source code of the Java class (i.e., .java file) and the compiled version of the class (i.e., .class file).
Native data types are data types supported by a DBMS “out-of-the-box.” Non-native data types, on the other hand, may not be supported by a DBMS out-of-the-box. Non-native data types include user-defined abstract types or object classes. Non-native data types are only recognized and processed in database commands by a DBMS once the non-native data types are defined in the database dictionary of the DBMS, by, for example, issuing DDL statements to the DBMS that define the non-native data types. Native data types do not have to be defined by a database dictionary to be recognized as valid data types and to be processed by a DBMS in database statements. In general, database software of a DBMS is programmed to recognize and process native data types without configuring the DBMS to do so by, for example, defining a data type by issuing DDL statements to the DBMS.
According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
For example, FIG. 7 is a block diagram that illustrates a computer system 700 upon which an embodiment of the invention may be implemented. Computer system 700 includes a bus 702 or other communication mechanism for communicating information, and a hardware processor 704 coupled with bus 702 for processing information. Hardware processor 704 may be, for example, a general-purpose microprocessor.
Computer system 700 also includes a main memory 706, such as a random-access memory (RAM) or other dynamic storage device, coupled to bus 702 for storing information and instructions to be executed by processor 704. Main memory 706 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 704. Such instructions, when stored in non-transitory storage media accessible to processor 704, render computer system 700 into a special-purpose machine that is customized to perform the operations specified in the instructions.
Computer system 700 further includes a read only memory (ROM) 708 or other static storage device coupled to bus 702 for storing static information and instructions for processor 704. A storage device 710, such as a magnetic disk, optical disk, or solid-state drive is provided and coupled to bus 702 for storing information and instructions.
Computer system 700 may be coupled via bus 702 to a display 712, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 714, including alphanumeric and other keys, is coupled to bus 702 for communicating information and command selections to processor 704. Another type of user input device is cursor control 716, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 704 and for controlling cursor movement on display 712. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Computer system 700 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 700 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 700 in response to processor 704 executing one or more sequences of one or more instructions contained in main memory 706. Such instructions may be read into main memory 706 from another storage medium, such as storage device 710. Execution of the sequences of instructions contained in main memory 706 causes processor 704 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term “storage media” as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical disks, magnetic disks, or solid-state drives, such as storage device 710. Volatile media includes dynamic memory, such as main memory 706. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, or any other memory chip or cartridge.
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 702. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 704 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 700 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 702. Bus 702 carries the data to main memory 706, from which processor 704 retrieves and executes the instructions. The instructions received by main memory 706 may optionally be stored on storage device 710 either before or after execution by processor 704.
Computer system 700 also includes a communication interface 718 coupled to bus 702. Communication interface 718 provides a two-way data communication coupling to a network link 720 that is connected to a local network 722. For example, communication interface 718 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 718 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 718 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 720 typically provides data communication through one or more networks to other data devices. For example, network link 720 may provide a connection through local network 722 to a host computer 724 or to data equipment operated by an Internet Service Provider (ISP) 726. ISP 726 in turn provides data communication services through the world-wide packet data communication network now commonly referred to as the “Internet” 728. Local network 722 and Internet 728 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 720 and through communication interface 718, which carry the digital data to and from computer system 700, are example forms of transmission media.
Computer system 700 can send messages and receive data, including program code, through the network(s), network link 720 and communication interface 718. In the Internet example, a server 730 might transmit a requested code for an application program through Internet 728, ISP 726, local network 722 and communication interface 718.
The received code may be executed by processor 704 as it is received, and/or stored in storage device 710, or other non-volatile storage for later execution.
Software system 800 is provided for directing the operation of computer system 700. Software system 800, which may be stored in system memory (RAM) 706 and on fixed storage (e.g., hard disk or flash memory) 710, includes a kernel or operating system (OS) 810.
The OS 810 manages low-level aspects of computer operation, including managing execution of processes, memory allocation, file input and output (I/O), and device I/O. One or more application programs, represented as 802A, 802B, 802C . . . 802N, may be “loaded” (e.g., transferred from fixed storage 710 into memory 706) for execution by the system 800. The applications or other software intended for use on computer system 700 may also be stored as a set of downloadable computer-executable instructions, for example, for downloading and installation from an Internet location (e.g., a Web server, an app store, or other online service).
Software system 800 includes a graphical user interface (GUI) 815, for receiving user commands and data in a graphical (e.g., “point-and-click” or “touch gesture”) fashion. These inputs, in turn, may be acted upon by the system 800 in accordance with instructions from operating system 810 and/or application(s) 802. The GUI 815 also serves to display the results of operation from the OS 810 and application(s) 802, whereupon the user may supply additional inputs or terminate the session (e.g., log off).
OS 810 can execute directly on the bare hardware 820 (e.g., processor(s) 704) of computer system 700. Alternatively, a hypervisor or virtual machine monitor (VMM) 830 may be interposed between the bare hardware 820 and the OS 810. In this configuration, VMM 830 acts as a software “cushion” or virtualization layer between the OS 810 and the bare hardware 820 of the computer system 700.
VMM 830 instantiates and runs one or more virtual machine instances (“guest machines”). Each guest machine comprises a “guest” operating system, such as OS 810, and one or more applications, such as application(s) 802, designed to execute on the guest operating system. The VMM 830 presents the guest operating systems with a virtual operating platform and manages the execution of the guest operating systems.
In some instances, the VMM 830 may allow a guest operating system to run as if it is running on the bare hardware 820 of computer system 700 directly. In these instances, the same version of the guest operating system configured to execute on the bare hardware 820 directly may also execute on VMM 830 without modification or reconfiguration. In other words, VMM 830 may provide full hardware and CPU virtualization to a guest operating system in some instances.
In other instances, a guest operating system may be specially designed or configured to execute on VMM 830 for efficiency. In these instances, the guest operating system is “aware” that it executes on a virtual machine monitor. In other words, VMM 830 may provide para-virtualization to a guest operating system in some instances.
A computer system process comprises an allotment of hardware processor time, and an allotment of memory (physical and/or virtual), the allotment of memory being for storing instructions executed by the hardware processor, for storing data generated by the hardware processor executing the instructions, and/or for storing the hardware processor state (e.g., content of registers) between allotments of the hardware processor time when the computer system process is not running. Computer system processes run under the control of an operating system and may run under the control of other programs being executed on the computer system.
The term “cloud computing” is generally used herein to describe a computing model which enables on-demand access to a shared pool of computing resources, such as computer networks, servers, software applications, and services, and which allows for rapid provisioning and release of resources with minimal management effort or service provider interaction.
A cloud computing environment (sometimes referred to as a cloud environment, or a cloud) can be implemented in a variety of different ways to best suit different requirements. For example, in a public cloud environment, the underlying computing infrastructure is owned by an organization that makes its cloud services available to other organizations or to the general public. In contrast, a private cloud environment is generally intended solely for use by, or within, a single organization. A community cloud is intended to be shared by several organizations within a community; while a hybrid cloud comprises two or more types of cloud (e.g., private, community, or public) that are bound together by data and application portability.
Generally, a cloud computing model enables some of those responsibilities which previously may have been provided by an organization's own information technology department, to instead be delivered as service layers within a cloud environment, for use by consumers (either within or external to the organization, according to the cloud's public/private nature). Depending on the particular implementation, the precise definition of components or features provided by or within each cloud service layer can vary, but common examples include: Software as a Service (SaaS), in which consumers use software applications that are running upon a cloud infrastructure, while a SaaS provider manages or controls the underlying cloud infrastructure and applications. Platform as a Service (PaaS), in which consumers can use software programming languages and development tools supported by a PaaS provider to develop, deploy, and otherwise control their own applications, while the PaaS provider manages or controls other aspects of the cloud environment (i.e., everything below the run-time execution environment). Infrastructure as a Service (IaaS), in which consumers can deploy and run arbitrary software applications, and/or provision processing, storage, networks, and other fundamental computing resources, while an IaaS provider manages or controls the underlying physical cloud infrastructure (i.e., everything below the operating system layer). Database as a Service (DBaaS), in which consumers use a database server or Database Management System that is running upon a cloud infrastructure, while a DBaaS provider manages or controls the underlying cloud infrastructure, applications, and servers, including one or more database servers.
In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.