A database system may include data that is organized in various tables. Each table typically includes one or more rows (also known as tuples or records), each of which contains a set of related data (e.g. data related to a single entity). The data for each row may be arranged in a series of columns, or fields, where each column contains a particular type of data (e.g. a particular characteristic of the entity).
A table may contain data that is related to data in another table. For example, in a first table, each row may represent an individual item (e.g. a person, object, or event). In a second table, each row may represent a classification group (e.g. an organization to which a person belongs, a place where an object may be located, a time period in which an event may occur). The tables of a database may thus be related to one another: a column of the first table may associate each individual item represented in that table, by way of a reference, with one of the classification groups in the second table.
A query to the database may retrieve data that is related in a defined manner from different tables of the database. The query may be expressed in SQL (Structured Query Language) or in another form, and may be represented as a joining of the tables that are addressed by the query. For example, two tables may be joined by selecting rows of each table that satisfy a criterion (e.g. a particular column value) to form rows of a joined table. In the above example, joining the first and second tables may result in a joined table in which a row includes a characteristic of an item from the first table together with a characteristic of the group (from the second table) with which that item is associated. In the case of a complex join operation (e.g. where several tables are joined in a sequence of individual join operations), the join operation, and thus the query, may be optimized by modifying the order in which the individual join operations are executed.
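As a concrete illustration of such a join, the following minimal sketch (using Python's built-in sqlite3 module; the table names, column names, and data are hypothetical, not taken from this disclosure) joins a table of items to a table of classification groups by way of a reference column:

```python
import sqlite3

# In-memory database with two related tables: each item row references
# a classification group by its group_id column.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE groups (group_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE items  (item_id  INTEGER PRIMARY KEY, name TEXT,
                         group_id INTEGER REFERENCES groups(group_id));
    INSERT INTO groups VALUES (1, 'Engineering'), (2, 'Sales');
    INSERT INTO items  VALUES (10, 'Alice', 1), (11, 'Bob', 2);
""")

# Join each item with the group it belongs to: each result row pairs a
# characteristic of an item with a characteristic of its group.
for row in db.execute("""
    SELECT items.name, groups.name
    FROM items JOIN groups ON items.group_id = groups.group_id
"""):
    print(row)   # ('Alice', 'Engineering'), ('Bob', 'Sales')
```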
A given database system may have multiple database nodes (computers, workstations, special purpose computers, rack mounted computers, and so forth) and a shared storage. In other words, an object that is stored in the shared storage may be available to multiple database nodes. Having such universally addressable objects may complicate the task of purging, or deleting, objects that have exceeded their lifetimes, a process called “garbage collection.”
One way to perform garbage collection for a database system that has a shared storage is to frequently synchronize the object states across the nodes for purposes of identifying objects that are no longer in use. However, such frequent synchronization may degrade the performance of the database system.
In accordance with example implementations that are described herein, a relaxed, or “lazy,” technique may be utilized to delete objects stored in a shared storage of a database system. More specifically, in accordance with example implementations, the shared storage may store “committed” objects. As used herein, an object is “committed” when all of the nodes of a given cluster of nodes of the database system have acknowledged the parent transaction and, as such, have added the object to their respective metadata stores, or “global catalogs.” From that point on, any node in the cluster may refer to the underlying object in a query being processed by the node; and a data object that is referenced by any ongoing query, or that is referenced in a node's copy of the global catalog, may not be deleted until the ongoing query completes and the referencing table is dropped from the catalog. An object that is referenced neither by an ongoing query of any node of the cluster nor in any node's copy of the global catalog is considered “dangling” and may be purged, or deleted, by the database system's garbage collector.
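The “dangling” test described above may be sketched as follows (a minimal illustration; the class and field names are hypothetical, not taken from any particular database system):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Object IDs referenced by this node's copy of the global catalog.
    catalog_refs: set = field(default_factory=set)
    # Object IDs referenced by this node's ongoing queries.
    query_refs: set = field(default_factory=set)

def dangling_objects(all_objects: set, nodes: list) -> set:
    """An object is dangling, and thus safe to purge, only if no node's
    catalog copy and no node's ongoing query references it."""
    referenced = set()
    for node in nodes:
        referenced |= node.catalog_refs | node.query_refs
    return all_objects - referenced
```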
For purposes of the garbage collector, determining when to delete a shared object involves knowledge of the reference or references to that object across the cluster. Given a potentially large number (millions, for example) of shared objects, having each node acknowledge its “referenced set” of storage objects may be relatively expensive and, as such, may adversely affect the performance of the database system. Moreover, it may not be assumed that the union of all of the reported referenced sets is complete: unless measures are employed to prevent queries from making progress while the garbage collector deletes objects, new and unaccounted-for data objects may be created and then purged prematurely. In accordance with the garbage collection approach that is described herein, the largest subset of objects that is safe to delete without disrupting the operation of the cluster is determined, and the garbage collector deletes, or purges, the objects in this subset.
More specifically, in accordance with example implementations, each database node of a given cluster stores a copy of a global catalog, which identifies the objects that are stored in the shared storage. The global catalog changes over time, and its versions are time ordered; accordingly, different database nodes may store different versions of the global catalog at any given time. Moreover, at a given time, a given node may have one or multiple ongoing queries, each of which references one or multiple objects and is associated with the version of the global catalog that was in effect when the query started. Because the catalog versions are time ordered, an earliest catalog version may be identified for each node: all of the catalog versions associated with the node's ongoing queries may be listed according to their time ordering, and the earliest version may be selected from this list. The selected version is thus the earliest global catalog version before which no query on the node references data objects created under a prior version of the catalog. This selected version of the global catalog is referred to herein as the “Minimum Queryable Version,” or “MQV.” In other words, a node's MQV is the catalog version that was in effect at the start time of the earliest ongoing query on the node.
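Because the catalog versions are time ordered, the MQV selection reduces to taking a minimum. A minimal sketch follows (the names are hypothetical), assuming that each query records the catalog version that was in effect when it started:

```python
def minimum_queryable_version(ongoing_queries, current_version):
    """Return the node's MQV: the catalog version in effect at the start
    of the earliest ongoing query, or the node's current catalog version
    when no query is in flight."""
    if not ongoing_queries:
        return current_version
    return min(q.catalog_version_at_start for q in ongoing_queries)
```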
In accordance with example implementations, the MQVs are communicated around the cluster at regular intervals in an asynchronous gossiping process, which allows each node to effectively segment the object space by expected object lifetime. Because, in accordance with example implementations, each data object may be tagged with the version of the global catalog that was in effect at the time of the object's logical deletion (i.e. when the object was dropped from the catalog), the MQV may be used to identify, or determine, the set of data objects that may still be needed by one of the nodes of the cluster, even if the objects are no longer referenced by a table. In this manner, the asynchronous gossiping process disclosed herein allows nodes to remain aware of data objects that are no longer referenced by tables in the present but may still be needed by nodes whose ongoing queries started in the past.
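One possible form of the gossip step is sketched below (all names are hypothetical): each node periodically exchanges the MQVs it knows about with a randomly chosen peer, and each side keeps the freshest value it has seen for every node, so the views converge eventually rather than at every instant.

```python
import random

class GossipState:
    """Per-node view of the cluster's MQVs; versions are non-negative ints."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.known_mqvs = {}              # node_id -> freshest MQV seen

    def local_update(self, mqv):
        # Record this node's own latest MQV before the next gossip round.
        self.known_mqvs[self.node_id] = mqv

    def gossip_round(self, peers):
        """Exchange MQVs with one randomly chosen peer; each side keeps
        the larger (fresher) value per node."""
        peer = random.choice(peers)
        for src, dst in ((self, peer), (peer, self)):
            for node_id, mqv in src.known_mqvs.items():
                if mqv > dst.known_mqvs.get(node_id, -1):
                    dst.known_mqvs[node_id] = mqv
```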
As depicted in FIG. 1, a database system 100 may include a cluster 106 of database nodes 110, which are coupled to a shared storage 150 that stores objects 154.
Each node 110 may store a global catalog 112 (i.e., a specific copy of a global catalog used by the nodes 110 of the cluster 106) for purposes of referencing the objects 154 stored in the shared storage 150.
A given node 110 of the cluster 106 may have one or multiple ongoing queries 115 that reference tables or objects that are not visible to other nodes 110 of the cluster 106. For purposes of deleting objects from the shared storage 150 that have exceeded their lifetimes, the database system 100 includes a garbage collector 114, which is schematically depicted in FIG. 1 as part of each node 110.
Referring to FIG. 1, each node 110 may determine a corresponding MQV 130, and the nodes 110 may communicate their MQVs 130 to one another in an asynchronous gossiping process.
Thus, in accordance with some implementations, at regular intervals (decoupled from the asynchronous gossip intervals, for example), the garbage collector 114 may delete all data objects 154 whose catalog version is smaller than the minimum of the MQVs 130 in the cluster 106. Given that catalog versions are strictly increasing (i.e., monotonic), no node 110 of the cluster 106 can refer to data objects having catalog versions earlier than that node's MQV 130, and thus none earlier than the cluster-wide minimum MQV.
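The purge rule stated above may be sketched as follows (hypothetical names; catalog versions are modeled as strictly increasing integers, and removal of an object from the shared storage is abstracted as a callback):

```python
def purge(objects, cluster_mqvs, delete):
    """Delete every shared object whose catalog-version tag is smaller
    than the minimum MQV reported across the cluster; since versions
    strictly increase, no node can still reference such an object."""
    cutoff = min(cluster_mqvs.values())   # minimum MQV in the cluster
    remaining = []
    for obj in objects:
        if obj.version_tag < cutoff:
            delete(obj)                   # e.g. remove file from shared storage
        else:
            remaining.append(obj)
    return remaining
```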
Referring back to FIG. 1, in accordance with example implementations, each node 110 may also include a cluster clerk engine 116, which coordinates the communication of garbage collection-related information (such as the MQVs 130) among the nodes 110 of the cluster 106.
In accordance with example implementations, the cluster clerk engine 116 performs a technique 500 that is depicted in FIG. 5.
In accordance with example implementations, a given database node 110 may not generate or regenerate its bloom filter (which probabilistically encodes the set of data objects referenced by the node) upon a given request from the cluster clerk engine 116, even amidst the removal of an encoded data object. In this manner, the tolerance for eventual consistency allows the bloom filters to be cached and strictly added to; due to a filter's probabilistic nature, a removed (dangling) file may still test as present, which merely defers its deletion.
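The add-only, cached behavior described above is safe precisely because of a bloom filter's one-sided error: a negative answer is always reliable, while a false positive merely delays a deletion. A minimal sketch follows (the hashing scheme and sizes are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """Add-only probabilistic set: no false negatives, occasional false
    positives, which in this setting only defer a deletion."""
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Derive num_hashes independent bit positions from salted SHA-256.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```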
Referring to FIG. 6, in accordance with example implementations, a given database node 110 may receive one or multiple queries 614 for processing.
The queries 614 may be, in accordance with example implementations, parsed by a query parser and optimizer 620 of the node 110. In general, the query parser and optimizer 620 may consult the global catalog 112 to determine the locations of objects (for example, in the shared storage 150) that are referenced by the queries 614. The query parser and optimizer 620 develops a corresponding query plan 630 for a given query 614, which is provided to an execution engine 634 of the node 110. The execution engine 634, in turn, causes a storage access layer 640 of the node 110 to access the shared storage 150 and provide corresponding data blocks 638 back to the execution engine 634 in response to the executed query plan 630.
In accordance with example implementations, the database node 110 may further include a write cache 670, which caches data records 660 that are received by the node 110 in association with corresponding data load operations. Moreover, a data load engine 674 of the node 110 may read data from the write cache 670 and rearrange the data into read optimized store (ROS) containers 650, which are provided to the storage access layer 640 for purposes of storing the ROS containers 650 in the appropriate segments of the shared storage 150.
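A simplified sketch of that load path follows (hypothetical structure; "read optimized" is modeled here simply as sorting the cached rows on a key and storing them column-wise):

```python
def build_ros_container(cached_rows, sort_key):
    """Rearrange row-oriented cached records into a read-optimized,
    column-oriented layout: rows sorted on a key, one vector per column."""
    if not cached_rows:
        return {}
    rows = sorted(cached_rows, key=lambda r: r[sort_key])
    return {name: [row[name] for row in rows] for name in rows[0]}

# Two cached records become two sorted column vectors.
cache = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
print(build_ros_container(cache, "id"))  # {'id': [1, 2], 'name': ['a', 'b']}
```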
In accordance with example implementations, the node 110 may include one or multiple physical hardware processors 680, such as one or multiple central processing units (CPUs), one or multiple CPU cores, and so forth. Moreover, the database node 110 may include a local memory 684. In general, the local memory 684 is a non-transitory memory that may be formed from, as examples, semiconductor storage devices, phase change storage devices, magnetic storage devices, memristor-based devices, a combination of storage devices associated with multiple storage technologies, and so forth. Regardless of its particular form, the memory 684 may store various data (data representing the global catalog 112, data representing parameters used by the components of the node 110, and so forth), as well as instructions that, when executed by the processor(s) 680, cause the processor(s) 680 to form one or multiple components of the node 110, such as, for example, the query parser and optimizer 620, the execution engine 634, the storage access layer 640, the data load engine 674, the garbage collector 114, the cluster clerk engine 116, and so forth.
In accordance with further example implementations, one or multiple components of the node 110 (such as the garbage collector 114, the cluster clerk engine 116, the execution engine 634, the query parser and optimizer 620, and so forth) may be formed from dedicated hardware that is constructed to perform one or multiple specific functions, such as a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), and so forth.
In accordance with further implementations, an apparatus 800 that is depicted in FIG. 8 may be used to implement the garbage collection techniques that are described herein.
Other implementations are contemplated, which are within the scope of the appended claims. For example, in accordance with further example implementations, the object space may be partitioned into multiple partitions, and each node may identify an MQV for each object space partition. In this manner, for a given object space partition, the node may communicate to the garbage collector the MQV for that partition. In other words, in accordance with example implementations, a given node, for each object partition, may asynchronously communicate to the garbage collector the global catalog version that was created or published before or at the same time as the earliest ongoing query that is being processed by the node and that involves an object in the object partition. Thus, in accordance with example implementations, each node may maintain a collection of <Partition, MQV> values, and each node may asynchronously communicate its <Partition, MQV> values to the garbage collector. The garbage collector, in turn, may determine the minimum MQV for each object partition and delete the objects stored in the shared storage that have version tags that are the same as or older than the minimum MQV for that partition. Partitioning the object space and using the MQVs in this manner allows the cleanup of objects in some partitions, even in the presence of long-running queries accessing other partitions.
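A sketch of this per-partition variant follows (hypothetical names; the comparison follows the "same or older" rule stated above, and deletion is again abstracted as a callback):

```python
def min_mqv_per_partition(reports):
    """reports: iterable of (node_id, partition, mqv) tuples that the
    nodes gossip to the garbage collector."""
    minima = {}
    for _node, partition, mqv in reports:
        minima[partition] = min(mqv, minima.get(partition, mqv))
    return minima

def purge_partitioned(objects, minima, delete):
    """Purge an object only when the minimum MQV of the object's own
    partition has moved past the object's version tag, so a long-running
    query in one partition does not block cleanup in the others."""
    for obj in objects:
        cutoff = minima.get(obj.partition)
        if cutoff is not None and obj.version_tag <= cutoff:
            delete(obj)
```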
While the present disclosure has been described with respect to a limited number of implementations, those skilled in the art, having the benefit of this disclosure, will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations.