A database system allows large volumes of data to be stored, managed and analyzed. Data records for a relational database system may be associated with tables. A table may include one or more rows, where each row may contain a set of related data (e.g., data related to a single entity). The data for the row may be arranged in a series of fields, or columns, where each column includes a particular type of data (e.g., type of characteristic of an entity). Processing nodes of the database system may process requests for transactions, such as queries, operations to add tables, operations to add or drop columns to/from tables, and so forth. The requested operations may be expressed in a specified format (a Structured Query Language (SQL) format, for example).
A database management system (DBMS), or “database system” may have at least three primary components that are relatively tightly tied together: a storage system (for storing and retrieving data, such as table data and metadata describing database objects, for example); a data processing engine (for processing queries and performing data load transactions, for example); and a language (a structured query language (SQL) for describing database transactions, such as queries, data load transactions, table creations transactions, and so forth).
A query is a request for information related to a database table or a combination of database tables. For purposes of processing a query to satisfy the request, the database system may materialize data from its storage system. For example, processing the query may involve the database system reading files (containing table data) and metadata (representing information contained in database tables). This means that the database processing system reading the files understands a file format that is associated with the files (e.g., the database system understands how the table data is stored in the files).
A particular database system may structure its query processing so that the processing is optimized for table data that is stored according to a particular file format. A relatively complex enterprise environment, however, may be associated with table data that is associated with a number of different file formats.
As a more specific example, a database system may be constructed to optimize its query processing based on the table data being associated with a file format that supports table projections. In general, a “table” refers to a particular dataset structure in which data values are arranged in rows and columns; and in general, a projection is a column-oriented view of a table and has a table-like structure. In this manner, a projection is a view containing one or multiple selected columns of a table. For example, a table may have columns A, B, C and D. A first projection for the table may be a projection that includes columns A and D; another projection may contain columns A and B; another projection may include columns A, B and D; and so forth. Read-intensive workloads may benefit from accessing table projections. For example, a given query may target a particular subset of table columns (i.e., target a projection of the table), thereby allowing a processing node to read the data for a selected subset of columns to process the query, as opposed to, for example, reading the data for entire rows of the table. In addition to being a particular subset of columns of the table, a projection may also have an associated ordering. For example, the rows of a projection may be ordered by data, employee number, and so forth, as examples.
As a more specific example, the database system may process the following query:
The data for a table (also referred to as a “data set” for the table) may be stored according to a particular file format, such as a row optimized storage (ROS) file format. In general, the ROS file format refers to a column-based file format, which is associated with a particular column and is associated with a particular table projection. As an example, for a table T, a database system may create and use the following projections: a first projection P that contains columns A and D of the table T; and a second projection Q that contains columns A, M and N of the table T. For projection P, the database system may store a first column file associated with the projection P and including the data for column A; and a second column file that is associated with the projection P and contains the data for column D. For the projection Q, the database system may store a third column file associated with the projection Q and including the data for column A; a fourth column file that is associated with the projection Q and contains the data for column M; and a fifth column file that is associated with the projection Q and contains the data for column N. Thus, there is a one-to-one correspondence between the columns of a given projection and the ROS files.
A query scan engine for the database system may be optimized for table projections, and as such, may not be constructed to process a query in an optimized manner when the data set is stored in files that are associated with a non-ROS-based file format. For example, the table data may be stored in a Parquet-based file format. In the Parquet-based file format, the data is stored in files, where each file contains multiple row groups of a table. Here, a “row group” refers to a set of data describing values for a set of rows (having multiple columns) of the table. Due to the lack of correspondence with table projections, the query scan engine that is optimized for processing data stored in ROS files may be unable to, in a straightforward manner, determine optimal query plans. Instead, for a given query, the database system may materialize data from all of the non-ROS files that may contain data to satisfy the query, apply a filter, to produce a filtered set of data, and then scan this filtered data for purposes of processing the query. Such an approach, however, may consume a significant amount of storage and processing resource.
In accordance with example implementations that are described herein, a database system includes a universal data scan interface, which is constructed to aid the processing of a query by a query processing engine by extracting a uniform set of information about a data set for a database table that is specified by a query, regardless of the particular file format that is associated with the data set. More specifically, in accordance with example implementations, the universal data scan interface has partition, container and column interfaces.
In general, the partition interface may be used by the query processing engine to provide a set of partition objects associated with column-based partitioning of the data set associated with the table specified by the query; and exclude partition objects for corresponding partitions that do not contain relevant data to satisfy the query. The container interface provides a set of data container objects, which correspond to data containers for the partition objects provided by the partition interface. The container interface may be used by the query engine to identify relevant data containers for providing the information requested by the query and correspondingly create scan handlers to materialize the data from the corresponding data containers.
More specifically, in accordance with example implementations, a factory interface of the universal data scan interface provides a list of partition objects for a table that is specified in a query. The list may include zero partition objects (e.g., the case for which data has not been loaded into a table), one or multiple candidate partition objects. Here, the “partition” refers to a column-based partitioning of the data set, and a candidate partition represents a partitioned set of data that may include data to satisfy the query. A partition is represented in the database system by a partition object. For a dataset that is stored in ROS files, a candidate partition object may correspond to a particular table projection (whose data may be stored in multiple ROS files); and for a dataset that stored in non-ROS files, such as Parquet files, a candidate partition object may correspond to a particular file directory (i.e., a directory containing any number of Parquet files). As described further herein, if column metadata (data describing minimum and maximum columns values and a column sort order, for example) describing the partition objects is available, the partition interface filters the candidate partition objects to exclude partition objects that are irrelevant to the query (i.e., filter out partition objects that do not include any values requested by the query).
The container interface of the universal data scan interface provides data container objects for the candidate partition objects that are provided by the partition interface. In this context, the data for a given partition may be stored in one or multiple data containers. For example, for a data set that is stored in ROS files, a partition may correspond to a table projection, and the data for the partition may be stored in a set of ROS files, which are the data containers. As another example, for a data set that is stored in Parquet files, a partition may correspond to a particular set of Parquet files, the data for the partition may be stored in one or multiple data containers, and one container may correspond to one row group per file in the partition directory. The container interface, in accordance with example implementations, may filter the candidate container objects based on column metadata to provide a filtered set of container objects; and the container interface may create a column interface for each container object of the filtered set.
The column interface is used by a column handler interface of the query processing engine to scan the corresponding container object for data values that satisfy the query of the predicate, regardless of the file format for the dataset for the table.
In accordance with example implementations, the database system may include one or multiple processing nodes that process database transactions (transactions associated with database queries, transactions associated with data load operations, and so forth) for purposes of accessing, analyzing, loading and generally managing data that is stored in a database store. In this context, a “processing node” refers to a physical machine, such as a physical machine that contains one or multiple hardware processors (central processing units (CPUs), CPU cores, and so forth). As examples, the processing node may be a personal computer, a workstation, a server, a rack-mounted computer, a special purpose computer, and so forth. A “transaction” refers to one or multiple operations, which are executed as a unit of work by the processing node. As further described herein, an initiator processing node may, for example, receive a query and determine a subset of processing nodes to process the query.
As a more specific example,
The database system 100 includes one or multiple processing nodes 110; and each processing node 110 may include one or multiple personal computers, work stations, servers, rack-mounted computers, special purpose computers, and so forth. Depending on the particular implementation, the processing nodes 110 may be located at the same geographical location or may be located at multiple geographical locations. Moreover, in accordance with example implementations, multiple processing nodes 110 may be rack-mounted computers, such that sets of the processing nodes 110 may be installed in the same rack. In accordance with example implementations, a given query may be processed by multiple processing nodes 110, as further described herein.
In accordance with example implementations, the processing nodes 110 may be coupled to a shared storage 160 of the database system 100 through network fabric (not shown in
The storage 160 is a “shared storage,” in that the storage 160 may be shared, or accessed, by multiple processing nodes 110. In accordance with example implementations, the shared storage 160 stores ROS files 164 and non-ROS files 166. In accordance with example implementations that are described herein, the non-ROS files 166 are Parquet files that are stored in a file directory structure, in which file paths are associated with different column partitions of a given table. Moreover, in accordance with example implementations, the Parquet file contains multiple row groups for a table; each row group is associated with a particular set of columns of the table and a particular group of rows for the table. Moreover, in accordance with example implementations, each row group may be considered a “data container.” For the ROS file format, a data container is a set of one or multiple ROS files, which represent a particular column for a particular projection of a database table.
In accordance with example implementations, database objects, such as tables, projections, columns, and so forth, may be associated with catalog objects 120. As examples, the catalog objects 120 may contain metadata 121 corresponding to tables, projections, columns and various tables represented by the ROS files 164, and information about the ROS files 164.
The catalog objects 120 may be stored in one or multiple catalogs, such as a catalog 123 that is illustrated in
The shared storage 160 may include one or multiple physical storage devices that store data using one or multiple storage technologies, such as semiconductor device-based storage, phase change memory-based storage, magnetic material-based storage, memristor-based storage, and so forth. Depending on the particular implementation, the storage devices of the shared storage 160 may be located at the same geographical location or may be located at multiple geographical locations.
In accordance with example implementations, a given processing node 110 may include a query processing engine 122 and a universal data scan interface 124. The query processing engine 122 may use the universal data scan interface 124 to process a given query 119 in a series of phases, including a global planning phase, (for the case in which the processing node 110 is the initiator node for the query), a local planning phase, and an execution phase. For this example, the processing node 110 may be, for example, a query initiator node, i.e., the node 110 that receives the query 119. For this case, the query processing engine 122 communicates with the universal data scan interface 124 to, in the global planning phase, determine which processing nodes 110 are to be involved in processing the query 119. In the local planning phase, each processing node 110 that is involved in the query determines out the actions needed for purposes of answering the node's part of the query. In the execution phase, each processing node 110 involved in the query processes one or multiple data containers to supply its part of the data that satisfies the query.
As further described herein, in accordance with example implementations, the query processing engine 122 may use the universal data scan interface 124 to analyze the dataset that is associated with the table(s) that are specified by the query 119 to identify data container objects (corresponding to data containers) to be scanned and to create column interfaces 127, which a column handler interface 123 uses to scan the corresponding data containers for data values that satisfy the query of the predicate, regardless of whether the dataset for the table that is specified in the query is associated with a ROS file format or a non-ROS file format. For the specific example implementations that are described herein, it is assumed that the non-ROS files are Parquet files. However, in accordance with further example implementations, the universal data scan interface 124 may identify the relevant data containers and create the appropriate scan handlers for data sets that are stored in files other than ROS or Parquet files.
In accordance with example implementations, the processing node 110 may include one or multiple physical hardware processors 134, such as one or multiple central processing units (CPUs), one or multiple CPU cores, and so forth. Moreover, the processing node 110 may include a local memory 138. In general, the local memory 138 is a non-transitory memory that may be formed from, as examples, semiconductor storage devices, phase change storage devices, magnetic storage devices, memristor-based devices, a combination of storage devices associated with multiple storage technologies, and so forth.
Regardless of its particular form, the memory 138 may store various data 146 (data representing metadata associated with the catalog objects 120, writesets representing the results of uncommitted changes to database objects, and so forth). The memory 138 may store machine executable instructions 142 that, when executed by one or multiple processors 134, cause the processor(s) 134 to form one or multiple components of the processing node 110, such as, for example, the query processing engine 122, the universal data scan interface 124, and so forth. In accordance with example implementations, the memory 138 may store machine executable instructions 142 that, when executed by the processor(s) 134, cause the processor(s) 134 to process queries; determine, or identify, one or multiple relevant partition objects associated with the processing of a query; identify relevant data container objects that are associated with the partition object(s); and assign scan handlers to scan the data containers corresponding to the identified data container objects to process the queries.
The following are example statements to create a database table and load data for the table, for the case in which the database table is associated with ROS files 164. First, a table called “foo” may be created using the following statement:
For the table foo, there will be data with the three indicated columns. The database system 100 stores a corresponding object (called the “foo object” herein and corresponds to the table foo) in the catalog 123.
Next, a projection (called the “foo_1” herein) for the table foo may be created by the following statement:
Another projection foo_2 may be created by the following statement:
When data is loaded into the table foo, the database system creates ROS files 164 containing each of the columns “c” and “a.” The data in these files 164 may be sorted according to column “c,” then column “a” as the secondary key. Data is distributed among the different database processes according to some hash of “c.” Moreover, an object “foo_2” is created in the catalog 123. The “foo” table object has a link to the “foo_2” projection object.
Data may then be loaded into the foo table using the following statement:
As an example, a Parquet table (called “parq”) may be created using the following statement:
Referring to
In accordance with example implementations, the partition interface 128 of the universal data interface 124 may, in general, filter the list of candidate partition objects to exclude irrelevant partition objects and provide a list of candidate container objects for each non-excluded partition object. More specifically, in accordance with example implementations, the partition interface 128 may include an output range method 214. The output range method 214, as depicted at reference numeral 216, may provide column metadata for partition objects. As an example, for non-ROS files 166, such as Parquet files, the output range method 214 may parse metadata from the partition directory paths. This metadata contains information about the columns in a particular partition object, such as the minimum column value, the maximum column value, and the sort order about each column. The partition interface 128 may further include, in accordance with example implementations, a prepare method 220, which, as depicted at reference numeral 230, provides a list of container objects for each non-filtered partition object. For ROS files 164, the prepare method 220 may, for example, derive the information from the catalog 123, as each storage container catalog object is a data container. For non-ROS files 166, such as Parquet files, the prepare method 220 may list files in a partition directory matching the requested file path. For each file, the prepare method 220 may identify all of the row groups within the Parquet file.
As also depicted in
The following sets forth example queries and the processing of these queries by the database system 100. The example queries are directed to the foo and parq tables.
The first example query is directed to the parq table:
Next, each of the processing nodes 110 that are involved in processing the query may perform the following actions. The query processing engine 122 calls the localize method 208 of the factory interface 125, which generates the list of partition objects that the processing node 110 is to process. The query processing engine 122 then builds the list of container objects by calling each partition object's prepare method 220 (of the partition interface 128). The prepare method 220 creates corresponding column interfaces 127, the column handler interface 123 asks each data container for its column interface 127 associated with “a,” and the column handler 123 invokes the column interface 127 to process the data.
As another example, the database system 100 may process the same query to a database table foo whose data is stored in files associated with the ROS-based file format:
Next, each processing node 110 that is involved in processing the query may perform the following actions. The query scan engine 122 calls the localize method 208, which provides the list of partition objects. In accordance with example implementations, each processing node 110 has just a single partition object, which represents all of the data. As such, there is a single partition. Subsequently, the query processing engine 122 asks the partition interface 128 for its list of container objects. In accordance with example implementations, the partitions interface 128 determines this by searching for all storage container objects in the catalog 123, which correspond to “foo_1.” A storage container object is created for each of these container objects. The column handler interface 123 then asks the container interface 126 to create a column interface 127 for each data container object to execute the query, and the column handler 123 invokes the column interface 127 to process the data.
As another example, the database system 100 may process the following example query to the parq table:
Next each processing node 110 that is involved in processing the query may perform the following actions. The query processing engine 122 calls the localize method 208, which generates a list of the partition objects that the processing node 110 are to process. The query processing engine 122 next builds the list of container objects by calling each partition object's prepare method 220. In the execution phase, the scan handler asks each container for its column interface associated with “a,” and query processing engine 122 invokes the column method 242 to process the data.
The database system 100 may process the same query above but directed to the foo table:
Next, each processing node 110 that is involved in processing the query may perform the following actions. The query processing engine 122 calls the localize method 208, which provides a list of partition objects. Because, as discussed above, each processing node 110 has just a single partition object, and as such represents all the data on node 110. The query processing engine 122 asks each partition interface 128 for its list of container objects. In accordance with example implementations, the partitions interface 128 may determine this by searching the catalog 123 for all storage containers, which correspond to “foo_1.” A container interface 126 is created for each storage container object in the catalog. Lastly, the column handler interface 123 calls the container interface 126 to generate column interfaces 127 and correspondingly process the data for the query.
As another example, the database system 100 may process the following query directed to the parq table:
Next, each processing node 110 that is involved in processing the query may perform the following actions. The query processing engine 122 calls the localize method 208, which generates a list of the partition objects that the processing node 110 will process. The query processing engine 122 asks each partition for its minimum and maximum values and discards the partitions whose bounds cannot satisfy the “a=10” predicate. The query processing engine 122 then builds the list of container objects by calling each remaining partition object's prepare method 220. The column handler interface 123 asks each container for its minimum and maximum values and discards containers whose bounds cannot satisfy “a=10.” The column handler interface 123 then asks each container for its column interface 127 associated with “a,” and the column interface 127 is invoked to process the data.
The following are examples of the database system's processing of queries directed to the above-described example foo table, where the foo table has a dataset that is stored in ROS-based files.
For the foo table, the database system may process the following example query:
Referring to
Next, each processing node 110 that is involved in processing the query may perform the following actions. The query processing engine 122 uses the plan method 204 to determine if the data is sorted in a way that may be useful for the queries. In accordance with example implementations, the plan method 204 selects the foo_1 projection, because its sort order is better suited for answering the query predicate. Subsequently, the query processing engine 122 calls the localize method 208, which provides a list of partition objects. Here, there is a single partition, as described above. The query processing engine 122 next asks the partition interface 128 for its minimum and maximum column values. In accordance with some implementations, there is no such information for projections, so the return value indicates there is no such metadata, and therefore, the result is ignored. The query processing engine 122 may then ask the partition interface 128 for its list of container objects. The partition interface 128 may then search the catalog 123 for all storage container objects corresponding to the foo_1 projection and create a storage container interface 126 for each of these storage container objects. Subsequently, the query processing engine 122 may ask the container interface 126 for its minimum and maximum values. Each storage container reads its minimum and maximum column values from the associated catalog object and reports back. Moreover, the minimum and maximum values from the catalog may be used to filter the container objects. Subsequently, the column handler interface 123 calls the container interface 126 to ask for the column interfaces 127 associated with “a.” The column handler interface 123 then invokes the container interface 126 using the column interfaces 127 to scan the data.
As a more specific example of the scanning of the Parquet files by the database system 100, the following query may be directed to a table called “orders”:
Referring to
To further determine which, if any, of these files contains data containers that satisfy the predicate of the query and accordingly includes and/or excludes the data containers, minimum value and maximum value metadata may be considered. More specifically, in accordance with some implementations, each of the files 166-1 and 166-2 contains metadata identifying minimum and maximum values for each column of data and sort order. As depicted in
Referring to
Referring to
Thus, due to the use of the universal data scan interface 124, individual data container objects (which correspond to corresponding data containers) are identified, which allows the passing of information of the scanning of one data container to aid in the scanning of another data container. As another example, in accordance with some implementations, multiple query scan handlers (associated with different data containers) may use sideways information passing (SIP) for processing an inner table join. In this manner, the information gained from the scanning from the inner table may be passed along through SIP for the scanning for the outer table.
Thus, referring to
Referring to
Referring to
While the present disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this disclosure, will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations.