In database query processing and similar tasks, complex query execution plans may include multiple operations such that the output of one operation is the input of the next operation. Such intermediate query results may be stored or pipelined, and may include a single data structure of individual results, or of multiple data structures each containing multiple records.
As an operation obtains the items in one of its input streams, it may group items by some criterion or predicate. One example is a SQL “distinct” query. Some summary information may be derived from the items in a group, e.g., a sum or an average. One example is a SQL “group by” query. An operation with multiple inputs, for example two inputs, may match up items from the two inputs based on some criterion or predicate. One example may include a SQL “join” query, including the variants of “outer joins.”
Joins may also be derived from other types of query formulations, e.g., semi joins from “in” and “not in” queries in SQL. Set operations such as intersections may be requested explicitly, e.g., using SQL “intersect” syntax, or may be employed without explicit request, e.g., when intersecting lists of items, each list capturing the result set for a component of a conjunction.
An algebra of data manipulation operations may define the semantics of the operations and permit algebraic manipulations or rewrites of expressions in the algebra. Thus, it may be possible and beneficial to rewrite the original expression. One type of optimization may be to reduce data volumes early during expression evaluation. In other words, grouping operations that replace entire groups of records with a single summary record may be performed as early as possible, with join operations afterwards. As a result, aggregation operations on join inputs are not uncommon, in particular after optimization has been applied to the original request. Optimization may be automatic or by human effort.
In addition to database query processing, other systems employ multiple operations to satisfy entire requests, pass intermediate results between operations, and perform grouping and join operations or very similar operations in which items are matched based on a criterion or predicate. One example includes “map-reduce” data processing for “big data” in “cloud computing.”
In the sequel, database query processing may stand for any processing graph in which operations pass intermediate results as streams of information items, aggregation may stand for any operation grouping items from one input, and a join with two inputs may stand for any operation matching items from two or more inputs. The user request, database query, or overall problem may be of sufficient complexity that at least one join operation is performed and that aggregation is performed on at least two inputs. Thus separate hash tables and memory allocations may be needed for the join and aggregation operations, which can add additional expenses to a system and delay query processing.
The embodiments are described in detail in the following description with reference to the following figures.
For simplicity and illustrative purposes, the principles of the embodiments are described by referring mainly to examples thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It is apparent that the embodiments may be practiced without limitation to all the specific details. Also, the embodiments may be used together in various combinations.
Data and information processing systems may match records from a single dataset, for example for grouped aggregation or for duplicate removal, or from multiple datasets, for example for join or intersection operations. Processes based on hash partitioning and on hash tables, for example hash join and hash aggregation, can integrate single-input operations and multi-input operations beyond previously accepted limitations.
If all inputs are sorted on their match attributes, order-based processes tend to be efficient. Aggregation can be performed immediately on the sorted inputs and the subsequent join operations can be realized with a merging process. In the sequel, examples of cases may include cases in which none or only some of the inputs are appropriately sorted.
If some or all inputs are not sorted, hash-based processes tend to be efficient. In those processes, a hash value may be computed for each input item from the input item's matching attribute. This hash value may be used both for hash partitioning (for large inputs) and for insertion and search in an in-memory hash table (for small inputs and for partitions). Partitioning effort may be reduced or avoided if the required hash tables are few and small. In the sequel, one consideration may be to reduce the count and sizes of the required hash tables.
According to an embodiment, a hash join and hash aggregation integration system (hereinafter “hash integration system”) integrates hash join and hash aggregation, and allows for hash aggregation to be applied to one or both join inputs in hash join. The hash integration system further allows for role reversal to be applied to any hash join. The role reversal may also be applied to a hash join including aggregation on one or both inputs.
The hash integration system may include appropriate information from the probe input within records retained in a hash table that is central to hash join and hash aggregation, and may perform final calculations when an output record is produced from a record in the hash table. The hash table constructed with records from the build input may contain additional fields (not present or derived from the build input) in order to accommodate aggregated information from records of the probe input. The fields may depend on the query at hand. For example, the fields may be the same as the fields in an aggregation of the input in a query execution plan with a join and two separate aggregation operations. These fields may include the grouping values and the partial aggregation values (e.g., a sum and a count). Final calculations may be performed before output is produced. For example, the sum and count may be divided to obtain an average. Each record of the probe input may match with multiple records of the build input and may therefore contribute to the aggregation calculations in multiple records in the hash table.
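The integrated hash table described above can be illustrated with a minimal sketch. This is a hypothetical, simplified illustration only: the function and field names (`integrated_hash_join`, `build_rows`, `sum`, `count`) are illustrative, the build input is assumed to have one record per key, and an inner join is assumed.

```python
def integrated_hash_join(build_rows, probe_rows):
    """Hash join with aggregation on the probe input, using one hash table.

    build_rows and probe_rows are iterables of (key, value) pairs; the
    build input is assumed to have one record per key.
    """
    # Build phase: each hash table entry carries extra fields (sum, count)
    # not present in the build input, reserved for aggregates over
    # matching probe records.
    table = {}
    for key, build_value in build_rows:
        table[key] = {"build_value": build_value, "sum": 0, "count": 0}

    # Probe phase: instead of emitting one output row per match, each
    # probe record only updates the aggregate fields of its matching entry.
    for key, probe_value in probe_rows:
        entry = table.get(key)
        if entry is not None:
            entry["sum"] += probe_value
            entry["count"] += 1

    # Final calculations when output records are produced: the sum and
    # count are divided to obtain an average.
    for key, entry in table.items():
        if entry["count"] > 0:
            yield (key, entry["build_value"], entry["sum"] / entry["count"])
```

In this sketch, a probe record with no matching build entry is simply dropped, as appropriate for an inner join; an outer-join variant would handle that case differently.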
The hash integration system provides improved query processing performance by integrating hash join and hash aggregation. For the hash integration system, query execution run-time can apply role reversal if warranted even in join operations with integrated aggregation. Thus the hash integration system provides a single operation that can accomplish computations that previously required multiple operations. The reduction to a single operation reduces effort for data transfer as well as effort and memory for hash tables.
Various aspects related to the system 100 are described before proceeding with a further description of the foregoing modules.
Operation of the modules and components of the system 100 is described.
Referring to
As briefly discussed above, an example of an application of the system 100 for a query related to a number of customers for whom more merchandise has been billed than ordered is described for facilitating a description of the modules 101, 102 and 106.
For a query related to a number of customers for whom more merchandise has been billed than ordered, first, all orders may be evaluated. The order volume per customer may be aggregated. For example, referring to
As described above, in order to perform aggregation on the probe input 108, or aggregation on both the build and probe inputs 107, 108, respectively, of the hash join module 101, aggregation may be performed by the hash aggregation module 102. For a given query 103, the system 100 may include appropriate information from the probe input 108 within records retained in the integrated hash table that is central to the hash join and hash aggregation modules, 101, 102, respectively. In this regard, the records in the integrated hash table may include aggregates from the probe input. For the foregoing example, the records may include the sum of invoice amounts.
Final calculations may be performed by the system 100 when an output record is produced from a record in the hash table. For example, assuming that the foregoing example query is modified for finding customers where the average invoice amount is higher than the average order amount, instead of only adding amounts as in the foregoing example, for every new invoice and order, the respective sum may be incremented by the amount and the respective count may be incremented by one. After consuming the two inputs, the resulting integrated hash table may include records including the following five fields: customer ID, total invoice amount, count of invoices, total order amount and count of orders. The two averages related to invoice amount and order amount may be obtained by the respective totals divided by the counts, and compared accordingly. Thus the foregoing final calculations may include computation of averages from the sum and count values.
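The five-field intermediate record described above can be sketched as follows. The sketch is hypothetical: the function name and the (customer_id, amount) tuple layout are illustrative, and an inner join is assumed.

```python
def avg_compare(orders, invoices):
    """Find customers whose average invoice amount exceeds their average
    order amount, using a single integrated hash table.

    Each hash table record carries the five fields described above:
    the customer ID (the key) plus order sum/count and invoice sum/count.
    """
    # key: customer_id -> [order_sum, order_count, invoice_sum, invoice_count]
    table = {}
    for cid, amount in orders:
        rec = table.setdefault(cid, [0, 0, 0, 0])
        rec[0] += amount   # total order amount
        rec[1] += 1        # count of orders
    for cid, amount in invoices:
        rec = table.get(cid)
        if rec is not None:   # inner join: unmatched invoices are dropped
            rec[2] += amount  # total invoice amount
            rec[3] += 1       # count of invoices
    # Final calculations: divide totals by counts and compare the averages.
    result = []
    for cid, (osum, ocnt, isum, icnt) in table.items():
        if icnt and isum / icnt > osum / ocnt:
            result.append(cid)
    return result
```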
The integrated hash table constructed with records from the build input 107 may contain additional fields (not present or derived from the build input 107) in order to accommodate aggregated information from records of the probe input 108. The additional fields may accommodate the aggregation of records from the probe input 108. For example, in the foregoing example for a query related to the number of customers for who more merchandise has been billed then ordered, if the build input 107 is the orders and the probe input 108 is the invoices, in order to account for aggregation of the build and probe inputs 107, 108, appropriate fields would be used in those records. The fields may depend on the query 103 at hand. For example, the field may be unique to each query. For example, instead of querying invoices and orders as described above, if the particular query is related to internet users and their click-stream behavior, then the fields and input tables would relate to the query at hand. For example, the fields may be the same as the fields in an aggregation of the input in a query execution plan with a join and two separate aggregation operations. For the foregoing example of the query related to the number of customers for who more merchandise has been billed then ordered, without the implementation of the integrated hash table via the system 100, three separate operations each including a hash table may include a hash table for aggregation on the orders, a hash table for aggregation on the invoices, and a hash table for performing the join. For a specific query, the appropriate field for each of these three operations and their hash tables may be evaluated. These fields for the three individual operations and respective hash tables may be used in the integrated hash table for the system 100. These fields may include the grouping values and the partial aggregation values (e.g., a sum and a count). 
For example, as an unsorted input is consumed, at any given time, a partial aggregation may be performed. For the foregoing example of the query related to the number of customers for whom more merchandise has been billed than ordered, for a particular customer, if for example, 7 out of 12 invoices have been evaluated, then the sum may include the sum of 7 invoices and the count may be incremented to 7. When the entire input is consumed, all 12 invoices may be evaluated and the count may be incremented to 12. Thus the partial aggregation values may be updated each time a new input record is seen. When all inputs are consumed, calculations may be performed on the final partial aggregation record.
Final calculations may be performed before output is produced. For example, the sum and count may be divided to obtain an average. For the foregoing example of the query related to the number of customers for whom more merchandise has been billed than ordered, in order to determine the individual orders where the order volume is larger than the average invoice, on the build input (the orders), no aggregation may be performed (i.e. only individual orders are evaluated and groups are not collapsed into single summaries). On the probe input (the invoices), groups may be collapsed to obtain an average invoice. In order to process such a query with a single operation and a single hash table, the build input (the orders) may be consumed, and every order record may be inserted into the hash table. Once all records are inserted into the hash table, the probe input records may be consumed. For every probe input record with an individual invoice, an average invoice amount may be computed. For example, if a customer has submitted three orders, the average invoice amount may be computed for that customer in three places so that each of these orders in the hash table may be compared with the average invoice. Thus for a customer with three such orders, three order records may be placed in a hash table, an individual invoice record may be consumed, and in each of the three records in the hash table, the total invoice amount may be computed and the count of the invoices may be recorded. Upon completion, the three records may include the fields customer ID, individual order volume, total invoice volume, and invoice count. The total invoice volume may be divided by the invoice count to obtain an average invoice amount, and the average invoice amount may be compared to the individual order total. Based on the comparison, it can be determined if the order was or was not larger than the average invoice.
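The variant above, in which individual order records are kept while invoices are aggregated, can be sketched as follows. Again this is an illustrative sketch only; the names are hypothetical, and each invoice record contributes to every matching order record, as described above.

```python
from collections import defaultdict

def orders_above_avg_invoice(orders, invoices):
    """Yield (customer_id, order_amount) for orders larger than the
    customer's average invoice, using a single hash table.

    No aggregation is performed on the build input (orders): multiple
    order records per customer each carry their own invoice aggregates.
    """
    # customer_id -> list of [order_amount, invoice_sum, invoice_count]
    table = defaultdict(list)
    for cid, amount in orders:
        table[cid].append([amount, 0, 0])

    # Each probe (invoice) record may match multiple build records and
    # therefore contributes to the aggregates in multiple table entries.
    for cid, amount in invoices:
        for rec in table.get(cid, []):
            rec[1] += amount
            rec[2] += 1

    # Final calculations: divide total invoice volume by invoice count
    # and compare the average against each individual order.
    for cid, recs in table.items():
        for order_amount, inv_sum, inv_count in recs:
            if inv_count and order_amount > inv_sum / inv_count:
                yield (cid, order_amount)
```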
Each record of the probe input 108 may match with multiple records of the build input 107 and may therefore contribute to the aggregation calculations in multiple records in the hash table. Thus in order to perform aggregation on the probe input 108, the system 100 may include the integrated hash table including records that have sums from both inputs.
The integrated hash table used with the system 100 thus integrates the aggregation at the probe input 108, or the two aggregations at the build and probe inputs 107, 108, and a subsequent join at the hash join module 101. Alternatively, instead of two aggregations at the build and probe inputs 107, 108, the system 100 may include aggregations from two inputs, and possibly multiple aggregations from each input. For example, for the foregoing example, a query may compute not only the average invoice amount but also the minimum and maximum invoice amounts. The integrated hash table may contain records with fields for each such aggregate. For example, each record in the integrated hash table may compute two aggregates. Thus for the foregoing example of the query related to the number of customers for whom more merchandise has been billed than ordered, the records in the integrated hash table may contain the customer ID, the total order amount and total invoice amount. As records from the two inputs are consumed, the order amount and invoice amount may be incremented such that when all records from the inputs have been consumed, the totals may be compared as needed. More generally, for the probe input 108, or for each of the build and probe inputs 107, 108, an aggregation may be performed, with the results being joined by the hash join module 101. The aggregation and join operations may be focused on the same column set (i.e. in the foregoing example, the customer ID). The integrated hash table that integrates the two aggregations and join operation may contain intermediate records that contain all the grouping columns (i.e. in the foregoing example, the customer ID), all the aggregation columns for the build input 107 (i.e. in the foregoing example, the total orders), and all the aggregation columns for the probe input 108 (i.e. in the foregoing example, the total invoices). Thus, generally, for an operation involving multiple inputs (e.g. 1 to n inputs generally), the integrated hash table may likewise include a single record with aggregation columns from the n inputs.
Thus based on the foregoing example, aggregation may be performed on the probe input 108, or on both the build and probe inputs 107, 108, respectively of the hash join module 101. As discussed above, instead of three operations (two aggregation operations and one join operation, with role reversal supported in this join) or two operations (one aggregation and one join with integrated aggregation for the build input, with role reversal inhibited), for the system 100, a single operation may be performed (i.e. a join with aggregation on just the probe input or on both inputs) and role reversal in this join may be supported in case of mistaken cardinality estimation during compile-time query optimization. In another example, for a query searching for customers for whom the oldest invoice pre-dates the oldest order, if cardinality estimation during compile-time query optimization anticipated fewer customers with orders than customers with invoices, yet run-time observation during query execution proves this estimate wrong (perhaps even for some partitions), then role reversal can reduce the memory allocation of the operation and also reduce the number of partitioning levels.
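The run-time role-reversal decision described above can be sketched in a minimal form. The sketch assumes both inputs are materialized so their sizes can be compared at run time; the function name and the simple size comparison are illustrative, not a definitive policy.

```python
def choose_roles(input_a, input_b):
    """Assign build and probe roles at run time.

    If run-time observation shows the designated build input is actually
    the larger one (i.e. the compile-time cardinality estimate was
    mistaken), the roles are reversed so the smaller input builds the
    hash table, reducing its memory allocation.
    """
    rows_a, rows_b = list(input_a), list(input_b)
    if len(rows_a) <= len(rows_b):
        return rows_a, rows_b  # a builds the hash table, b probes it
    return rows_b, rows_a      # roles reversed
```

The same decision may also be made per partition, as noted above, rather than once for the whole operation.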
The system 100 thus provides for database query processing, for example, for ‘map-reduce’ data processing. The integrated hash table generated by the hash table generation module 106 may be used as the associative data structure. Other forms of in-memory search trees may also be used as the specific associative data structure. The information items may be individual records or rows in a table. The information items may also be complex objects represented in any computer-appropriate format. The information items may be packaged into sets in the inputs and outputs of the system 100, or processed one at a time within the system 100. The system 100 may also use a grouping operation such as a ‘group by’ query in ANSI SQL or a duplicate removal operation. The grouping operation may also cache parameters for a nested query or for a function, or cache parameters and results for a nested query or for a function. The join operation may also include any form of outer join in ANSI SQL, any form of semi join, or any form of binary set operation such as intersection. The intermediate result items may include multiple input items, and/or capture summary information about multiple input items. Multiple such operations may cooperate with each other and with other operations in a query execution plan or data flow plan. Further, items in the hash table may capture or summarize input items from multiple inputs. As discussed above, the sequence of inputs may be mutable, also known as role reversal in the special case of two inputs.
The system 100 thus provides for deep integration of aggregation and join operations as well as intermediate results records in the hash table that contain intermediate fields for multiple aggregations, e.g., sums and counts as appropriate for two separate average calculations. The inputs that require aggregation are consumed first. For each item, a match is sought in the hash table. If one is found, an appropriate aggregation action is performed. The appropriate aggregation action depends on the input from which the item originated. The intermediate record in the hash table may have multiple aggregation fields, each one associated with one of the inputs. For example, if averages are needed for two (or generally N) inputs, each intermediate record has two (N) sums (one for each input) and two (N) counts (one for each input).
If no match is found, the appropriate action depends on the join condition among the inputs and the processing sequence for inputs. If no match is found while the first input is consumed, a new record is inserted into the hash table. If no match is found while the second input is consumed, no match implies that the current input record fails the join predicate. The appropriate action depends on the type of join, i.e., inner join versus outer joins.
After consuming the inputs using aggregation, other join inputs may be consumed. The hash table may be probed for matches and actions may be taken as appropriate for the type of join and the complete join predicate, including predicates on the fields computed by aggregation.
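The general flow of the preceding three paragraphs can be sketched as follows: inputs requiring aggregation are consumed first, each intermediate record carries per-input aggregate fields, and a remaining join input then probes the table. The sketch assumes an inner join and sum/count (average) aggregates; all names are illustrative.

```python
def integrated_operation(agg_inputs, probe_input):
    """Consume N aggregation inputs into one hash table, then probe.

    agg_inputs is a list of N iterables of (key, value) pairs; each
    intermediate record carries N sums and N counts, one pair per input.
    probe_input is a further join input consumed without aggregation.
    """
    n = len(agg_inputs)
    table = {}
    for i, rows in enumerate(agg_inputs):
        for key, value in rows:
            rec = table.get(key)
            if rec is None:
                if i == 0:
                    # No match while the first input is consumed:
                    # insert a new record into the hash table.
                    rec = table[key] = {"sums": [0] * n, "counts": [0] * n}
                else:
                    # No match on a later input: the record fails the
                    # (inner) join predicate and is dropped.
                    continue
            # The aggregation action depends on the originating input.
            rec["sums"][i] += value
            rec["counts"][i] += 1

    # Remaining join input probes the table; final per-input averages
    # are computed as output records are produced.
    for key, value in probe_input:
        rec = table.get(key)
        if rec and all(rec["counts"]):
            avgs = [s / c for s, c in zip(rec["sums"], rec["counts"])]
            yield (key, value, avgs)
```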
At block 301, the system 100 may receive the query 103 and ascertain the requirements of the query. For example, the user 104 may present the query 103 pertaining to the data 105 to the system 100.
At block 302, based on the query, the system 100 may perform aggregation on the probe input 108, or aggregation on both the build and probe inputs 107, 108, respectively, of the hash join module 101, by the hash aggregation module 102, and generate the integrated hash table by the hash table generation module 106. For a given query 103, the system 100 may include appropriate information from the probe input 108 within records retained in the integrated hash table that is central to the hash join and hash aggregation modules, 101, 102, respectively. In this regard, the records in the integrated hash table may include aggregates from the probe input. The integrated hash table used with the system 100 thus integrates the two aggregations at the build and probe inputs 107, 108 (assuming aggregation is performed on both the build and probe inputs 107, 108), and a subsequent join at the hash join module 101. More generally, for just the probe input 108 or for each of the build and probe inputs 107, 108, an aggregation may be performed, with the results being joined by the hash join module 101. The aggregation and join operations may be focused on the same column set. As described above, the integrated hash table that integrates the two aggregations and join operation may contain intermediate records that contain all the grouping columns, all the aggregation columns for the build input 107, and all the aggregation columns for the probe input 108. Thus, generally, for an operation involving multiple inputs (e.g. 1 to n inputs generally), the integrated hash table may likewise include a single record with aggregation columns from the n inputs.
At block 303, the system 100 may generate the query response 109 based on the output of the hash join module 101.
The computer system 400 includes a processor 402 that may implement or execute machine readable instructions performing some or all of the methods, functions and other processes described herein. Commands and data from the processor 402 are communicated over a communication bus 404. The computer system 400 also includes a main memory 406, such as a random access memory (RAM), where the machine readable instructions and data for the processor 402 may reside during runtime, and a secondary data storage 408, which may be non-volatile and stores machine readable instructions and data. The memory and data storage are examples of computer readable mediums.
The computer system 400 may include an I/O device 410, such as a keyboard, a mouse, a display, etc. The computer system 400 may include a network interface 412 for connecting to a network. Other known electronic components may be added or substituted in the computer system 400.
While the embodiments have been described with reference to examples, various modifications to the described embodiments may be made without departing from the scope of the claimed embodiments.