Integrating analytical functions (AFs) within Structured Query Language (SQL) queries is becoming the norm of most emerging applications. Such integration poses new challenges to the query optimizer because these AFs are typically black boxes with no known costs or characteristics, and hence many optimization and transformation rules cannot kick in. The roots of this problem represent a classical challenge to the database systems because the same challenges apply to the user-defined-functions (UDFs) in general. However, the problem is becoming more intense and profound because: (1) current AFs are far more complex and expensive than classical UDFs, (2) many AFs are now expressed as table functions (or table operators) that can appear in the FROM clause and get involved in complex join operations, and (3) unlike the standard in database (in-DB) UDFs, many AFs are now provided by external engines, e.g., ASTER®, where the ASTER trademark is owned by Teradata US, Inc., TENSORFLOW®, where the TENSORFLOW trademark is owned by Google LLC, and SPARK™, where the SPARK trademark is owned by the Apache Software Foundation, and hence their execution involves high data transfer and network overheads.
Optimizing the execution order between AFs and joins in SQL queries is a challenge.
In one aspect, a method for optimizing execution of a query invoking an analytical function (AF), wherein the query includes a join operation between an AF table and an AuxiliaryTable, includes determining that the AF includes a plurality of AF properties. The method includes inferring query-level properties about the query. The method includes determining from the plurality of AF properties and query-level properties to change an order of the join operation.
Implementations may include one or more of the following. Changing the order of the join operation may include removing a join with the AuxiliaryTable from the query, joining the AuxiliaryTable with the AF table inside an ON clause of the AF, moving the join operation along with any other condition on AuxiliaryTable to a join within the AF, and adjusting a list of columns projected from the AF to include any columns from the AuxiliaryTable that are required for a list of columns projected from the query. Changing the order of the join operation may include keeping a join with the AuxiliaryTable as is in the query, joining the AuxiliaryTable with the AF table inside the ON clause of the AF, and copying the join condition along with any other condition on the AuxiliaryTable to the join within the AF. The AF properties may include “surplus”, which specifies whether the AF can accept additional columns beyond columns specified in an AF input table (FuncTable) and a behavior of the AF over any accepted additional columns, “rowIndependence”, which specifies whether the AF operates on each row independent of the others, and, for a single input row, that the AF produces zero output rows or one or more rows having the same values as a subset of the input row values, and/or “partitionIndependence”, which specifies whether the AF operates on each partition independent of the others, and, for a single input partition, that the AF produces zero output rows or one or more rows having the same values as rows in a subset of the partitioning columns values of the input partition. The query-level properties may include “uniqueColJoin”, which specifies whether the join operation between the AF table and the Auxiliary table involves a unique column in the Auxiliary table, and/or “projectionContainment”, which specifies whether a projection list for the query (QProjList) is contained in a projection list for the AF (FprojList). Determining that the plurality of AF properties includes a plurality of AF properties may include determining that the AF will accept more columns in addition beyond columns specified in an AF input table and that those columns will be returned from the AF, and the AF operates on each record independently of any other record. Inferring query-level properties about the query may include determining that the join operation between the AF table and the AuxiliaryTable involves a unique column. Determining that the plurality of AF properties includes a plurality of AF properties may include determining that the AF will accept more columns in addition beyond columns specified in an AF input table and that those columns will be returned from the AF, and the AF operates on each partition of a plurality of partitions independently of any other partition in the plurality of partitions. Inferring query-level properties about the query may include determining that the join operation between the AF table and the AuxiliaryTable involves a unique column and that the unique column is a partitioning column. Determining that the plurality of AF properties includes a plurality of AF properties may include determining that the AF will accept more columns in addition beyond columns specified in an AF input table and that those columns will be returned from the AF, and the AF operates on each record independently of any other record. Inferring query-level properties about the query may include determining that the join operation between the AF table and the AuxiliaryTable does not involve a unique column. Determining that the plurality of AF properties includes a plurality of AF properties may include determining that the AF will ignore columns beyond columns specified in an AF input table, and the AF operates on each record independently of any other record. Inferring query-level properties about the query may include determining that the query projects some columns from the AuxiliaryTable column. Determining that the plurality of AF properties includes a plurality of AF properties may include determining that the AF will ignore any columns in addition beyond columns specified in an AF input table, and the AF operates on each partition of a plurality of partitions independently of any other partition in the plurality of partitions. Inferring query-level properties about the query may include determining that the join operation between the AF table and the AuxiliaryTable involves a unique column and that the unique column is a partitioning column.
In one aspect, a non-transitory computer-readable tangible medium, on which is recorded a computer program, the computer program comprising executable instructions, that, when executed, performs a method including determining that the AF includes a plurality of AF properties. The method includes inferring query-level properties about the query. The method includes determining from the plurality of AF properties and query-level properties to change an order of the join operation.
The following detailed description illustrates embodiments of the present disclosure. These embodiments are described in sufficient detail to enable a person of ordinary skill in the art to practice these embodiments without undue experimentation. It should be understood, however, that the embodiments and examples described herein are given by way of illustration only, and not by way of limitation. Various substitutions, modifications, additions, and rearrangements may be made that remain potential applications of the disclosed techniques. Therefore, the description that follows is not to be taken as limiting on the scope of the appended claims. In particular, an element associated with a particular embodiment should not be limited to association with that particular embodiment but should be assumed to be capable of association with any embodiment discussed herein.
This disclosure describes techniques for optimizing the execution order between joins and AFs. A metadata-driven approach for capturing important AF properties that the query optimizer can leverage to push (or pull) joins around the AFs is proposed. Various scenarios under which the query re-writing is permissible are described. Different re-write strategies to handle different scenarios are described. The proposed class of optimizations can have a significant impact on both the queries' performance and the systems' resources, e.g., network and spool utilizations.
Join pushing is a well-known optimization technique used in many cases to benefit performance. For example, join pushing has been used to push join into views which have been created by UNION ALL operator. There is also some work in literature focusing on pushing joins before or after the GROUP BY operator according to cost models (memSQL). There are other optimizations that focus on optimizing black box functions within SQL queries. Most of this work addresses building profiles and cost models to better estimate the AF's cost given a specific input. Other work addresses the well-known predicate push optimization in the cases where the predicate involves AFs. In these situations, the predicate evaluation can be expensive and it may not be cost effective to push the predicate down in the query plan through other operators like joins. However, there is no previous work that addresses the execution order between joins and AFs.
An example of the optimizations enabled by the technique is now provided, along with an introduction to some terminology and a presentation of key properties.
Assume the query presented in
Some basic terminology used in this disclosure include (refer to
Properties for the AF, shown below and provided by the AF's developer, enable the proposed optimizations:
The semantics of these properties are shown in Table 1 below. Note that these properties may be a subset of the metadata information that form a “function descriptor,” described in co-pending U.S. patent application Ser. No. 16/704,802, filed on Dec. 5, 2019, entitled “Enabling Cross-Platform Query Optimization via Expressive Markup Language,” incorporated by reference.
The optimization techniques described in this disclosure are not applicable when the rowlndependence and partitionlndependence properties are both “Yes”.
At the query level, some additional properties, shown below, may be inferred to help check the validity of the proposed optimizations. These properties are inferred at query time from the database dictionary tables and the query itself. The properties include:
“uniqueColJoin”: “Yes” | “No”
“ProjectionContainment”: “Yes” | “No”
The semantics of these properties are defined in Table 2.
For the join push optimization, i.e., pushing a join operation from outside of the AF to be inside the AF,
Still referring to
The re-write steps in Strategy I include (refer to
The re-write steps in Strategy II include (refer to
Join pulling, which is the reverse of join pushing, relies on the same set of function-level and query-level properties. The rationale used for the different combinations of join pushing also applies to the join pulling re-writes and is not repeated.
The performance of the query presented in
In this experimental evaluation, as can be seen in
Further examples consistent with the present teaching are set out in the following numbered clauses.
Clause 1. A method for optimizing execution of a query invoking an analytical function (AF), wherein the query includes a join operation between an AF table and an AuxiliaryTable, the method comprising:
Clause 2. The method of clause 1 wherein changing the order of the join operation comprises:
Clause 3. The method of any preceding clause wherein changing the order of the join operation comprises;
Clause 4. The method of any preceding clause wherein the AF properties comprise:
Clause 5. The method of any preceding clause wherein the query-level properties comprise:
Clause 6. The method of any preceding clause wherein:
Clause 7. The method of any preceding clause wherein:
Clause 8. The method of any preceding clause wherein:
Clause 9. The method of any preceding clause wherein:
Clause 10. The method of any preceding clause wherein:
Clause 11. A non-transitory computer-readable tangible medium, on which is recorded a computer program, the computer program comprising executable instructions, that, when executed, perform a method comprising:
Clause 12. The computer program of clause 11 wherein changing the order of the join operation comprises:
Clause 13. The computer program of any of clauses 11-12 wherein changing the order of the join operation comprises;
Clause 14. The computer program of any of clauses 11-13 wherein the AF properties comprise:
Clause 15. The computer program of any of clauses 11-14 wherein the query-level properties comprise:
Clause 16. The computer program of any of clauses 11-15 wherein:
Clause 17. The computer program of any of clauses 11-16 wherein:
Clause 18. The computer program of any of clauses 11-17 wherein:
Clause 19. The computer program of any of clauses 11-18 wherein:
Clause 20. The computer program of any of clauses 11-19 wherein:
The operations of the flow diagrams are described with references to the systems/apparatus shown in the block diagrams. However, it should be understood that the operations of the flow diagrams could be performed by embodiments of systems and apparatus other than those discussed with reference to the block diagrams, and embodiments discussed with reference to the systems/apparatus could perform operations different than those discussed with reference to the flow diagrams.
The word “coupled” herein means a direct connection or an indirect connection.
The text above describes one or more specific embodiments of a broader invention. The invention also is carried out in a variety of alternate embodiments and thus is not limited to those described here. The foregoing description of an embodiment of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.
This application claims the benefit of U.S. Provisional Patent Application No. 62/777,308, entitled “Optimized Analytics at Scale: Optimizing the Execution Order between Analytical Functions and Joins in SQL Queries,” filed on Dec. 18, 2018, which is incorporated by reference in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
7620615 | Milby | Nov 2009 | B1 |
20180357279 | Park | Dec 2018 | A1 |
20210019315 | Fender | Jan 2021 | A1 |
20210124743 | Choi | Apr 2021 | A1 |
Entry |
---|
WolframMathWorld; “Analytic Function—from Wolfram MathWorld”; https://web.archive.org/web/20181003214205/https://mathworld.wolfram.com/AnalyticFunction.html; Oct. 2, 2018 (Year: 2018). |
Number | Date | Country | |
---|---|---|---|
20200183935 A1 | Jun 2020 | US |
Number | Date | Country | |
---|---|---|---|
62777308 | Dec 2018 | US |