Typical database systems receive queries to retrieve information from data sources managed by the database system. In a relational database system these data sources are typically organized into a series of tables. Queries are received in a standard format such as SQL.
Most databases use an optimizer that attempts to generate an optimal query execution plan. When a query is issued with a USING statement the optimizer in most cases ignores the data values associated with the USING statement. The optimizer assumes default selectivity and produces a generic conservative plan. Because it is a generic plan it is cached and is reused from the cache even for different data values associated with the USING statement.
There are many problems with such generic plans. Most of these problems result in sub optimal plans. For example, access tends to be a full file scan instead of indexed access. Another problem is that joins tend to be a sort merge instead of a nested loop.
If the optimizer were to take into consideration data values associated with a USING statement in a query then the optimizer can generate more aggressive and optimal plans for the query. The issue with taking data values associated with a USING statement in plan generation is that the same plan cannot be reused unless the query is reissued with the same exact data values associated with the USING statement. Therefore there is little point in the caching of such plans.
Described below is a method of selecting for use a stored execution plan for a dynamic SQL query within a database system. Respective selectivity values are maintained that are associated with one or more predicates in the dynamic SQL query for respective historical data values. Respective confidence level values are maintained that are associated with one or more of the selectivity values. One or more data values are received with which to execute the dynamic SQL query. Respective selectivity values are calculated for one or more of the predicates in the dynamic SQL query for the received data value(s). The stored selectivity values are compared with respective corresponding calculated selectivity values. A stored execution plan is selected for use on detecting substantial equality between the respective pairs of compared values.
Other types of database systems, such as object-relational database management systems (ORDMS) or those built on symmetric multi-processing (SMP) platforms are also suited for use here.
The data warehouse 100 includes one or more processing modules 1051 . . . N that managed the storage and retrieval of data and data storage facilities 1101 . . . N. Each of the processing modules 1051 . . . N manages a portion of a database that is stored in a corresponding one of the data-storage facilities 1101 . . . N. Each of the data-storage facilities 1101 . . . N includes one or more disk drives.
The system stores data in one or more tables in the data-storage facilities 1101 . . . N. The rows 1151 . . . Z of the tables are stored across multiple data-storage facilities 1101 . . . N to ensure that the system workload is distributed evenly across the processing modules 1051 . . . N. A parsing engine 120 organizes the storage of data and the distribution of table rows 1151 . . . Z among the processing modules 1051 . . . N. The parsing engine 120 also coordinates the retrieval of data from the data-storage facilities 1101 . . . N over network 125 in response to queries received from a user at a mainframe 130 or a client computer 135 connected to a network 140. The database system usually receives queries and commands to build tables in a standard format, such as SQL.
Once the session control 200 allows a session to begin, a user may submit an SQL request, which is routed to the parser 205. As illustrated in
If an existing plan is found by the stored plan checker 305 then some of the following steps can be skipped as indicated by alternate path 310.
The parser checks the request for proper SQL syntax (block 315), evaluates it semantically (block 320) and consults a data dictionary to ensure that all of the objects specified in the SQL request actually exist and the user has the authority to perform the request (block 325). Finally, the parser 205 runs an optimizer (block 330) that develops the least expensive plan to perform the request.
The query processing architecture described above for most relational database systems is divided into a compile time sub-system 120, 205, to parse and optimize the SQL request and a separate run time sub-system implemented by processing modules 1051 . . . N to interpret the plan and execute the query.
Set out below is a typical query that could be used to interrogate table 400.
The above SQL statement is known as a static SQL statement. The full text of this statement is known at compile time and the statement does not change from execution to execution.
The above static query can be rewritten as a USING query. The advantage of a USING query is that the search term dept_name is not hard coded but any department can be issued at run time. The query is able to search for dept_name values other than ‘Software Development’.
The following sets out a USING query. The USING variable can be used with different values when the query is issued. For example the USING value can be bound to ‘Software Development’ for one query and to ‘Human Resources’ for another query.
In some types of SQL statements, known as dynamic SQL statements, the full text of the statement is unknown until run time. Such statements can and probably will change from execution to execution.
A dynamic SQL feature is a technique for generating and executing SQL commands dynamically at run time. Dynamic SQL queries are prepared at program execution time, not compilation time. This means that the application program compiler cannot check for errors at compilation time. It also means that programs can create specialized queries in response to user or other input. Some programs must build and process SQL statements where some information is not known in advance. A reporting application might build different SELECT statements for the reports it generates, substituting new table and column names and ordering or grouping by different columns. Such statements are called dynamic SQL statements.
In a dynamic SQL statement the column names for example may not be known at application program compile time. Depending on the user request the column of interest would be determined and the query built in a piecemeal fashion by the application based on user input and based perhaps on analysis and even database accesses.
Dynamic SQL statements contain place holders for bind arguments. Where an SQL statement includes several bind arguments, all bind arguments can be placed in the SQL statement with a using clause.
The optimizer 300 described above has access to statistics previously requested by the user to be collected on one or more of the tables stored on data-storage facilities 110.
Statistics 500 are typically stored in a data dictionary. The statistics include an initial interval 505 that is also referred to as interval zero. Interval zero includes basic or general information about the table and includes, for example, a collection date 510 representing the date the statistics were collected, general table information 515, a minimum value 520 representing the smallest value in the column of table 400, a row count 525 representing the total count or number of rows in the table, and a null count 530 representing the total number of null values in the table.
Following interval zero is data representing each of the 100 intervals, indicated as 5401, 5402 and so on to 540100. Each interval 5401 . . . 100 in turn includes the mode value 5451 . . . 100 representing the most frequently occurring value in that interval and the number or frequency 5501 . . . 100 of those occurrences, the number of other unique values 5551 . . . 100 in that interval, the total number 5601 . . . 100 of those occurrences and the maximum value 5651 . . . 100 representing the highest value in that interval. It will be appreciated that the statistics 500 in some circumstances include other statistical data 5701 . . . 100 depending on the purpose for which the statistics are collected.
The minimum, maximum and distribution of values are used to compute the selectivity of a value. The optimizer can then use the selectivity of values when it determines query cost estimates. Selectivity is an estimate of the percentage of rows that will be returned by a filter in a query. Selectivity values typically range from 0.0 to 1.0, where 0.0 indicates a very selective filter that passes very few rows and 1.0 indicates a filter that passes almost all rows. The optimizer uses selectivity information to reorder expressions in the query predicate so that filters that are expensive to call given the values of their arguments are evaluated after filters that are inexpensive to call. One example in a query is whether to perform a join first or an aggregation first. Therefore the optimizer reduces the number of comparisons and improves performance.
When a dynamic SQL statement having a using clause is received, it is necessary to determine whether or not the plan generated with the using data values can be reused. If it is determined that the plan can be reused, it is sensible for the system to cache the plan. Otherwise it is not sensible to cache the plan. A cached plan must therefore include a plurality of pass-fail criteria besides just the execution plan.
The data values to be inserted into the dynamic SQL query with the using statement have a data parcel size. One technique involves examining the data parcel size. Once the using data values are bound in the dynamic query, the query will have a certain length known as a data parcel size. A previously stored plan would have been generated with one set of using data values. The new set of using data values are bound into the dynamic query and the length of the bound query is compared with the previously bound length and type of value of the cached plan. If the data size is the same or very similar, this is one indicator that the new using data values could be similar to the old using data values and the cached plan may be reused for executing the dynamic query. One problem with solely relying on this comparison is that there can be a lot of selectivity variation given a small variation in data parcel size.
Another technique is to determine the extent to which a using value is going to affect a stored plan. In other words this involves determining the sensitivity of a stored plan to changes in using values. This can be estimated by the selectivity of each predicate along with its confidence level. As described above, the selectivity of a predicate is an estimate of the percentage of rows that will be returned by a predicate in a query. The confidence level is an indicator of the number of rows that the statistics compiler has scanned when compiling the statistics. A high confidence level value indicates that a high number of rows of the database have been scanned.
If the selectivity of every using predicate has no confidence associated with it then this will lead to conservative plans. Conservative plans will not actively filter results during query execution and will return a larger percentage of rows of the table as results. Such plans can be reused for other using data values also.
If the selectivity of each predicate has low confidence then this means that a small number of rows have been scanned in order to estimate the selectivity of the predicates. In this case the data parcel size described above as well as a selectivity sensitivity threshold explained below can be used to determine whether the plan can be reused.
If the confidence level of each predicate in the cached plan is high then the selectivity with the new using data values should be calculated and if each of the respective calculated selectivity values are the same as the cached selectivity values then the plan can be reused.
As an alternative to testing for strict equality of selectivity values, this criteria can be relaxed by using a threshold. When an original plan is generated, each of the selectivity values generated with high confidence can be annotated with the extent to which it impacts the plan. In other words it is possible to store along with each selectivity value a threshold value within which the generated plan does not change. This threshold can be used when a calculated selectivity is compared with a cached selectivity. This is known as a selectivity sensitivity threshold.
A sensitivity threshold could specify for example that calculated selectivity values of between 10 and 50 mean that the plan can be reused. Calculated selectivity values of lower than 10 or greater than 50 mean that the plan cannot be reused and a different plan should be used.
The selectivity sensitivity thresholds can be determined by experiment by executing the dynamic query with different using values. The particular plan that is executed given the varying using values can be used to determine the selectivity sensitivity threshold.
Some database systems use partitions. A further test to determine whether or not a cached plan can be reused is to calculate the number of partitions scanned. If it is determined that two high confidence selectivity predicates can result in differing numbers of partitions being statically eliminated then the partition ranges that are eliminated can also be stored.
The text above describes one or more specific embodiments of a broader invention. The invention also is carried out in a variety of alternative embodiments and thus is not limited to those described here. Those other embodiments are also within the scope of the following claims.