1. Technical Field
The field of the invention is data processing, or, more specifically, methods, apparatus, and products for monitoring and managing database queries for improving performance.
2. Description of Related Art
The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely complicated devices. Today's computers are much more sophisticated than early systems such as the EDVAC. Computer systems typically include a combination of hardware and software components, application programs, operating systems, processors, buses, memory, input/output devices, and so on. As advances in semiconductor processing and computer architecture push the performance of the computer higher and higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.
Information stored on a computer system is often organized in a structure called a database. A database is a grouping of related structures called ‘tables,’ which in turn are organized in rows of individual data elements. The rows are often referred to as ‘records,’ and the individual data elements are referred to as ‘fields.’ In this specification generally, therefore, an aggregation of fields is referred to as a ‘data structure’ or a ‘record,’ and an aggregation of records is referred to as a ‘table.’ An aggregation of related tables is called a ‘database.’
A computer system typically operates according to computer program instructions in computer programs. A computer program that supports access to information in a database is typically called a database management system or a ‘DBMS.’ A DBMS is responsible for helping other computer programs access, manipulate, and save information in a database.
A DBMS typically supports access and management tools to aid users, developers, and other programs in accessing information in a database. One such tool is the structured query language, ‘SQL.’ SQL is a query language for requesting information from a database. Although there is a standard of the American National Standards Institute (‘ANSI’) for SQL, as a practical matter, most versions of SQL tend to include many extensions. Here is an example of a database query expressed in SQL:
This SQL query accesses information in a database by selecting records from two tables of the database, one table named ‘stores’ and another table named ‘transactions.’ The records selected are those store records having the value “Minnesota” in their store location fields, together with the transactions for the stores in Minnesota. In retrieving the data for this SQL query, an SQL engine will first retrieve records from the stores table and then retrieve records from the transactions table. Records that satisfy the query requirements are then merged in a ‘join.’
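By way of illustration only, a query of the kind described above can be sketched with Python's built-in sqlite3 module. The query text below is a reconstruction from the description, not a reproduction of any particular listing. The column names location, transactionID, and amount, the sample rows, and the helper transactions_for_location are assumptions made for the sketch. The `?` placeholder plays the role of a host variable.

```python
import sqlite3

# In-memory database holding the two tables named in the text.
# Column names other than storeID are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (storeID INTEGER PRIMARY KEY, location TEXT);
    CREATE TABLE transactions (
        transactionID INTEGER PRIMARY KEY,
        storeID INTEGER REFERENCES stores(storeID),
        amount REAL
    );
    INSERT INTO stores VALUES (1, 'Minnesota'), (2, 'Iowa');
    INSERT INTO transactions VALUES (10, 1, 9.99), (11, 1, 4.50), (12, 2, 20.00);
""")

# Join stores to transactions on storeID, filtered by store location.
QUERY = """
    SELECT stores.storeID, transactions.transactionID, transactions.amount
    FROM stores, transactions
    WHERE stores.location = ?
      AND stores.storeID = transactions.storeID
    ORDER BY transactions.transactionID
"""

def transactions_for_location(location):
    """Return the joined rows for every store at the given location."""
    return conn.execute(QUERY, (location,)).fetchall()

rows = transactions_for_location("Minnesota")
```

Because the location is supplied as a parameter rather than embedded in the query text, the same query text can be run with different values.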
In many systems, the SQL queries are parsed, a logical plan created, and at least one, often multiple physical plans created for executing the logical plan to execute the SQL query. The multiple physical plans arrive at the same correct output, but can take greatly varying times to arrive at that output, depending on which plan is selected for execution. The best plan to execute is usually the plan having the lowest/cheapest expected cost, typically selected by the query optimizer.
In database query processing, the algorithms used to implement a query are based on the ‘best’ plan that the query optimizer selects using statistics over the underlying tables and columns. This is called the cost-based model and is the de facto standard for databases.
One problem with this mechanism is that the chosen plan is selected based on the lowest expected cost. However, in practice, this selection process sometimes chooses a very inferior plan primarily because the available statistics fail to match reality during this execution. The resulting long running queries can be a major source of user frustration, troubleshooting, and support costs.
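The following minimal sketch illustrates how a lowest-expected-cost selection can go wrong when the statistics do not match reality. The linear cost model, the plan names, and all numbers are hypothetical and chosen only to make the effect visible; they do not describe any particular optimizer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhysicalPlan:
    name: str
    fixed_cost: float    # hypothetical startup cost
    cost_per_row: float  # hypothetical cost for each qualifying row

    def cost(self, rows: int) -> float:
        """Cost of running this plan when `rows` rows qualify."""
        return self.fixed_cost + self.cost_per_row * rows

def choose_plan(plans, estimated_rows):
    """Cost-based selection: pick the plan with the lowest expected cost."""
    return min(plans, key=lambda p: p.cost(estimated_rows))

PLANS = [
    PhysicalPlan("index_probe", fixed_cost=10.0, cost_per_row=2.0),
    PhysicalPlan("table_scan", fixed_cost=1000.0, cost_per_row=0.01),
]

estimated_rows = 50      # what stale statistics predict
actual_rows = 100_000    # what this execution really touches

chosen = choose_plan(PLANS, estimated_rows)
best_in_hindsight = choose_plan(PLANS, actual_rows)
```

With these assumed numbers the optimizer picks the index probe, which is cheapest for 50 rows, yet for the rows actually touched the table scan would have been far cheaper.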
The problem of long running queries can be addressed by optimizing query plans, and the problem of long running query optimizations can be addressed by storing optimized plans in a cache for re-use in the appropriate situations, should they arise again. Previously optimized plans can be re-optimized in an attempt to obtain better query processing times. However, this should not be done indiscriminately, as re-optimization itself uses resources. In addition, not all previously optimized plans can be stored indiscriminately and forever, as this would also consume too many resources.
Improved methods for deciding how often to force re-optimization of a query and how many different plans to store for a query would be advantageous.
Methods, systems, and computer program products are provided for managing a database (DB) query system having a DB query plan repository, where the DB query plan repository can store more than one DB query plan for each DB query. One such method includes steps which, for each DB query plan for a DB query, determine a volatility score for the DB query plan, and for each DB query, determine a number of DB query plans to store for the DB query at least in part as a function of the DB query plan volatility score. In some embodiments, the DB query plan volatility score is determined at least in part as a function of a value contained in the DB query. The DB query plan volatility score may be determined at least in part as a function of a DB table statistic in some methods. The DB table statistic can be selected from at least one of the group of skew, cardinality, selectivity, clusteredness, and combinations thereof, depending on the embodiment.
In some embodiments, the DB query plan volatility score is determined at least in part as a function of actual run time data for the DB query plan. The actual run time data can be selected from at least one of minimum run time, maximum run time, and average run time. Some DB queries involve at least one index, in which case the plan volatility score is determined at least in part as a function of the number of indices involved in the query. A DB query may involve at least one table, in which case the plan volatility score is determined at least in part as a function of the number of tables involved in the query.
Some methods according to the present invention include displaying the plan volatility scores, and may include accepting user input to set a plan volatility score. In some methods, plan re-optimization is determined at least in part as a function of the plan volatility score and a threshold, where the threshold can be manipulated by the user. The number of plans stored can be determined at least in part by user input in some embodiments.
Some embodiments of the present invention assign a volatility score to each of the DB query plans. The volatility score can provide a numerical indication of how changes in the Host Variable Values (HVVs) affect the optimized plan. The volatility score can be used to determine how many individual plans should be stored for a given query in the plan cache. The volatility score can be used in conjunction with other plan scores to determine whether a plan should stay in a pseudo-open mode or whether the plan should be re-optimized more frequently. In a pseudo-open mode the plan is essentially ready to run, and may have a cursor serving as an entry point into the query.
One practical application of the volatility score may be seen when working with commonly run customer queries over very large databases (VLDBs). In one example, a very large join network with many tables having skewed data and correlated data is involved in the query. The best or “good enough” plan can be highly dependent upon the host variable values. Therefore, the volatility score along with another plan score can be used to urge more numerous optimized plans to be saved for the query.
In some embodiments, the SQL Query engine maintains a volatility score for every plan in the plan cache. The volatility score is externalized in some embodiments. Externalizing the score can allow users to force the optimizer to store more optimized plans for a given query. This can also be used to signal the optimizer to perform optimizations more often and/or to perform deeper optimizations, depending on the embodiment.
In one example of the invention, for each query, the optimizer determines the volatility of the underlying databases and the columns referenced. The volatility may be at least in part a function of factors such as data skew, correlation effect (e.g. month and month-name referenced in the same query), as well as how often the plan has been seen with differing host variable values. The volatility score may also take into account the maximum, minimum, and average execution time and/or the maximum, minimum, and average optimization time.
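One way such factors might be combined is a weighted average, sketched below. The function name combined_volatility, the weights, the normalization of every factor to the range 0 to 1, and the use of the ratio of distinct host variable values to executions are all assumptions made for illustration; the text does not prescribe a formula.

```python
def combined_volatility(skew, correlation, distinct_hvv_count, executions,
                        weights=(0.4, 0.3, 0.3)):
    """Blend three factors into one score between 0 and 1.

    skew and correlation are assumed to be pre-normalized to [0, 1].
    The third factor measures how often the plan has been seen with
    differing host variable values, relative to the number of executions.
    The weights are hypothetical.
    """
    if executions <= 0:
        raise ValueError("executions must be positive")
    hvv_variety = distinct_hvv_count / executions
    factors = (skew, correlation, hvv_variety)
    clamped = [min(max(f, 0.0), 1.0) for f in factors]
    return sum(w * f for w, f in zip(weights, clamped)) / sum(weights)

# A plan over evenly distributed, uncorrelated data that is always run
# with the same value, versus one over skewed, correlated data that is
# run with a new value almost every time.
calm = combined_volatility(0.05, 0.0, distinct_hvv_count=1, executions=100)
jumpy = combined_volatility(0.9, 0.8, distinct_hvv_count=95, executions=100)
```

Execution and optimization times could be folded in as further weighted factors in the same way.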
In some embodiments, when a query is in pseudo-open mode, the optimizer can determine whether the host variable values have changed in such a way that would warrant a re-optimization or using a secondary plan stored in the plan repository. This determination can be based at least in part on the plan volatility score and/or database statistics, depending on the embodiment. When a query is not in pseudo-open mode and is attempting to be matched to an existing plan, the optimizer may use the volatility score along with another plan score and the host variable values (and sub-combinations thereof) to determine whether another version of the optimized plan should be maintained.
Some embodiments of the present invention also include a system for processing database queries, the system including a computer processor and a computer memory operatively coupled to the computer processor. The computer memory can have disposed within it computer program instructions capable of executing the various methods described in the present application. Also provided is a computer program product for processing database queries, the computer program product disposed in a computer readable signal bearing medium. The computer program product includes computer program instructions capable of executing the various methods described in the present application.
The foregoing and other features and aspects of the invention will be apparent from the following more particular descriptions of exemplary embodiments of the invention as illustrated in the accompanying drawings, wherein like reference numbers generally represent like parts of exemplary embodiments of the invention.
In the example of
The arrangement of servers and other devices making up the exemplary system illustrated in
The exemplary SQL module 260 of
This plan represents database functions to scan through the stores table and, for each stores record, join all transactions records for the store. The transactions for a store are identified through the storeID field acting as a foreign key. The fact that a selection of transactions records is carried out for each store record in the stores table identifies the join function as iterative.
The exemplary plan generator 256 of
The exemplary plan generator 256 also includes an optimizer 254, implemented as computer program instructions that optimize the plan, and thereby the execution of SQL queries against DBMS 250, in dependence upon database management statistics 264. Database statistics are typically implemented as metadata of a table, such as, for example, metadata of tables of database 262 or metadata of database indexes. Database statistics may include, for example:
These three database statistics are presented for explanation only, not for limitation. Such database statistics can be used together with the values in a particular query to decide which physical plan to use and whether a new plan should be generated.
The exemplary SQL module 260 of
The SQL module 260 of
The SQL module 260 of
The computer 152 of
The exemplary computer 152 of
The example computer of
If no new plan is generated in light of the past run time history, the database statistics, or the HVV, then the risk is taken that the plan executed may take much longer to run than expected, for example, 5 minutes instead of 5 seconds. If a new plan is generated every time a query is received, then the cached plans are of little use and the time required to generate the new plans will itself slow down the query processing. Some embodiments provide ways to improve decision making as to whether or not a new plan should be generated.
Volatility scores can be stored which reflect the minimum, maximum, and average run times of previous executions of each plan in the plan cache. When a plan is found in the cache that is otherwise suitable for the query, the past run time data can provide a gauge of the volatility of the plan. For example, a min, max, and average run time which are close in range to each other would indicate a low degree of volatility or low volatility score, as would a max and average run time close to each other.
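A minimal sketch of a run-time-based score follows. The particular formula, the gap between maximum and average run time relative to the maximum, is an assumption chosen because it is small both when all three values are close and when the maximum and average alone are close; other formulas would serve equally well.

```python
def run_time_volatility(min_time, max_time, avg_time):
    """Score in [0, 1): near 0 for consistent run times, near 1 for erratic ones.

    Assumed formula: (max - avg) / max. The minimum is used only to
    validate that the three values are mutually consistent.
    """
    if min_time < 0 or not (min_time <= avg_time <= max_time):
        raise ValueError("expected 0 <= min <= avg <= max")
    if max_time == 0:
        return 0.0
    return (max_time - avg_time) / max_time

# Run times in seconds (hypothetical history for two cached plans).
steady = run_time_volatility(min_time=4.8, max_time=5.3, avg_time=5.0)
erratic = run_time_volatility(min_time=5.0, max_time=300.0, avg_time=20.0)
```

Here the first plan, whose run times all fall within half a second of one another, scores far lower than the second, whose worst run is many times its average.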
The database statistics can also be used to generate a volatility score. If the query operates on a certain column, then statistics for that column can be used to generate a volatility score for the plan, given the HVV being selected for in that column. In one example, if the HVV is found infrequently in the column, then a plan using an index may be beneficial. If the HVV is found frequently in the column, then a full table scan may be beneficial. In another example, if the table column is highly skewed, then the query plan execution is likely to be more volatile, and a plan utilizing an index may be called for. In some embodiments, the clusteredness may be used to affect the volatility score. Clustered data values are grouped together rather than evenly or randomly distributed. Clustered data may suggest a plan using an index and may increase volatility, as the execution time may be more dependent on the HVV.
In some embodiments, the plan volatility is compared to a threshold acceptable volatility, and a new plan is generated if the existing plan is more volatile than the threshold. In one example, different thresholds are associated with different query types, with some query types having a low threshold and therefore a low tolerance for plan volatility. Some embodiments allow display of plan volatility to users and also allow user manipulation of plan volatility scores and/or acceptable volatility thresholds.
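The relationship between how often an HVV occurs in a column and the access path that may be beneficial can be sketched as follows. The function names, the 10% cutoff, and the skew measure (the share of rows held by the most common value) are assumptions for illustration; a real optimizer would work from stored statistics rather than scanning the column.

```python
from collections import Counter

def selectivity(column_values, hvv):
    """Fraction of rows in the column whose value equals the HVV."""
    if not column_values:
        return 0.0
    return Counter(column_values)[hvv] / len(column_values)

def suggest_access_path(column_values, hvv, index_cutoff=0.10):
    """Rare values favor an index; common values favor a full table scan.

    index_cutoff is a hypothetical tuning constant.
    """
    if selectivity(column_values, hvv) <= index_cutoff:
        return "index"
    return "table_scan"

def skew(column_values):
    """Share of rows held by the single most common value (assumed measure)."""
    if not column_values:
        return 0.0
    return max(Counter(column_values).values()) / len(column_values)

# A heavily skewed store-location column: most stores are in one state.
locations = ["MN"] * 90 + ["IA"] * 8 + ["WI"] * 2
rare_path = suggest_access_path(locations, "WI")
common_path = suggest_access_path(locations, "MN")
```

On this skewed column the same query text calls for different access paths depending on the HVV, which is the situation in which a single cached plan tends to be volatile.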
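The threshold comparison described above can be sketched as follows. The query type names, the threshold values, and the default are hypothetical. Passing a different mapping stands in for user manipulation of the thresholds.

```python
DEFAULT_THRESHOLD = 0.5  # hypothetical fallback for unlisted query types

# Hypothetical per-query-type thresholds: a low value means a low
# tolerance for plan volatility.
THRESHOLDS = {"interactive": 0.2, "batch": 0.8}

def needs_new_plan(volatility, query_type, thresholds=THRESHOLDS):
    """Return True when the existing plan is more volatile than allowed."""
    limit = thresholds.get(query_type, DEFAULT_THRESHOLD)
    return volatility > limit

# The same plan volatility is unacceptable for one query type
# and acceptable for another.
regenerate_interactive = needs_new_plan(0.4, "interactive")
regenerate_batch = needs_new_plan(0.4, "batch")

# A user raising the interactive threshold changes the outcome.
user_thresholds = {**THRESHOLDS, "interactive": 0.5}
regenerate_after_override = needs_new_plan(0.4, "interactive", user_thresholds)
```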
Some embodiments of the invention provide different plans for queries which differ only in one or more HVVs. Highly volatile plans may be highly variable in run time as a function of the HVV, which may call for different plans for different HVVs.
Step 404 can also be used to validate the existing plans, for example, to make sure that any indices relied on for the plan still exist. If the existing plan is invalid, then a new plan may be generated.
If an existing plan is acceptable, it can be selected in step 406. If not, a new plan can be generated in step 408. In step 410, the plan can be executed, with the query results returned to the user in step 412.
In step 414, the run time of this execution of the plan can be used to update the historical data for the executed plan, for example the minimum, maximum, and average run time. This data can also be used to update the volatility score for this plan.
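The historical data can be updated incrementally, without retaining every past run time, as in the sketch below. The class name RunHistory and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class RunHistory:
    """Running minimum, maximum, and average run time for one plan."""
    runs: int = 0
    min_time: float = float("inf")
    max_time: float = 0.0
    avg_time: float = 0.0

    def record(self, run_time: float) -> None:
        """Fold the run time of one more execution into the history."""
        self.runs += 1
        self.min_time = min(self.min_time, run_time)
        self.max_time = max(self.max_time, run_time)
        # Incremental mean: shift the average toward the new value.
        self.avg_time += (run_time - self.avg_time) / self.runs

history = RunHistory()
for seconds in (5.0, 7.0, 6.0):
    history.record(seconds)
```

After each call to record, a volatility score for the plan can be recomputed from the three updated values.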
In step 416, the plan can be stored in the plan repository if new, or the existing plan historical and volatility attributes updated. In some embodiments, the decision as to whether to store a new plan or even prune an existing plan can be made in step 416. Some embodiments have a limit on the number of plans to store for a query and a limit may be reached at this point, with the newest plan either not stored or an older plan purged to make room. The plan volatility scores can be used to determine how many plans to store, with high volatility scores suggesting a larger number of otherwise similar plans being stored for the same query. In some embodiments the size of the plan repository can be determined on the fly, in a step similar to step. In other embodiments, such decisions can be made asynchronously, by jobs running in the background.
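The storing and pruning behavior of step 416 can be sketched as a repository whose per-query limit grows with the volatility score. The linear mapping in plans_to_keep, the bounds of 1 and 5 plans, the class name PlanRepository, and the choice to purge the oldest plan first are all assumptions; the text equally allows the newest plan simply not to be stored.

```python
def plans_to_keep(volatility, min_plans=1, max_plans=5):
    """Map a volatility score in [0, 1] to a per-query plan limit."""
    v = min(max(volatility, 0.0), 1.0)
    return min_plans + round(v * (max_plans - min_plans))

class PlanRepository:
    """Stores several plans per query, purging the oldest when over limit."""

    def __init__(self):
        self._plans = {}  # query text -> list of plans, oldest first

    def store(self, query, plan, volatility):
        plans = self._plans.setdefault(query, [])
        plans.append(plan)
        excess = len(plans) - plans_to_keep(volatility)
        if excess > 0:
            del plans[:excess]  # purge the oldest plans to make room

    def plans_for(self, query):
        return list(self._plans.get(query, []))

repo = PlanRepository()
for name in ("plan_a", "plan_b", "plan_c", "plan_d"):
    repo.store("volatile query", name, volatility=0.5)  # limit of 3 plans
    repo.store("stable query", name, volatility=0.0)    # limit of 1 plan
```

The same store method could be invoked by a background job instead of inline, in which case the pruning decision is made asynchronously.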
Exemplary embodiments of the present invention are described largely in the context of a fully functional computer system for processing database queries. Readers of skill in the art will recognize, however, that the present invention also may be embodied in a computer program product disposed on signal bearing media for use with any suitable data processing system. Such signal bearing media may be transmission media or recordable media for machine-readable information, including magnetic media, optical media, or other suitable media. Examples of recordable media include magnetic disks in hard drives or diskettes, compact disks for optical drives, magnetic tape, and others as will occur to those of skill in the art. Examples of transmission media include telephone networks for voice communications and digital data communications networks such as, for example, Ethernets™ and networks that communicate with the Internet Protocol and the World Wide Web as well as wireless transmission media such as, for example, networks implemented according to the IEEE 802.11 family of specifications. Persons skilled in the art will immediately recognize that any computer system having suitable programming means will be capable of executing the steps of the method of the invention as embodied in a program product. Persons skilled in the art will recognize immediately that, although some of the exemplary embodiments described in this specification are oriented to software installed and executing on computer hardware, nevertheless, alternative embodiments implemented as firmware or as hardware are well within the scope of the present invention.
It will be understood from the foregoing description that modifications and changes may be made in various embodiments of the present invention without departing from its true spirit. The descriptions in this specification are for purposes of illustration only and are not to be construed in a limiting sense. The scope of the present invention is limited only by the language of the following claims.