When executing a request (e.g., a query or a utility), a database processing system often has more than one processing plan that it can follow. The database processing system typically uses an optimizer to choose the best processing plan. Traditionally, the optimizer chooses the lowest cost plan. In a database processing system that operates on relations (e.g., tables), cost is measured by, for example, the number of output rows in the result, the amount of CPU time that will be consumed by executing the request, the amount of memory that will be consumed by executing the request, etc.
It has also been recognized that semantic information stored in the database as, for example, integrity constraints could be used by the optimizer in choosing the best processing plan. For example, it may be possible to optimize the execution of a query by eliminating a join based on a foreign key constraint. This form of optimization is called semantic query optimization.
In general, in one aspect, the invention features a method for optimizing the execution of a SQL request on a database system. The database system has a state. The request has characteristics and the state has characteristics. The method includes parsing the SQL request to create a logical plan, enumerating the logical plan into a plurality of physical plans, costing the plurality of physical plans using current environmental and data demographics of the database system to produce cost information, and selecting one of the plurality of physical plans to execute the request using the cost information and workload profile information. Workload profile information includes one or more characteristics of the request and the state of the database system. The method further includes executing the request using the selected physical plan to produce results.
Implementations of the invention may include one or more of the following. Using the workload profile information may include using characteristics of the request including one or more of the following: (a) service level goals and priorities, and (b) application types. Selecting a plan to execute the request using the workload profile information may include using characteristics of the state of the database system including one or more of the following: (a) memory availability, (b) task availability, (c) disk availability, (d) CPU availability, (e) lock granularity, (f) type of scan that is being run by the system for another request, and (g) request arrival rate. Enumerating the logical plan into a plurality of physical plans may include consideration of one or more of the following types of joins in the request, one or more access paths available for the tables join geographies, and the order of execution of the joins in the request.
In general, in another aspect, the invention features a method for optimizing the execution of a SQL request on a database system. The database system has a state. The SQL request has been parsed to create a logical plan. The logical plan has been enumerated into a plurality of physical plans. Each of the plurality of physical plans has been costed using current environmental and data demographics of the database system to produce cost information. The method includes selecting one of the plurality of physical plans to execute the query using the cost information and workload profile information. Workload profile information includes one or more characteristics of the request and one or more characteristics of the state of the database system.
In general, in another aspect, the invention features a computer program, stored on a tangible storage medium, for use in optimizing the execution of a SQL request on a database system. The database system has a state. The request has characteristics and the state has characteristics. The program includes executable instructions that cause a computer to parse the SQL request to create a logical plan, enumerate the logical plan into a plurality of physical plans, cost the plurality of physical plans using current environmental and data demographics of the database system to produce cost information, and select one of the plurality of physical plans to execute the request using the cost information and workload profile information. Workload profile information includes one or more characteristics of (a) the request and (b) the state of the database system. The program further includes executable instructions that cause the computer to execute the request using the selected physical plan to produce results.
In general, in another aspect, the invention features a computer program, stored on a tangible storage medium, for use in executing a SQL request on a database system. The database system has a state. The request has characteristics and the state has characteristics. The SQL request has been parsed to create a logical plan. The logical plan has been enumerated into a plurality of physical plans. Each of the plurality of physical plans has been costed using current environmental and data demographics of the database system to produce cost information. The program includes executable instructions that cause a computer to select one of the plurality of physical plans to execute the query using the cost information and workload profile information. Workload profile information includes one or more characteristics of the request and the state of the database system.
In general, in another aspect, the invention features a system including a massively parallel processing system including one or more nodes, a plurality of CPUs, each of the one or more nodes providing access to one or more CPUs, a plurality of data storage facilities, each of the one or more CPUs providing access to one or more data storage facilities, and a process for optimizing the execution of a SQL request on a database system. The database system has a state. The state has characteristics and the request has characteristics. The process includes parsing the SQL request to create a logical plan, enumerating the logical plan into a plurality of physical plans, costing the plurality of physical plans using the current environmental and data demographics of the database system to produce cost information, and selecting one of the plurality of physical plans to execute the request using the cost information and workload profile information. Workload profile information includes one or more characteristics of the request and the state of the database system. The process further includes executing the request using the selected physical plan to produce results.
In general, in another aspect, the invention features a system including a massively parallel processing system including one or more nodes, a plurality of CPUs, each of the one or more nodes providing access to one or more CPUs, a plurality of data storage facilities, each of the one or more CPUs providing access to one or more data storage facilities, and a process for optimizing the execution of a SQL request on a database system. The database system has a state. The state has characteristics and the request has characteristics. The SQL request has been parsed to create a logical plan. The logical plan has been enumerated into a plurality of physical plans. Each of the plurality of physical plans has been costed using the current environmental and data demographics of the database system to produce cost information. The method includes selecting one of the plurality of physical plans to execute the query using the cost information; and workload profile information. Workload profile information includes one or more characteristics of the request and the state of the database system.
In general, in another aspect, the invention features a method for optimizing the execution of a SQL request on a database system. The database system has a state. The state has characteristics and the request has characteristics. The method includes parsing the SQL request to create a logical plan, enumerating the logical plan into a plurality of physical plans, costing the plurality of physical plans using the current environmental and data demographics of the database system and workload profile information to produce cost information. Workload profile information includes one or more characteristics of the request and the state of the database system. The method further includes selecting one of the plurality of physical plans to execute the request using the cost information and executing the request using the selected physical plan to produce results.
The technique for profile based optimization disclosed herein has particular application, but is not limited, to large databases that might contain many millions or billions of records managed by a database system (“DBMS”) 100, such as a Teradata Active Data Warehousing System available from NCR Corporation.
For the case in which one or more virtual processors are running on a single physical processor, the single physical processor swaps between the set of N virtual processors.
For the case in which N virtual processors are running on an M-processor node, the node's operating system schedules the N virtual processors to run on its set of M physical processors. If there are 4 virtual processors and 4 physical processors, then typically each virtual processor would run on its own physical processor. If there are 8 virtual processors and 4 physical processors, the operating system would schedule the 8 virtual processors against the 4 physical processors, in which case swapping of the virtual processors would occur.
Each of the processing modules 1101 . . . N manages a portion of a database that is stored in a corresponding one of the data-storage facilities 1201 . . . N. Each of the data-storage facilities 1201 . . . N includes one or more disk drives. The DBMS may include multiple nodes 1052 . . . O in addition to the illustrated node 1051, connected by extending the network 115.
The system stores data in one or more tables in the data-storage facilities 1201 . . . N. The rows 1251 . . . Z of the tables are stored across multiple data-storage facilities 1201 . . . N to ensure that the system workload is distributed evenly across the processing modules 1101 . . . N. A parsing engine 130 organizes the storage of data and the distribution of table rows 1251 . . . Z among the processing modules 1101 . . . N. The parsing engine 130 also coordinates the retrieval of data from the data-storage facilities 1201 . . . N in response to queries received from a user at a mainframe 135 or a client computer 140. The DBMS 100 usually receives queries and commands to build tables in a standard format, such as SQL.
In one implementation, the rows 1251 . . . Z are distributed across the data-storage facilities 1201 . . . N by the parsing engine 130 in accordance with their primary index. The primary index defines the columns of the rows that are used for calculating a hash value. The function that produces the hash value from the values in the columns specified by the primary index is called the hash function. Some portion, possibly the entirety, of the hash value is designated a “hash bucket”. The hash buckets are assigned to data-storage facilities 1201 . . . N and associated processing modules 1101 . . . N by a hash bucket map. The characteristics of the columns chosen for the primary index determine how evenly the rows are distributed.
In one example system, the parsing engine 130 is made up of three components: a session control 200, a parser 205, and a dispatcher 210, as shown in
Once the session control 200 allows a session to begin, a user may submit a SQL request, which is routed to the parser 205. As illustrated in
The optimizer (block 320) performs a number of processes, examples of which are illustrated in
In addition, as part of enumerating the logical plan the optimizer (block 320) may rewrite some requests to generate additional plan options. For example, semantic optimization can be used to generate plans with a join removed because of a primary key/foreign key (PK/FK) definition in the data model. For example, the following query:
The optimizer (block 320) may cost (block 425) the enumerated physical plans 420 using current system data 430 to produce cost information for the physical plans 435. Traditionally, the current system data includes environmental information (the number of nodes, the number of AMPS, interconnection information, memory, etc.), data demographics (row size, table cardinality, column demographics (skew, unique values, etc.), etc.), and rules/heuristics (join tables in the order they appear in the FROM clause, etc.). Traditionally, the optimizer (block 320) selects (block 440) the lowest cost physical plan using the cost information for the physical plans 435 to produce a selected plan 445.
An improved optimizer (block 320) may use workload profile information 450 to refine the cost model and to aid in selecting the physical plan to execute (block 440). Workload profile information 450, illustrated in
Workload profile information 450 also may be used by a database processing system to manage processing as described in the applications referenced in the “Cross Reference to Related Applications” section set out above. The systems described in those applications use the workload profile information 450 to determine how to apply system resources to received requests based on characteristics of the user (user information) and the request (request information) and based on the availability and status of system resources (system information). This same workload profile information may be used by the optimizer.
Characteristics of the request 605 may include request information and user information. Request information may include one or more of the following: an indication that the request is strategic, and an indication that the request is tactical. As an example of how such information is used in optimization, if a user that normally sends strategic requests sends a tactical request, the optimizer may select a plan that uses fine granular locking at a higher cost over a lower cost plan.
User information may include one or more of the following: user identification, account identification, identification of the application that produced the request, client identification, client address, client profile, and other information about the user. As an example of how such information is used in optimization, the system may receive a request from a user identified in the user information as a tactical user, i.e., a user that needs quick request turnaround. In that case, the optimizer may select a plan that uses fine granular locking at a higher cost over a lower cost plan.
Characteristics of the state of the database system may include one or more of the following: hardware configuration, memory availability, task availability, cache consumption, request arrival rate, skew, blocking, spool, CPU load, input/output load, network load, and other system information. As an example of how such information is used in optimization, if the system is short on disk space, the optimizer may select a physical plan that uses less temporary disk storage over a lower cost plan. As another example, the optimizer may decide that a table scan access join plan is preferable when a synchronized plan is already executing. As another example, if the system is already executing a query by running a sync scan, the optimizer may choose a sync scan over a random access plan for the request under consideration.
Once a plan is selected, the request may be executed following the selected plan (block 450) producing results 455.
In the example approach illustrated in
In an alternative approach, illustrated in
The text above described one or more specific embodiments of a broader invention. The invention also is carried out in a variety of alternative embodiments and thus is not limited to those described here. The foregoing description of the preferred embodiment of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.
This application is related to U.S. patent application Ser. No. 10/730,348, filed Dec. 8, 2003, entitled Administering the Workload of a Database System Using Feedback, by Douglas P. Brown, Anita Richards, Bhashyam Ramesh, Caroline M. Ballinger and Richard D. Glick, NCR Docket No. 11167; this application is related to U.S. patent application Ser. No. 10/786,448, filed Feb. 25, 2004, entitled Guiding the Development of Workload Group Definition Classifications, by Douglas P. Brown, Bhashyam Ramesh and Anita Richards, NCR Docket No. 11569; this application is related to U.S. patent application Ser. No. 10/889,796, filed Jul. 13, 2004, entitled Administering Workload Groups, by Douglas P. Brown, Anita Richards, and Bhashyam Ramesh, NCR Docket No. 11560; this application is related to U.S. patent application Ser. No. 10/915,609, filed Jul. 13, 2004, entitled Regulating the Workload of a Database System, by Douglas P. Brown, Bhashyam Ramesh, and Anita Richards, NCR Docket No. 11561; this application is related to U.S. patent application Ser. No. 11/254,374, filed Oct. 20, 2005, entitled Identifying Database Request Sources, by Douglas P. Brown, Anita Richards, and Bhashyam Ramesh, NCR Docket No. 11650; this application is related to U.S. patent application Ser. No. 11/295,409, filed Dec. 6, 2005, entitled A Closed-Loop Supportability Architecture, by Douglas P. Brown, Anita Richards, and Bhashyam Ramesh, NCR Docket No. 11651; this application is related to U.S. patent application Ser. No. 11/334,615, filed Jan. 18, 2006, entitled A Closed-Loop Validator, by Douglas P. Brown, Anita Richards, and Bhashyam Ramesh, NCR Docket No. 11654; this application is related to U.S. patent application Ser. No. 11/435,523, filed May 17, 2006, entitled Managing Database Utilities to Improve Throughput and Concurrency, by Anita Richards, Douglas P. Brown, Bruce Wayne Britton, and Todd A. Walter, NCR Docket No. 11646.