This description relates generally to databases, and more particularly to adaptively optimizing parallel database query execution plans based on system memory load.
Database systems are used to store information and relationship data that can be queried to find individual pieces of information, related pieces of information or relations between pieces of information. A typical parallel database system includes a coordinator node, or multiple coordinator nodes, along with multiple data processing nodes interconnected by a network.
In general, the coordinator nodes form the front end of the system that interfaces with client systems by way of the same or another network, and coordinate with the data processing nodes. Typically, parallel database clients submit queries to the coordination nodes, or coordinators, which in turn dispatch the queries to the data nodes for execution.
In some existing distributed parallel database systems, for example, massively parallel processing (MPP) database systems, multiple coordinator nodes and multiple data nodes together form a cluster of computing systems. In distributed database systems the tables of a database typically are divided into multiple sections, or partitioned, and the resulting partitions reside on multiple data nodes in the cluster.
In general, in both traditional, single-node, non-distributed relational database management systems and distributed relational database management systems, when a database receives a query, such as a structured query language (SQL) query, from a client the database system compiles the query, creates and optimizes a query execution plan, and executes the query execution plan. The database system then generates query results and sends the results back to the client.
In typical parallel database systems, the query plan compilation and optimization is carried out by the coordinator node, and the query is executed in parallel on all the nodes. Upon receiving a query, a coordinator invokes a query compiler to create a semantic tree based on the query. The query is parsed using aggregated statistics in the global catalog as if the database were running on a single computer. The coordinator then invokes a query planner that processes the semantic tree, creates and compares all possible query execution plans, and outputs an optimal query execution plan.
The query plan typically is subdivided into segments and parallelized for the number of distributed data nodes or data partitions in the system. Some query segments are executed on the coordinator nodes, and other query segments are executed on the data nodes. Thus, the coordinator sends the latter query plan segments to the various data nodes in the cluster for execution. Typically, the coordinator node passes the same query plan segment, or segments, to each of the individual data nodes, all of which execute the same query execution plan segment, or segments, against the various stored data partitions.
With regard to any particular query, the query planner considers multiple candidate query execution plans, any one of which the parallel database system is capable of processing and generating the results. For example, a typical query execution plan consists of database operators such as join, sort and aggregation operators. As an example, with regard to the join operator there are different join algorithms, including hash join, nested loop join and sort-merge join.
Because the operators and algorithms differ in efficiency, the cost of executing each plan varies substantially, even though all of the candidate plans produce the same final query output. The query planner takes into consideration system resources, such as memory and table partition statistics, when optimizing the algorithms for database operators. The optimizer function of the query planner on the coordinator node determines the optimal plan, for example, making a choice between an external merge sort operation and a quick sort operation, or deciding between a hash join operation and a nested loop join operation.
In some existing solutions, the concept of work memory, the amount of system memory area or space currently available for use regarding the query, drives the determination of the optimal execution plan. In general, existing solutions apply the concept of a fixed work memory to optimize query plans, without taking into consideration the discrepancies between loading of different data nodes over time. As a result, all of the data nodes typically execute the same plan segment, which is not always the optimal plan with respect to each of the data nodes.
Thus, due to factors such as non-uniform distribution of database table partitions across the various data nodes and the dynamic change of memory availability on different data nodes over time, the fixed work memory configuration sometimes results in a non-optimal query plan being selected for the data nodes. For example, given a system with substantial available memory, if the predetermined work memory is too small the query planner selects an external sort for a sorting operation, even though a quick sort operation under the circumstances could be more efficient.
Such optimization errors can result in general database performance degradation. As a result, some existing query optimization methodologies can have drawbacks when used in distributed parallel database systems, since database query performance is of relatively high importance.
According to one general aspect, a method for adaptively generating a query execution plan for a parallel database distributed among a cluster of data nodes includes receiving memory usage data from multiple data nodes including network devices, calculating a representative memory load corresponding to the data nodes based on the memory usage data, categorizing a memory mode corresponding to the data nodes based on the calculated representative memory load, calculating an available work memory corresponding to the data nodes based on the memory mode, and generating the query execution plan for the data nodes based on the available work memory. The memory usage data is based on monitored individual memory loads associated with the data nodes and the query execution plan is adapted to the currently available work memory.
According to another general aspect, a device for adaptively generating a query execution plan for a parallel database distributed among a cluster of data nodes includes an individual data node that includes an individual network device associated with the cluster configured to store at least a portion of data corresponding to the database and to receive a query execution plan segment, a memory load monitor associated with the individual data node and configured to monitor a memory load associated with the individual data node, and a local execution engine configured to execute the query execution plan segment.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
This disclosure describes a query plan optimization strategy for use in distributed relational database management systems in which query execution plans are adaptively determined based on current system memory availability. Instead of assuming a fixed work memory configuration, as in prior technologies, the methods and devices described in this disclosure monitor the system load and memory availability on the distributed data processing nodes associated with the database cluster on a current and ongoing basis.
In an embodiment, a coordinator node determines the global work memory configuration using memory usage data received from memory load monitors on each of the data nodes and generates a query plan that is optimized for the current aggregate work memory available on the data nodes. In an alternative embodiment, each data node determines the local work memory configuration depending on the current memory usage and availability monitored at that node and modifies or re-optimizes the query plan for the current local work memory available on the data node. In the former embodiment the query plan is tailored to the cluster of data nodes, and in the latter embodiment the query plan is tailored for each of the individual data nodes.
As illustrated in
The storage devices 22, 24, 26 at the data nodes 16, 18, 20 each store a partition, or multiple partitions, of a distributed database table. Together, the storage devices 22, 24, 26 contain the information data for the complete database table.
In operation, the coordinator nodes receive query requests from a client node, or client, 32. As an example, referring still to
As a result of the working environment, including factors such as data skew or input/output (I/O), the memory usage and availability at the various data nodes 16, 18, 20 sometimes is uneven. In an embodiment, each of the data nodes 16, 18, 20 monitors the memory usage at the individual data node 16, 18, 20 and sends memory usage data to each of the coordinators 12, 14. The coordinators 12, 14 use the memory usage data from all of the data nodes 16, 18, 20 to determine an aggregate work memory that represents the average amount of memory currently available on each of the data nodes 16, 18, 20 to be dedicated to locally executing the query plan on the data nodes 16, 18, 20. The coordinators 12, 14 optimize the query plans, or the query plan segments, for globally optimal execution performance on all the data nodes 16, 18, 20, and send the same query plan, or query plan segments, to all of the data nodes 16, 18, 20.
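The aggregation step described above can be sketched as follows. The per-node report shape (a dictionary carrying an `available_mb` field) is a hypothetical representation chosen for illustration; the disclosure does not fix a message format.

```python
def aggregate_work_memory(node_reports):
    """Average the memory currently available across all data nodes.

    node_reports: list of per-node usage reports, e.g.
    {"node": "dn1", "available_mb": 512}. This report shape is a
    hypothetical one assumed for this sketch.
    """
    total = sum(report["available_mb"] for report in node_reports)
    return total / len(node_reports)
```

A coordinator would invoke this on the most recent reports from the memory load monitors before planning a query.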
Similarly, in an alternative embodiment each of the data nodes 16, 18, 20 monitors the memory usage at the individual data node 16, 18, 20. However, each of the individual data nodes 16, 18, 20 determines a local work memory that indicates the amount of memory currently available on the individual data node 16, 18, 20 to be dedicated to locally executing the query plan on the data node 16, 18, 20. Each of the individual data nodes 16, 18, 20 further performs localized query planning to adapt query plan segments received from one of the coordinators 12, 14 for optimal execution performance on the individual data node 16, 18, 20.
These implementations provide advantages with respect to existing solutions, which typically do not take actual system load or memory usage and availability variation among the data nodes into consideration, but rather presume a fixed work memory. By determining a more accurate work memory instead of a predetermined value, the implementations described in this disclosure can generate a more efficient query plan dynamically tailored to the actual working environment of the data nodes, and thus improve the overall performance of the distributed parallel database system.
Referring to
The query compiler 42 is configured to parse the received query request and create a semantic tree that corresponds to the query request. The global memory load calculator 44 optionally calculates a global memory load that represents, for example, the average current memory load on the data nodes that form the cluster using memory usage data received from all the data nodes.
The global memory mode categorizer 46 optionally assigns a global category, or mode, that indicates the approximate level of current memory usage or availability among the data nodes that form the cluster. The global memory mode categorizer 46 in some implementations maps the current average memory load among the data nodes to one of three categories, for example, LIGHT, NORMAL and HEAVY, according to how heavy the current global memory load is throughout the system.
For example, the global memory mode categorizer 46 assigns the LIGHT mode when average memory usage among all the data nodes is below thirty percent (30%) of the total system memory capacity, assigns the NORMAL mode when average memory usage among all the data nodes is from thirty percent (30%) to seventy percent (70%) of the total system memory capacity, and assigns the HEAVY mode when average memory usage among all the data nodes is above seventy percent (70%) of the total system memory capacity.
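The categorization step using the example 30%/70% thresholds can be sketched as follows; the exact boundary handling (whether 30% itself falls in LIGHT or NORMAL) is an assumption, since the text does not specify it.

```python
def categorize_memory_mode(usage_ratio):
    """Map an average memory-usage ratio (0.0 to 1.0) to a mode,
    using the 30%/70% thresholds given as an example in the text.
    Boundary handling at exactly 0.30 and 0.70 is an assumption."""
    if usage_ratio < 0.30:
        return "LIGHT"
    if usage_ratio <= 0.70:
        return "NORMAL"
    return "HEAVY"
```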
Based on the currently assigned memory mode, the global work memory calculator 48 optionally calculates the current global work memory for use in optimizing the query plan. The current global work memory corresponds to the average memory space available on each of the data nodes that form the cluster. For example, the global work memory calculator 48 in some implementations uses a memory load factor corresponding to the current memory mode, or category, to compute the available global work memory according to the following formula:
work_memory = system_memory_for_query × memory_load_factor

where the memory load factor is a multiplier determined by the currently assigned memory mode, and system_memory_for_query denotes the portion of system memory allocated to query processing, in accordance with the definitions set forth herein.
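The work memory formula above can be sketched as follows. The concrete load-factor values per mode are hypothetical placeholders; the disclosure leaves the actual values to its accompanying definitions.

```python
# Hypothetical load factors per memory mode; the disclosure does not
# fix these values, so they are illustrative assumptions only.
MEMORY_LOAD_FACTOR = {"LIGHT": 1.0, "NORMAL": 0.5, "HEAVY": 0.25}

def compute_work_memory(system_memory_for_query, memory_mode):
    """work_memory = system_memory_for_query x memory_load_factor."""
    return system_memory_for_query * MEMORY_LOAD_FACTOR[memory_mode]
```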
As a result, when the memory mode is LIGHT, query plans are generated based on a larger work memory suitable for relatively memory-intensive operations such as building a hash table or performing a sort. This can be desirable because, even though query execution plans computed with a larger work memory will likely consume more memory resources, such plans generally execute with a faster response. Conversely, when the memory mode is HEAVY, query plans are computed based on a smaller work memory.
On the other hand, when the memory mode is NORMAL, queries are differentiated based on one or more features of the query. That is, a query with a higher probability of being relatively memory-intensive will be assigned a larger size work memory for query planning, and a query with a lower chance of being relatively memory-intensive will be assigned a smaller size work memory for query planning. Accordingly, the optimizer adaptively plans queries based on the current memory load situation, achieving dynamic memory utilization and better execution performance.
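The NORMAL-mode differentiation described above can be sketched as follows. The specific query features tested (presence of join, sort, or grouping keywords) and the 1.0/0.5 scaling are hypothetical heuristics, not values taken from the disclosure.

```python
def normal_mode_work_memory(base_work_memory, query_text):
    """Under the NORMAL mode, grant a larger work memory to queries
    that look memory-intensive. The keyword heuristic and the
    scaling factors here are illustrative assumptions."""
    text = query_text.lower()
    intensive = any(kw in text for kw in ("join", "order by", "group by"))
    return base_work_memory * (1.0 if intensive else 0.5)
```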
The global query planner 50 creates multiple alternative candidate query plans and determines the optimal plan using the calculated global work memory. The selected query plan generally results in improved query execution performance with respect to fixed work memory solutions, because the calculated global work memory more accurately reflects the system resources currently available on the distributed data nodes.
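A plan-selection step of this kind can be sketched as follows. The `(name, estimated_cost, memory_needed_mb)` tuple shape for a candidate plan is a hypothetical representation, as is the fallback behavior when no candidate fits.

```python
def choose_plan(candidates, work_memory_mb):
    """Pick the cheapest candidate plan that fits within the
    calculated work memory; if none fits, fall back to the cheapest
    overall (e.g. a plan that spills to disk). Candidate tuples are
    (name, estimated_cost, memory_needed_mb) -- an assumed shape."""
    feasible = [c for c in candidates if c[2] <= work_memory_mb]
    pool = feasible if feasible else candidates
    return min(pool, key=lambda c: c[1])[0]
```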
The global query planner 50 further divides the query plan into multiple segments to be forwarded to the data nodes, and then sends one or more of the optimized query plan segments to each of the data nodes to be locally executed on the data nodes. The global execution engine executes portions of the query plan segments on the coordinator node 40.
Referring to
The memory load monitor 62 monitors system memory usage and availability in the data node 60. In an implementation, the data node 60 periodically sends memory usage and availability information to all the coordinator nodes. As described above with regard to
In an alternative implementation, referring again to
For example, the local memory mode categorizer 64 assigns the LIGHT mode when memory usage on the data node 60 is below thirty percent (30%) of the data node 60 memory capacity, assigns the NORMAL mode when memory usage on the data node 60 is from thirty percent (30%) to seventy percent (70%) of the data node 60 memory capacity, and assigns the HEAVY mode when memory usage on the data node 60 is above seventy percent (70%) of the data node 60 memory capacity.
Based on the currently assigned local memory mode, the local work memory calculator 66 optionally calculates the current local work memory for use in adapting the plan segment to the current work environment at the data node 60. The current local work memory corresponds to the memory space available on the data node 60. For example, the local work memory calculator 66 in some implementations uses a memory load factor corresponding to the current memory mode, or category, to compute the available local work memory according to the following formula:
work_memory = system_memory_for_query × memory_load_factor

where the memory load factor is a multiplier determined by the currently assigned local memory mode, and system_memory_for_query denotes the portion of the data node 60 memory allocated to query processing, in accordance with the definitions set forth herein.
The local query planner 68 modifies or re-optimizes the query execution plan segment, or segments, using the calculated local work memory in order to adapt the plan segment, or segments, to the current local work environment. The modified or re-optimized query plan segment generally results in improved query execution performance with respect to fixed work memory solutions, because the calculated local work memory more accurately reflects the system resources currently available on the data node 60. In any embodiment, the local execution engine 70 executes the query execution plan segment, or segments, on the data node 60.
With regard to
Thus, in various embodiments, the functions of the system 10 are executed on any suitable processor, such as a server, a mainframe computer, a workstation, a PC, including, for example, a note pad or tablet, a PDA, a collection of networked servers or PCs, or the like. Additionally, as modified or improved versions of the system 10 are developed, for example, in order to revise or add a template or country-specific information, software associated with the processor is updated.
In various embodiments, the system 10 is coupled to a communication network, which can include any viable combination of devices and systems capable of linking computer-based systems, such as the Internet; an intranet or extranet; a local area network (LAN); a wide area network (WAN); a direct cable connection; a private network; a public network; an Ethernet-based system; a token ring; a value-added network; a telephony-based system, including, for example, T1 or E1 devices; an Asynchronous Transfer Mode (ATM) network; a wired system; a wireless system; an optical system; a combination of any number of distributed processing networks or systems or the like.
The system 10 is coupled to the communication network by way of the local data links 58, 56, which in various embodiments incorporate any combination of devices, as well as any associated software or firmware, configured to couple processor-based systems, such as modems, access points, network interface cards, serial buses, parallel buses, LAN or WAN interfaces, wireless or optical interfaces and the like, along with any associated transmission protocols, as desired or required by the design.
An embodiment of the present invention communicates information to the user and requests user input, for example, by way of an interactive, menu-driven, visual display-based user interface, or graphical user interface (GUI). The user interface is executed, for example, on a personal computer (PC) or terminal with a mouse and keyboard, with which the user interactively inputs information using direct manipulation of the GUI. Direct manipulation can include the use of a pointing device, such as a mouse or a stylus, to select from a variety of windows, icons and selectable fields, including selectable menus, drop-down menus, tabs, buttons, bullets, checkboxes, text boxes, and the like. Nevertheless, various embodiments of the invention incorporate any number of additional functional user interface schemes in place of this interface scheme, with or without the use of a mouse or buttons or keys, including for example, a trackball, a touch screen or a voice-activated system.
In an exemplary implementation of the system 10 of
As an example, memory load monitors associated with each of the data nodes 16, 18, 20 of
In an alternative implementation of the system 10 of
As an example, memory load monitors associated with each of the data nodes 16, 18, 20 of
Referring now to
In block 82, the received query is parsed, and in block 84 a semantic tree corresponding to the query is compiled. Multiple candidate query execution plans are created, in block 86, based on the semantic tree. Current memory usage or availability information regarding the individual data nodes is received in block 88, and in block 90 the current global memory load is calculated as described above using the received memory usage or availability data. In block 92, the memory mode is assigned to an appropriate category, as described above, corresponding to the current global memory load. The available global work memory is computed as described above, in block 94, and used in block 96 to optimize the query execution plan selected from among the candidate plans, as described above.
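The coordinator-side flow of blocks 88 through 96 can be sketched end to end as follows. The thresholds, load factors, and the `(name, cost, memory_needed_mb)` candidate-plan tuples are all illustrative assumptions carried over from the earlier examples, not values fixed by the disclosure.

```python
def plan_query(candidates, node_usage_ratios, system_memory_for_query):
    """Sketch of blocks 88-96: average the reported loads, categorize
    the memory mode, size the work memory, and select a plan.
    All numeric parameters are hypothetical."""
    avg = sum(node_usage_ratios) / len(node_usage_ratios)          # block 90
    mode = ("LIGHT" if avg < 0.30                                  # block 92
            else "NORMAL" if avg <= 0.70 else "HEAVY")
    factor = {"LIGHT": 1.0, "NORMAL": 0.5, "HEAVY": 0.25}[mode]
    work_memory = system_memory_for_query * factor                 # block 94
    feasible = [c for c in candidates if c[2] <= work_memory] or candidates
    return min(feasible, key=lambda c: c[1])[0]                    # block 96
```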
In block 98, the query execution plan is divided into multiple segments for distribution to the data nodes, and in block 100 the same query execution plan segment, or segments, is transmitted to all of the data nodes in the database cluster. Additionally, the compiled semantic tree is forwarded to the data nodes in block 102.
Referring now to
In block 114, the current memory usage or availability of an individual data node is monitored. Optionally, in block 116 memory usage or availability information periodically is sent, for example, to all coordinator nodes. In block 118, the local memory mode is optionally assigned to a category, as described above, corresponding to the current memory usage or availability.
The available local work memory is computed as described above, in block 120, and used in block 122 to modify or re-optimize the query execution plan segment, or segments, as described above. In block 124, the query plan segment, or segments, is executed on the data node.
In an exemplary implementation of the system 10 of
In an alternative implementation of the system 10 of
As an example, the following query request is received by one of the coordinator nodes 12, 14 of
select count(*) from lineitem,part where l_partkey=p_partkey group by l_partkey;
In response, the coordinator 12 generates the following query execution plan segment and sends the segment to the three data nodes 16, 18, 20 of
The first three lines of the query plan segment are executed on the coordinator 12, while the aggregation and join operations are executed on each of the data nodes 16, 18, 20 in accordance with the current local memory mode category assigned to each of the data nodes 16, 18, 20 in light of the current work environment at the corresponding individual data nodes 16, 18, 20. Thus, for example, if the local memory mode of data node 16 currently is assigned to the HEAVY category, the data node 16 re-optimizes the query plan to carry out a sort-based aggregation operation and a nested loop join operation. At the same time, if the local memory modes of the data node 18 and the data node 20 currently are assigned to the LIGHT and NORMAL categories, respectively, the data nodes 18, 20 each instead re-optimizes the query plan to carry out a hash aggregation operation and a hash join operation.
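The per-node operator choices in the example above can be sketched as a mapping from the local memory mode to the aggregation and join algorithms. The mapping mirrors the example given in the text; treating LIGHT and NORMAL identically here is an assumption for this sketch.

```python
# Operator choices per local memory mode, mirroring the example in
# the text: a HEAVY node falls back to sort-based aggregation and a
# nested loop join, while LIGHT and NORMAL nodes keep the hash-based
# operators. The identical LIGHT/NORMAL handling is an assumption.
OPERATORS_BY_MODE = {
    "HEAVY": ("sort_aggregate", "nested_loop_join"),
    "NORMAL": ("hash_aggregate", "hash_join"),
    "LIGHT": ("hash_aggregate", "hash_join"),
}

def reoptimize_segment(local_mode):
    """Re-optimize a received plan segment for the local memory mode."""
    aggregation_op, join_op = OPERATORS_BY_MODE[local_mode]
    return {"aggregation": aggregation_op, "join": join_op}
```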
Use of the adaptive query planning methodology described in this disclosure, which implements a dynamically calculated work memory configuration reflecting the current system load, results in improved query execution efficiency or performance with respect to solutions using fixed work memory configuration. By using the more accurate work memory configuration, rather than a predetermined, or fixed, value, the adaptive query planner can generate a modified or optimized query plan tailored to the current work environment at the data nodes, resulting in improved performance of the distributed parallel database system, reduced query response time, improved memory resource utilization and reduced data spilling.
Aspects of this disclosure are described herein with reference to flowchart illustrations or block diagrams, in which each block or any combination of blocks can be implemented by computer program instructions. The instructions are provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to effectuate a machine or article of manufacture, and when executed by the processor the instructions create means for implementing the functions, acts or events specified in each block or combination of blocks in the diagrams.
In this regard, each block in the flowchart or block diagrams corresponds to a module, segment, or portion of code that includes one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functionality associated with any block can occur out of the order noted in the figures. For example, two blocks shown in succession can, in fact, be executed substantially concurrently, or blocks can sometimes be executed in reverse order.
A person of ordinary skill in the art will appreciate that aspects of this disclosure can be embodied as a device, system, method or computer program product. Accordingly, aspects of this disclosure, generally referred to herein as circuits, modules, components or systems, can be embodied in hardware, in software (including firmware, resident software, micro-code, etc.), or in any combination of software and hardware, including computer program products embodied in a computer-readable medium having computer-readable program code embodied thereon.
In this respect, any combination of one or more computer readable media can be utilized, including, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of these. More specific examples of computer readable storage media would include the following non-exhaustive list: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a Flash memory, a portable compact disc read-only memory (CD-ROM), an optical storage device, network-attached storage (NAS), a storage area network (SAN), magnetic tape, or any suitable combination of these. In the context of this disclosure, a computer readable storage medium can include any tangible medium that is capable of containing or storing program instructions for use by or in connection with a data processing system, apparatus, or device.
Computer program code for carrying out operations regarding aspects of this disclosure can be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the “C,” FORTRAN, COBOL, Pascal, or the like. The program code can execute entirely on an individual personal computer, as a stand-alone software package, partly on a client computer and partly on a remote server computer, entirely on a remote server or computer, or on a cluster of distributed computer nodes. In general, a remote computer, server or cluster of distributed computer nodes can be connected to an individual (user) computer through any type of network, including a local area network (LAN), a wide area network (WAN), an Internet access point, or any combination of these.
It will be understood that various modifications can be made. For example, useful results still could be achieved if steps of the disclosed techniques were performed in a different order, and/or if components in the disclosed systems were combined in a different manner and/or replaced or supplemented by other components. Accordingly, other implementations are within the scope of the following claims.