Accurate workload characterization is useful in meeting quality of service (QoS) goals in large-scale database systems. Such systems may be used opaquely, as in a cloud setting, or more transparently, as in online transaction processing (OLTP) and decision support systems (DSS). Workload characterization typically means classifying a workload of database queries based on expected response times. Workload management tools typically derive timeout thresholds from this characterization; a query that does not complete within its threshold may be aborted. As such, inaccurate classification wastes resources and limits how many queries can be executed, causing queue buildup and resulting in a less robust computing system. In these computing environments, the ability to predict response times helps in managing the system. Workload management tools use workload characterization to prioritize queries and to allocate resources. Currently, these tools use costs, estimated by the optimizer, to classify queries. However, when cost models or cost calculations change, the estimated costs change, and a workload management tool's classification logic must change with them.
Certain examples are described in the following detailed description and in reference to the drawings.
Examples of the claimed subject matter classify database queries of a workload to help meet QoS goals in a large-scale database environment. An augmented set of query plan data items is generated by a modified database optimizer, and these additional data items are used to classify the queries. This differs from the current approach, which relies solely on the estimated cost of a query for classification. Machine learning techniques are also used, which represents an improvement over existing heuristic-based approaches.
SQL optimizers estimate the cost of a query using many variables, such as table statistics, selectivity of predicates, estimated number of input/outputs (I/Os), and estimated amount of memory used at runtime. Internally, SQL optimizers consider a variety of plans before choosing the plan with the lowest cost for execution. However, identifying the plan with the lowest cost is challenging, and the cost is merely an estimate. In some cases, the cost estimates may be incorrect, and the wrong plan may be selected. This may be due to a design limitation or, possibly, a defect in the optimizer engine code. Thus, because the estimated costs may be wrong, workload characterization may incorrectly classify database queries, wasting database resources.
A query plan for a given query is a tree of operators. Thus, features are collected at the query level as well as the operator level. Some query-level features include the complexity of the query, total estimated memory, total estimated number of rows read from disk, total estimated number of rows exchanged between processes, missing statistics, total estimated sequential input/outputs (IOs), and total estimated random IOs, although there are many more. At the operator level, it may be enough to identify a bad pattern. Bad patterns lead to high costs during execution. Thus, a predetermined set of anomalous operators may be identified. These anomalous operators increase the risk of long running times for the query 106. In one example implementation, anomalous operator examples include: a merge join with both children sorted; a hash join that does a cross-product join; a nested join without predicates on key columns of the scan on the right side; and a hash group-by inside a disk process where memory is limited or at a premium.
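As a rough illustration, the following sketch (not the actual optimizer code) flags such patterns in a query plan represented as a data frame of operator rows; the column names (op_type, left_child_sorted, and so on) are hypothetical stand-ins for the plan details described above.

```r
# Minimal sketch: flag anomalous operator patterns in a plan data frame.
# Column names are hypothetical illustrations, not the real optimizer output.
flag_anomalous_ops <- function(plan) {
  merge_join_bad  <- plan$op_type == "MERGE_JOIN" &
    plan$left_child_sorted & plan$right_child_sorted       # merge join with both children sorted
  hash_join_bad   <- plan$op_type == "HASH_JOIN" & plan$is_cross_product   # cross-product hash join
  nested_join_bad <- plan$op_type == "NESTED_JOIN" & !plan$has_key_predicate  # no key predicates on right side
  plan$anomalous  <- merge_join_bad | hash_join_bad | nested_join_bad
  plan
}

# Example plan fragment: a cross-product hash join and a nested join without key predicates.
plan <- data.frame(
  op_type            = c("HASH_JOIN", "NESTED_JOIN", "SCAN"),
  left_child_sorted  = c(FALSE, FALSE, FALSE),
  right_child_sorted = c(FALSE, FALSE, FALSE),
  is_cross_product   = c(TRUE, FALSE, FALSE),
  has_key_predicate  = c(TRUE, FALSE, TRUE)
)
sum(flag_anomalous_ops(plan)$anomalous)   # number of anomalous operators -> 2
```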
In an example implementation, about forty features were collected. The features were based on bad plan patterns observed while analyzing different workloads. The selection of features was informed by the domain knowledge of database engineers. The optimizer was modified to collect operator level details of the plan that could be used as the features for classification. For example, for a scan operation, a few exemplary output features include the estimated number of bytes fetched from disk, the estimated number of bytes projected after filtering, and the number of anomalous scans. Anomalous scans are marked by the database engineer based on experiences gained from solving customer escalations.
Another example collection of features relates to detecting skew in the data distribution of a predetermined set of SQL table columns. In a parallel plan, the skew may cause a high number of skewed rows to be partitioned to a single process, making that process a bottleneck and adversely affecting the execution time of the query. The skew in such a scenario may be due to the distribution of data among parallel executing processes during query execution. When such a possible skew is detected by the optimizer, the skew imbalance is input to the classification process as a feature. Additional factors include the total number of anomalous join operators, total estimated CPU usage, estimated number of tuples sent by all operators, and the estimated number of bytes sent by all operators.
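As a rough illustration, a simple skew-imbalance measure can be derived from the estimated row counts routed to each parallel process; the vector name below is hypothetical, and the ratio of the largest partition to the average is only one possible measure.

```r
# Minimal sketch of a skew-imbalance feature, assuming the optimizer exposes
# the estimated rows routed to each parallel process.
skew_imbalance <- function(est_rows_per_process) {
  max(est_rows_per_process) / mean(est_rows_per_process)
}

est_rows_per_process <- c(1000, 1100, 950, 9000)   # one process receives most of the rows
skew_imbalance(est_rows_per_process)               # ~2.99, indicating a possible bottleneck
```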
At block 204, feature extraction is performed by the feature extractor 108. Feature extraction is the transformation of the features collected by the optimizer 104 into a set of features for classifying the queries 106. The features collected by the optimizer 104 are input to the feature extractor 108. The feature extractor 108 extracts and processes the relevant information and outputs a feature vector for a predetermined set of operators. The feature vectors are created for operators such as the scan, group-by, and join operators. The feature extractor 108 also creates a query feature vector. The feature extraction process results in a richer set of attributes than the standard output provided by an optimizer. The output of feature extraction is input to the query classifier 112.
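As a rough illustration, the aggregation step can be sketched as follows, assuming the per-operator feature rows are available as a data frame; the column names and feature choices are hypothetical.

```r
# Minimal sketch: aggregate operator-level feature rows into one query-level feature vector.
extract_query_features <- function(op_features) {
  c(
    total_est_memory   = sum(op_features$est_memory),
    total_rows_scanned = sum(op_features$est_rows_scanned),
    num_anomalous_ops  = sum(op_features$anomalous),
    num_joins          = sum(op_features$op_type %in% c("HASH_JOIN", "MERGE_JOIN", "NESTED_JOIN"))
  )
}

# Stacking one such vector per query yields the feature matrix used by the classifier, e.g.:
# feature_matrix <- t(sapply(per_query_op_features, extract_query_features))
```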
At block 206, the classifier 112 classifies the queries 106 based on the extracted features. In one example, the classification is performed using principles of Machine Learning.
At block 304, a matrix of features is created for all queries, such as described with respect to the feature extractor 108.
At block 306, features with no variance are dropped from the matrix. The features without variance have no discernible impact on the queries' performance. At block 308, a machine learning method, such as principal component analysis (PCA), may be applied. Additionally, the number of dimensions to be used for a classifier, such as the classifier 112, may be determined.
At block 312, the labels are added to the training set. The labels identify the potential classes for the queries. Using these labels and the partitioned data, the classifier is trained. At block 314, the trained classifier is used to predict and report results for classifying queries.
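A minimal sketch of this pipeline is shown below, assuming a numeric feature matrix X (one row per query) and a label vector y; the variable names and the use of the e1071 package are illustrative choices, not the method itself.

```r
library(e1071)

train_query_classifier <- function(X, y, train_fraction = 0.8) {
  X   <- X[, apply(X, 2, var) > 0, drop = FALSE]    # block 306: drop features with no variance
  pca <- prcomp(X, center = TRUE, scale. = TRUE)    # block 308: PCA transformation
  Z   <- pca$x[, 1:min(8, ncol(pca$x)), drop = FALSE]

  idx   <- sample(nrow(Z), size = floor(train_fraction * nrow(Z)))   # split into training/testing sets
  model <- svm(x = Z[idx, , drop = FALSE], y = factor(y[idx]))       # block 312: train on labeled data
  preds <- predict(model, Z[-idx, , drop = FALSE])                   # block 314: predict held-out queries
  list(model = model, accuracy = mean(as.character(preds) == as.character(y[-idx])))
}
```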
In an example implementation using the example system 100, the following experiments were performed.
Using the R language, various classifiers were tested to classify the queries 106. The Random Forest classifier and the SVM classifier provided the best results. The collected features were transformed using the PCA and independent component analysis (ICA) transformations. Both transformations are examples of unsupervised learning, and were used to reduce variance or to discover components that are statistically independent. The transformed feature data was then fed to a supervised classifier, such as SVM or CART.
The results of three experiments are presented, using a workload of about 3,000 queries, varying in complexity from simple scans to 20-way joins. The elapsed times of these queries were collected, as well as the augmented optimizer output. In this example, the query level feature vector consists of about 40 features.
A classifier 112 was developed to classify queries 106 into two buckets: normal queries and anomalous queries. The randomly generated training set consisted of about 1,200 queries, and the testing set was randomly drawn from the remaining 1,800 queries. A small set was also allocated for cross-validation purposes, e.g., for tuning the training parameters in SVM. For the training set, a query was marked as anomalous if the elapsed time exceeded a predefined constant, or if the optimizer 104 marked the query 106 as very expensive but the actual elapsed time was low. Each row of the training set consists of the 40 features as well as the “Y” value, or label. The “Y” value is the value being classified across some number of buckets. The classifier predicts where the “Y” value is likely to be placed. To train the classifier, data is collected that includes the “Y” value label in addition to the features. All 40 input features were scaled, and the PCA algorithm was applied. Using this approach, the 40 features were reduced to eight dimensions explaining 85% of the variance in the data. Using the R language environment, the SVM and CART classifiers were invoked to develop prediction models. Input rows from the testing set were used to perform the classification using the two models. The results from the two models are shown in Table 1.
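The dimensionality choice can be sketched as follows, assuming a scaled numeric feature matrix X; the 85% threshold is the one used in the example above.

```r
# Minimal sketch: choose the number of principal components explaining ~85% of the variance.
pca           <- prcomp(X, center = TRUE, scale. = TRUE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_dims        <- which(var_explained >= 0.85)[1]   # e.g., eight dimensions for the example workload
```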
F-Score is a commonly used metric in the machine learning community; a perfect learner has an F-Score of 1. Accuracy is the sum of True Positives and True Negatives divided by the total population.
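For reference, with TP, FP, TN, and FN denoting true positives, false positives, true negatives, and false negatives, these metrics can be written as:

$$\text{Precision} = \frac{TP}{TP+FP}, \qquad \text{Recall} = \frac{TP}{TP+FN}, \qquad \text{F-Score} = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}, \qquad \text{Accuracy} = \frac{TP+TN}{TP+FP+TN+FN}$$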
Using the same workload, the queries were classified among three buckets: the first bucket (Class 1) has queries that are quite fast, where the elapsed time is less than 0.01 seconds; the third bucket (Class 3) consists of anomalous queries, as described above; and the second bucket (Class 2) consists of the rest of the workload. With a randomly drawn training data set of 1,300 queries, a Random Forest model was trained using the package “RandomForest” in the R language environment. With a randomly generated testing set from the remaining 1,700 queries, the model was tested. The preliminary results are shown in Table 2, in the form of an accuracy table, where the rows are the actual results and the columns are the predictions.
For example, the item in cell (1,1) shows that of all queries 106 that are in class 1 (actuals), 96% are classified as class 1 (predictions); the item in cell (2,3) shows that of all queries that are in class 2 (actuals), 8% are classified as class 3 (predictions). The values on the major diagonal are the correctly predicted numbers divided by the actuals. Values in other cells represent errors in the prediction.
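Such a row-normalized accuracy table can be produced as sketched below, using a small hypothetical set of actual and predicted class labels.

```r
# Minimal sketch: row-normalized confusion (accuracy) table.
actuals     <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
predictions <- c(1, 1, 2, 2, 2, 3, 3, 3, 2, 3)
confusion   <- table(actual = actuals, predicted = predictions)
round(prop.table(confusion, margin = 1), 2)   # rows are actuals; the diagonal is per-class accuracy
```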
In this experiment, ICA was used for the transformation, and the SVM and Random Forest packages were applied to the query features. ICA is another popular technique for transforming data into its main components. In addition to PCA, the input data was also tested by converting it into statistically independent components using the ICA transformation. The performance of the PCA transformation was compared with that of the ICA transformation. A cross-fold validation method was used by repeatedly training models on random subsets of the data as follows: the input data is divided into two random sets, a training set and a testing set. The training set is randomly chosen to be 80% of the data, and the remaining 20% is the testing data. The results are shown in Table 3.
These are averages over a 10-way run. They show that the Random Forest package works well for the example workload. However, the SVM used with ICA also shows promise.
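A minimal sketch of this repeated 80/20 evaluation with an ICA transformation follows, assuming a numeric feature matrix X and label vector y; the fastICA and randomForest packages are used as illustrative choices.

```r
library(fastICA)
library(randomForest)

ica_cv_accuracy <- function(X, y, n_comp = 8, runs = 10) {
  X <- X[, apply(X, 2, var) > 0, drop = FALSE]     # drop features with no variance
  S <- fastICA(scale(X), n.comp = n_comp)$S        # statistically independent components
  colnames(S) <- paste0("IC", seq_len(ncol(S)))
  accs <- replicate(runs, {
    idx <- sample(nrow(S), size = floor(0.8 * nrow(S)))       # 80% training, 20% testing
    rf  <- randomForest(x = S[idx, , drop = FALSE], y = factor(y[idx]))
    mean(predict(rf, S[-idx, , drop = FALSE]) == as.character(y[-idx]))
  })
  mean(accs)                                       # average over the 10-way run
}
```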
In these examples, classification is based on expected query execution times. However, one could use different criteria for classification. For example, the classification could be based on the type of SQL operator that uses the most resources during execution of the SQL query, i.e., the dominant operator. For example, the possible classifications may include the JOIN operator types, the GROUPBY operator, and the parallelization operator. Such a classification could be useful to a QA engineer in designing a coverage test, or running relevant tests prior to a release. By classifying queries in this way, a coverage test may test the various types of data accesses performed by a system, instead of testing every single query. If there are thousands of queries to test, but there is a time limitation, then the classification on the dominant SQL operator may be used to design a test suite that tests a minimal set of queries satisfying the time limitation.
Alternatively, the dominant operator, or operators, may be used as features in this, or another, classification scheme. For example, classification according to timeout thresholds may be based on the top two dominant operator types in a query.
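As a rough illustration of the test-suite use case, a minimal set of queries could be selected by keeping one representative query per dominant-operator class; the data frame and operator names below are hypothetical.

```r
# Minimal sketch: pick one representative query per dominant operator type for a coverage test.
queries <- data.frame(
  query_id    = 1:6,
  dominant_op = c("HASH_JOIN", "GROUPBY", "HASH_JOIN", "EXCHANGE", "GROUPBY", "NESTED_JOIN")
)
test_suite <- queries[!duplicated(queries$dominant_op), ]
test_suite   # one query per dominant operator type
```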
Each module 602, 604 may be electronic circuitry (i.e., hardware) that implements the functionality of the module. Each module may also include instructions (e.g., stored on a machine-readable storage medium of the system) that, when executed (e.g., by a processor of the system), provide the functionality of the module.
The example system 700 can include clusters of database servers 702 having one or more processors 704 connected through a bus 706 to a storage 708. The storage 708 is a tangible, computer-readable medium for the storage of operating software, data, and programs, such as a hard drive or system memory. The storage 708 may include, for example, a basic input output system (BIOS) (not shown).
In an example, the storage 708 includes a DBMS 710, which includes an optimizer 712. The storage 708 also includes a feature extractor 716, and a WMS tool 718, which includes a classifier 720. The server 702 can be connected through the bus 706 to a network interface card (NIC) 722. The NIC 722 can connect the database server 702 to a network 724 that connects the servers 702 of a cluster to various clients (not shown) that provide the queries. The network 724 may be a local area network (LAN), a wide area network (WAN), or another network configuration. The network 724 may include routers, switches, modems, or any other kind of interface devices used for interconnection. Further, the network 724 may include the Internet or a corporate network.
A predetermined plurality of features, generated by an optimizer, is reduced to a learned model of features by using a machine learning method. Classification is performed based on features of the query and features of operators executed by the query. An execution classification is assigned to a query based on the learned model of features. The execution classification is associated with a pre-defined threshold for execution of the query.
Number | Date | Country | Kind
---|---|---|---
1947/CHE/2015 | Apr 2015 | IN | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2015/042506 | 7/28/2015 | WO | 00