Databases are collections of data or information that may be organized according to various parameters. Each database may include one or more tables of data, each table including zero or more rows and one or more columns of data. Each row represents a record, and each column represents a field, or attribute.
For a more complete understanding of various examples, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
In various examples, histogram statistics for database tables may be updated in an efficient manner. In this regard, a base table of a database may be sampled to create a sample table which represents a random sampling of the base table. In one example, the base table may include a large number of rows (e.g., one billion), and the sample table may be created by sampling the base table at a sample rate (e.g., 1/1000). Thus, the sample table may be substantially smaller than the base table and require fewer resources (e.g., processor time) to compute histogram statistics. In this regard, in one example, statistics may be computed on the sample table and extrapolated, using statistical methods, to reflect approximate statistics of the base table. As the base table gets changed over time, the histogram statistics may potentially become more and more inaccurate and therefore should be updated occasionally. In various examples, a predicate that describes a superset of potentially changed (inserted, updated, or deleted) rows in the base table may be used in updating of the sample table. In one example, rows satisfying the predicate are deleted from the sample table, and a set of rows representing a random sampling of rows satisfying the predicate in the base table may be inserted into the sample table. Thus, the sample table may be updated to represent a random sampling of the base table without sampling the entire base table again. In various examples, the updated sample table may be used to generate updated histogram statistics.
In various examples, database systems may optimize query plans using histogram statistics. Various examples described herein may update histogram statistics in an efficient limner. Histogram statistics for a table may be determined by dividing the rows of a table into ranges of values, or intervals, for a field. In various examples, each histogram interval has approximately the same number of rows. For each histogram interval, a unique entry count (UEC) may be determined and maintained. Various other statistics for each histogram interval may also be maintained. The histogram statistics may be useful in providing various characteristics of the database or the database tables, and the statistics may be useful in optimizing query plans.
The histogram statistics may be updated regularly or at select times. For large databases or large tables, the updating may be highly resource intensive. Various examples described herein allow updating of histogram statistics in an efficient manner.
Referring first to
The example system 100 may further include a base table 120 which may include various rows and columns, for example. The columns may represent various fields, or attributes, of the base table 120 or a database. In various examples, the base table 120 may be part of a database 130 which may be stored on a memory device (not shown). The database 130 may include any number of tables, each of which may include various rows and columns. In various examples, the base table 120 may include a large number of rows. For example, in an enterprise environment, an example base table 120 may include rows numbering in the millions, hundreds of millions, billions, or more.
In various examples, the system 100 may include a sample table 140 and various statistics 150. In one example, the sample table 140 may be representative of a random sampling of the base table 120 and may include fewer rows than the base table 120. The number of rows in the sample table 140 may depend upon a sampling rate used to generate the sample table 140. In various examples, the sampling rate may be 1/1000, 1/100, or any other selected rate. The sampling rate may be selected to balance efficiency of updating histogram statistics e.g., reducing the size of the sample table) and accuracy of random representation.
The statistics 150 may include histogram statistics. In various examples, various other statistics may also be included. For example, the statistics 150 may include skew elements associated with the histogram statistics. In the example system 100 of
Referring now to
In accordance with various examples described herein, the complete base table 210 may be sampled only when initially creating the sample table 220. Thereafter, while the base table 210 may be updated due to, for example, addition, deletion, or changing of data in the base table (e.g., as illustrated in the example of
In various examples, the sample table may be updated when, for example, an incremental update statistics (IUS) algorithm may be executed. Various examples may execute the algorithm at varying frequencies which may depend on the purpose of the update or the type of database, for example. In one example, the database may include information related to sales transactions, and statistics may be updated on a daily basis.
In various examples, a predicate may be used to efficiently update the sample table and, as described below with reference to
In various examples, as illustrated in the example of
The set of rows 230 from the base table 210 satisfying the predicate may be sampled to obtain a set of rows 240 for insertion into an updated sample table 260. In various examples, the sampling of the set of rows 230 may be performed in accordance with the sampling used to initially generate the sample table 220. For example, a uniform sampling rate (e.g.,) 1/1000) may be used for sampling in both cases.
In various examples, the same predicate applied to the base table 210 to produce the set of rows 230 may be applied to the sample table 220. For example, a set of rows 250 may be generated corresponding to rows from the sampled table 220 which reflect a sales transaction in the past seven days. This set of rows 250 may be the set of rows that are to be deleted from the sample table 220 for updating of the sample table.
An updated sample table 260 may be generated by deleting the set of rows 250 from the sample table 220 satisfying the predicate and inserting the set of rows 240 obtained from sampling of the set of rows 230 satisfying the predicate applied to the base table 210. In various examples, the updated sample table 260 may replace the sample table 220 for the subsequent update without the need for again sampling the entire base table 210.
In various examples, each of the tables and sets of rows 210-260 may be retained in a persistent memory. For example, the set of rows for inserting 240 and the set of rows for deletion 250 may be retained for statistical purposes.
Referring now to
Upon an indication of an execution of an incremental update statistics (IUS) operation (block 312), a counter (i) may be incremented (block 314). The counter (i) may be initially set to 0 and may be used to track historical statistics, for example.
In various examples, rows in the sample table (Si-1) which satisfy a predicate (pi) (e.g., the set of rows 250 in
In various examples, rows from the base table satisfying the predicate (e.g., the set of rows 230 in
The sampled rows from the base table satisfying the predicate may be added to the sample table to create an updated sample table (block 324). Thus, an updated sample table may be created.
In various examples, the rows of the sample table may be divided into updated histogram intervals (block 326). In this regard, as noted above, the histogram statistics for the sample table may be determined by dividing the rows of the sample table into ranges of values, or intervals, for a field. In various examples, each histogram interval has approximately the same number of rows. For each histogram interval, a unique entry count (UEC) may be determined and maintained (block 328). In various examples, the histogram intervals may be maintained at the same boundaries, and the rows in the sample table may be reapportioned to the existing histogram intervals.
In various examples, counting Bloom filters may be used to accelerate the determination of UECs. Bloom filters may be used for tracking data set membership. Counting Bloom filters (CBFs) are a type of Bloom filter which may also be used to remove a data set from the Bloom filter. In this regard, in various embodiments, CBFs may be used to maintain the frequency information for each histogram interval.
An example use of CBFs is described here with reference to
At block 328, in various examples, the updating of the UECs may be accelerated using the CBFs. In one example, the CBFs may be used to estimate frequencies of frequencies, which may be a measure of the occurrence of various frequencies. In this regard, a frequency of a value may be the number of times it occurs in the sample table, for example. In various examples, frequencies of frequencies may be used to estimate the UECs. In this regard, a frequency of frequencies may be the number of values that have a particular frequency.
In various examples, following the updating of UECs (e.g., block 328 of
Thus, various examples allow updating of histogram statistics without resource-intensive analysis of a large base table. In various examples, use of a predicate to update a sample table that is maintained in a persistent memory allows efficient updating of the sample table and of histogram statistics. Further, in various examples, CBFs may be used to accelerate computation of UECs in the histogram statistics.
Various examples described herein are described in the general context of method steps or processes, which may be implemented in one example by a software program product or component, embodied in a machine-readable medium, including executable instructions, such as program code, executed by entities in networked environments. Generally, program modules may include routines, programs, objects, components, data structures, etc. which perform particular tasks or implement particular abstract data types. Executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
Software implementations of various examples can be accomplished with standard programming techniques with rule-based logic and other logic to accomplish various database searching steps or processes, correlation steps or processes, comparison steps or processes and decision steps or processes.
The foregoing description of various examples has been presented for purposes of illustration and description. The foregoing description is not intended to be exhaustive or limiting to the examples disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of various examples. The examples discussed herein were chosen and described in order to explain the principles and the nature of various examples of the present disclosure and its practical application to enable one skilled in the art to utilize the present disclosure in various examples and with various modifications as are suited to the particular use contemplated. The features of the examples described herein may be combined in all possible combinations of methods, apparatus, modules, systems, and computer program products.
It is also noted herein that while the above describes examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope as defined in the appended claims.