1. Technical Field
The field is generally related to data processing. More specifically, the field is related to methods, apparatus, and products for executing database rollup queries.
2. Description of Related Art
The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely complicated devices. Today's computers are much more sophisticated than early systems such as the EDVAC. Computer systems typically include a combination of hardware and software components, application programs, operating systems, processors, buses, memory, input/output devices, and so on. As advances in semiconductor processing and computer architecture push the performance of the computer higher and higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.
Information stored on a computer system is often organized in a structure called a database. A database is a grouping of related structures called ‘tables,’ which in turn are organized in rows of individual data elements. The rows are often referred to as ‘records,’ and the individual data elements are referred to as ‘fields.’ In this specification generally, therefore, an aggregate of fields is referred to as a ‘data structure’ or a ‘record,’ and an aggregate of records is referred to as a ‘table.’ An aggregate of related tables is called a ‘database.’
A computer system typically operates according to computer program instructions in computer programs. A computer program that supports access to information in a database is typically called a relational database management system or an ‘RDBMS.’ An RDBMS is responsible for helping other computer programs access, manipulate, and save information in a database.
An RDBMS typically supports access and management tools to aid users, developers, and other programs in accessing information in a database. One such tool is the structured query language, ‘SQL.’ SQL is query language for requesting information from a database. Although there is a standard of the American National Standards Institute (‘ANSI’) for SQL, as a practical matter, most versions of SQL tend to include many extensions. Here is an example of a database query expressed in SQL:
This SQL query accesses information in a database by selecting records from two tables of the database, one table named ‘stores’ and another table named ‘transactions.’ The records selected are those having value “Minnesota” in their store location fields and transactions for the stores in Minnesota. In retrieving the data for this SQL query, an SQL engine will first retrieve records from the stores table and then retrieve records from the transaction table. Records that satisfy the query requirements then are merged in a ‘join.’
RDBMS software typically has the capability of analyzing data based on particular columns of a table. For example, rows can be grouped according to columns defined in a GROUP BY clause of a query. The column names in a SELECT clause are either a grouping column or a column function. Column functions return a result for each group defined by the GROUP BY clause.
A grouping query can include a standard WHERE clause that eliminates non-qualifying rows before the groups are formed and the column functions are computed. A HAVING clause eliminates non-qualifying rows after the groups are formed; it can contain one or more predicates connected by ANDs and ORs, wherein each predicate compares a property of the group (such as AVG(SALARY)) with either another property of the group or a constant.
The GROUPING SET operator extends the GROUP BY operation to simultaneously specify the computation of multiple GROUP BYs in a single GROUP BY operation. When the GROUPING SET operator is used, a NULL value in a non-null grouping column denotes that the particular column is collapsed in the aggregate. If a grouping column (c) is nullable, a GROUPING operator (GROUPING(c)) is required to distinguish between the NULL group and a column collapsed in the aggregate. Used in conjunction with GROUPING SETS, the GROUPING operator returns a value which indicates whether or not a row returned in a GROUP BY answer set is a row generated by a GROUPING SET that excludes the column represented by the expression. The argument can be of any type, but must be an item of a GROUP BY clause. The result of the function is set to one of the following values: 1—The value of expression in the returned row is a null value, and the row was generated by a super-group. That is, the argument is collapsed in the aggregate. 0—The value of the expression in the returned row represents a non-system generated value of the group (which may be null and indicates that the argument is not collapsed in the aggregate.
ROLLUP operations can also be specified in the GROUP BY clause of a query. ROLLUP operations are shorthand for GROUPING SETS that represent common sets of GROUP BY operations that are required for common queries for online analytical processing (OLAP). ROLLUP grouping produces a result set containing the regular grouped rows and sub-total rows. For example, ROLLUP can provide the sales by person by month with monthly sales totals and an overall total. It should be noted that where the term “column” is used, it really could be any expression. In one example, it is common to use ROLLUP as follows: a table has a column “date” and the ROLLUP clause is GROUP BY ROLLUP (year (date), date).
May business intelligence applications involve a hierarchical, multi-dimensional aggregate view of the data. The simplest form of this is the current “group by” support which aggregates data along one dimension: e.g. state, county, city, in the following example:
select state, county, city, sales(*) from Table group by state, county, city.
In a database having columns for state, county, city, and sales, users often analyze the data in multiple ways such as further aggregate on the state and city (group by state, city) or aggregate on the overall total (no group by clause, whole file aggregate). In addition, the user may like to aggregate from a different perspective e.g. (group by county, city). With only the group by clause available, all these disparate pairings require different succinct queries. That's where the grouping sets and super groups SQL syntax comes in; grouping sets and super groups allow a user to aggregate in multiple ways in one query: This often requires performing multiple aggregates.
select state, county, city, sales(*) from Table group by grouping Sets((rollup(state, county, city)), (state, county), (county)).
What would be desirable is a method for performing the multiple aggregates in a more efficient manner.
Methods, systems, and computer program products are provided for executing database rollup queries. Embodiments typically include iterating through a database table which has been grouped and ordered on at least two different columns. This grouping may be done through use of SQL GROUP BY and ORDER BY statements. Some embodiments of the invention provide an efficient implementation of SQL GROUP BY ROLLUP statements. Data is sorted over all columns (expressions) in the ROLLUP clause. E.g. if there are three columns in the ROLLUP clause, data is sorted over all three columns.
Various aggregate functions can be performed on the date. One aggregate function is the summation function or sum, used in the present application for illustration purposes. In some embodiments, the GROUP BY ROLLUP construct can be executed while only requiring an additional one storage location per ordered column per each aggregate function to be stored for each database row. The higher level column aggregate functions can be executed without relying on accessing any lower level column aggregate results in some embodiments. In other embodiments, the higher level column aggregate functions can be executed while relying on no more than one lower level column aggregate function result. A suitably grouped and ordered database table can have a multiple level hierarchical ROLLUP function executed in a single pass without having to retrieve lower level column aggregate results.
The query described in the background section defines 4 grouping sets (state, county, city), (state, county), (state), and ( ). Other prior art DB Engines perform the 4 aggregates in 4 separate legs, using up to 4 temporary result sets. Some embodiments of the present invention provide the ability for the RDMS to perform the four aggregates hierarchically using at most 1 temporary result set. One example utilizes a MultiAggregate node which takes as input, data from the table sorted by state, county, city. All four grouping sets with aggregate can be computed by stacking each individual aggregate where one level may use the result sets from the lower level. The MultiAggregate node can use the results from each grouping stage to aggregate the next grouping levels.
One embodiment of the invention provides a method for performing a rollup function on a database table, wherein the database table is grouped by and ordered on at least a first level column and a second level column, where as between the first level column and second level column the database table is grouped by and ordered primarily on the second level column and secondarily on the first level column. The database table also has a quantity column upon which an aggregate function can be performed to produce an aggregate result. This embodiment method includes generating first level aggregate results for at least one first level column value, generated responsive to a change in the first level column values between successive rows. The method also includes generating second level aggregate results for at least one second level column value, generated responsive to a change in the second level column values between successive rows, wherein generating each second level aggregate result does not require more than one first level aggregate result. In some methods, generating each second level aggregate result does not require accessing any first level aggregate result.
Some embodiment methods also include outputting the first level aggregate results; and outputting the second level aggregate results. Methods may also include generating an overall aggregate result responsive to the data ending, in which the overall aggregate result includes an aggregate for all rows in the database table.
In some embodiments, the database table is further grouped by and ordered on at least a third level column, where as between the second level column and the third level column the database table is grouped by and ordered primarily on the third level column and secondarily on the second level column. The embodiment method can further include generating third level aggregate results for at least one third level column value, generated responsive to a change in the third level column values between successive rows, wherein generating each third level aggregate result does not require accessing more than one second level aggregate result. In some embodiments methods, generating each third level aggregate result does not require accessing any second level aggregate result.
In some embodiments of the invention, an aggregate store exists for each level, and the aggregate store for each level is refreshed responsive to accessing a new row having a quantity value. The aggregate store for each level is refreshed from an immediately lower level aggregate store prior to the lower level aggregate store resetting in some methods. Methods according to the present invention may include iterating through a previously grouped and sorted database table and/or a previously indexed database table.
The present invention also provides systems for processing database queries, where such a system can include a computer processor and a computer memory operatively coupled to the computer processor. The computer memory can have disposed within it computer program instructions capable of performing the various methods according to the present invention.
Computer program products for processing database queries are also provided by the present invention. Such a computer program product can be disposed in a computer readable signal bearing medium, the computer program product comprising computer program instructions capable of performing the various methods according to the present invention.
The foregoing and other benefits, features and aspects of the invention will be apparent from the following more particular descriptions of exemplary embodiments of the invention as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts of exemplary embodiments of the invention.
Operators of the client systems 102 can use a standard operator interface 108, such as IMS/DB/DC CICS, TSO, OS/2 or other similar interface, to transmit electrical signals to and from the server system 100 that represent commands for performing various search and retrieval functions, termed queries, against the databases. In some embodiments of the present invention, these queries may conform to the Structured Query Language (SQL) standard, and invoke functions performed by Relational DataBase Management System (RDBMS) software. In one embodiment of the present invention, the RDBMS software comprises the DB2 product offered by IBM for the MVS, UNIX, WINDOWS or OS/2 operating systems. Those skilled in the art will recognize, however, that the present invention has application to any RDBMS software.
As illustrated in the example of
At the heart of the DB2 architecture is the Database Services module 114. The Database Services module 114 contains several submodules, including the Relational Database System (RDS) 116, the Data Manager 118, and the Buffer Manager 120, as well as other elements such as an SQL compiler/interpreter. These submodules support the functions of the SQL language, i.e., definition, access control, retrieval, and update of user and system data.
Generally, each of the components, modules, and submodules of the RDBMS software comprise instructions and/or data, and are embodied in or retrievable from a computer-readable device, medium, or carrier, e.g., a memory, a data storage device, a remote device coupled to the server computer 100 by a data communications device, etc. Moreover, these instructions and/or data, when read, executed, and/or interpreted by the server computer 100, cause the server computer 100 to perform the steps necessary to implement and/or use the present invention.
Thus, the present invention may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture”, or alternatively, “computer program product”, as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, signal, or media.
Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope of the present invention. Specifically, those skilled in the art will recognize that any combination of the above components, or any number of different components, including computer programs, peripherals, and other devices, may be used to implement the present invention, so long as similar functions are performed thereby.
In
In some embodiments of the invention, the table of
In the GROUP BY ROLLUP function which is described in more detail below, the rows may be viewed as being iterated through or indexed through. Within this looping through the rows, the levels 1, 2, and 3 may also be viewed as being indexed through as the aggregate function is being calculated on the data base table data. In most embodiments, the row will be increased continually in an outer loop as the method progresses through the table. In some embodiments, the current column level being iterated can increase from one to two to three and then back again to one in an inner loop. This can happen repeatedly throughout the table analysis.
Some embodiments of the present invention depend upon the table being GROUPED BY and SORTED according to the levels to be operated upon by the GROUP BY ROLLUP function. This grouping and ordering can be the result of the temporary output generated by other SQL commands or can reflect the actual storage state of the database. As previous discussed, this can also be accomplished using indices or other sorting methods well known to those skilled in the database art.
In some embodiments of the present invention, the current level being operated on is initialized to the lowest level, for example, 1 which corresponds here to the city column. The value of the aggregate functions, for example, the sums, can be initialized to zero for each level and, in some embodiments, also to zero for a super aggregate function value. This super aggregate function value can be, for example, the total of all the quantity values present in the quantity column for the entire table. In this example of the invention, there would be a super level containing the sum of all the sales in the entire table.
The table
At this point in the algorithm, in some embodiments, the algorithm looks ahead (at line 5 in
As the value for level 1, the city, will not be the same in the next row, then the method can generate and output a row having the sum for the current level shown. In this example, the aggregate function value or the sum value for the city column is the same as that for Edina, which is 100,000, as there was only one city before the next city would change. In addition, in some embodiments of the invention, rows generated and output can also include a null value for all columns at and below the level of the current level column. As the current level column here is 1, there are no such lower-level columns and only the current level column, the level 1 column for cities is a null. In
In line 6 of the
Row 2 is now under consideration. The aggregate value for city has been previously reset to zero, and now has the value of 150,000 added to it for the city of Minneapolis. Looking ahead (in line 4) to the next row's column value for level 1, the city level, we see there will be no change, therefore the row is incremented and the 300,000 quantity is added to the city level 1 aggregate value. With row 4 being inspected, the city value will indeed change for level 1, and therefore the logic at line 5 can be executed, outputting the current city level aggregate of 450,000, along with a null indicator for the city column. The city level aggregate can then be zeroed at line 6. The current level under consideration can be incremented at line 7 of the pseudo code, forcing the logic to also peek ahead to the county value for row 4, which is also going to change. Therefore the output function at line 5 is executed, out putting a second aggregate row labeled as row 2-1 in
In evaluating row 4, the 440,000 value for Rochester is added to the city, county, and state aggregate values. In looking ahead to row 5, the city in level 1 will change, forcing the city aggregate of 440,000 to be output along with a null indicator for the city as previously described. As the county will change from Olmsted to Ramsey, this will also force a county totalization which is also 440,000. A null indicator is output for both county and city columns, as previously described. With row 5 being evaluated, there is one and only one value for the city of Shoreview, forcing a city totalization to output at
Row 7 can bow be processed, followed by row 8, generating a city aggregate output row at 1--6 for Auburn, and a county output for Cayuga County at 2-4. Rows 9 and 10 can be processed, generating a total for the city of Buffalo at row 1--7. Processing continues, adding a city total for Cheektowaga at row 1--8 and a total for the city of Tonawanda at row 1--9. A total for Erie County may be seen at row 2--5. City and county totals for Skaneateles (city) and Onondaga County are generated at row 1--10 and row 2-6, respectively. After generating county and city totals for Wayne County and the city of Red Creek at rows 1--1 and 2--7, a look ahead to row 17 shows that the state will change as well. This generates a row output at row 3--2 for the state of New York as well.
The processing for the state of Wisconsin continues for rows 17 through 22, forcing a city output for Milwaukee at row 1--16, a county output for the County of Milwaukee at row 2-9, and a state output for the state of Wisconsin at row 3-3.
At this point in the processing, all the rows in the table have been processed, reaching execution of the pseudo-code line 13. This line can simply output the aggregate values for the super aggregate which may be considered as a level 4 aggregate, together with the null indicators for state, county, and city, as seen in row 4-1. This super aggregate can be added to every time a new row is processed, or at an appropriate time in the logic before the state aggregate is reset to zero, which would happen at the point of a change in the state column value in some embodiments.
The aggregate values, here the sum values, have been intermixed with the table of
Inspection of
In various embodiments of the present invention, various algorithms may be used. In one embodiment, the higher level aggregate storage values are incremented or added to only when the lower level aggregates are going to be set to zero. In some embodiments, as one previously described, the higher level aggregates can be added to or changed when obtaining a new quantity column value, retrieved for every row having a non-null value. Various methods can be used to indicate that the output row is a level 1 aggregate output, or a level 2 or a level 3 aggregate output. Some embodiments of the invention, as illustrated with respect to
Other embodiments of the invention utilize methods which obtained a row column value and then look back rather than look ahead to detect a change in column value. Some embodiments operate on input tables which are grouped by and sorted for the rollup levels, and also have extra, extraneous data within. In the example of
The summation function has been used as an example with respect to