The invention itself, as well as a preferred mode of use, further objects, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The present invention provides a method, system, and computer program product for searching for information within a database by utilizing an index lead key self-join operation to increase the efficiency, in both speed and space, of the database search.
With reference now to
Computer 102 is able to communicate with server 150 via network 128 using network interface 130, which is coupled to system bus 106. Network 128 may be an external network such as the Internet, or an internal network such as an Ethernet or a Virtual Private Network (VPN). Using network 128, computer 102 is able to access server 150.
Hard drive interface 132 is also coupled to system bus 106. Hard drive interface 132 interfaces with hard drive 134. In a preferred embodiment, hard drive 134 populates system memory 136, which is also coupled to system bus 106. System memory is defined as a lowest level of volatile memory in computer 102. This volatile memory may include additional higher levels of volatile memory (not shown), including, but not limited to, cache memory, registers, and buffers. Data that populates system memory 136 includes operating system (OS) 138 and application programs 144.
OS 138 includes shell 140, for providing transparent user access to resources such as application programs 144, and kernel 142, which includes lower levels of functionality for OS 138, including providing essential services required by other parts of OS 138 and application programs 144, including memory management, process and task management, disk management, and mouse and keyboard management.
In one embodiment, application programs 144 in system memory 136 include database management system 146. As described further below, database management system 146 may be utilized to implement the process depicted in
The hardware elements depicted in computer 102 are not intended to be exhaustive, but rather are representative to highlight certain components that mat be utilized to practice the present invention. For instance, computer 102 may include alternate memory storage devices such as magnetic cassettes, Digital Versatile Disks (DVDs), Bernoulli cartridges, and the like. These and other variations are intended to be within the spirit and scope of the present invention.
Within the descriptions of the figures, similar elements are provided similar names and reference numerals as those of the previous figure(s). Where a later figure utilizes the element in a different context or with different functionality, the element is provided a different leading numeral representative of the figure number (e.g., 1xx for FIG. 1 and 2xx for
Conventional index search involves a starting position and a stopping position. The starting position and stopping position are identified via index positioning with a start key value and a stop key value, derived from query predicates on the index key columns. An index search starts at the starting position and reads through all items in the index until the stopping position is reached. Therefore the range between the starting position and the stopping position, when compared with the entire range of the index, is an indication of the efficiency of the index search. An index search where only a small portion of the index needs to be accessed is more efficient than an index search where a large portion of the index must be accessed. The efficiency of an index search is determined by the selectivity of the range predicates on the index key column. A composite index has multiple index key columns. In the case of a composite index, the efficiency of an index search is largely determined by the selectivity of the range predicates on the leading key column of the index. For example, with an index defined on columns (c1, c2), and range predicates on both columns “c1” and “c2”, the efficiency of the index search is largely determined by the selectivity of the range predicates on column “c1”. However, in cases where column “c1” has an equality predicate in the form of “c1=constant”, then column “c1” is considered to be “bound” to a constant value, and the efficiency of the index search is then determined by the range predicates on the next index key column, “c2”.
When there are range predicates on both column “c1” and column “c2”, the efficiency of an index search is largely determined by the selectivity of the predicates on column “c1”. If a column contains predicates that have many different unique values, the column is defined as having good selectivity. Columns that contain predicates that have fewer unique values, with respect to other columns, are defined as having poor selectivity. If the predicates on column “c1” have poor selectivity, but the predicates on column “c2” have good selectivity, the efficiency of the index search is poor due to poor selectivity of predicates on column “c1”. Since a user of the index search can only take advantage of the (good) selectivity of column “c2” when column “c1” is bound to a constant value, a method and system are need to convert the predicates on column “c1” into equality predicates.
With reference now to
With reference now to
After designation of leading index key columns at block 310, an index scan is used at block 315 to retrieve all unique values for the designated leading index key columns. The set of leading index key columns should have a limited number of unique values, which implies each unique value of the leading index key columns represents a large number of duplicating index items having the same value of leading index key columns. Therefore, to retrieve each unique value of leading index key columns, it is beneficial to reposition the index scan instead of sequentially traversing the index items and discard duplicate items having the same value of leading index key columns. The repositioning of the index scan includes an index positioning operation using the current value of leading index key columns, and requesting the next value beyond the current value.
At block 320, the process enters a loop, in which each unique value of the index lead key columns is used to drive the loop. For each unique value of the index lead key columns, an index search is performed at block 325, utilizing self-join predicates on the index lead key columns and predicates on the non-lead-key columns of the index. After the index search operation at block 325, a determination is made at block 330 to check whether any additional unique values for the index lead key columns exist. A positive response to the determination at block 330 results in a return to block 320, where the next unique value of index lead key columns is selected. A negative response to the determination at block 330 results in the termination of the process at block 335.
With reference to
In accordance with an embodiment of the invention, a query is entered for sales of item A on Mondays through Wednesdays. Database management system 146 first determines whether index lead key self-join is beneficial for this query. Since column “C1” has only 3 unique values (MONDAY, TUESDAY, and WEDNESDAY) that satisfy the query predicates, and predicates on column “C2” have good selectivity (1/26), index lead key self-join is beneficial for this query. Therefore an index lead key self-join method will be used, with column “C1” as the index lead key column.
The first unique value of column “C1” that satisfies the query predicates is MONDAY (C1=MONDAY), thus an index search with predicates “C1=MONDAY and C2=A” is performed to retrieve the first batch of results, with “C1=MONDAY” as the self-join predicate. Subsequently the next unique value for index lead key column “C1” (TUESDAY) is used in an index search with predicates “C1=TUESDAY and C2=A” to retrieve the next batch of results. The last unique value of column “C1” that satisfies the query predicates is WEDNESDAY, which is used in an index search with predicates “C1=WEDNESDAY and C2=A” to retrieve the last batch of results of the query. Consequently, the index lead key self-join effectively transforms the original query (C1 between MONDAY and WEDNESDAY, and C2=A) into its equivalent form (C1=MONDAY and C2=A UNION C1=TUESDAY and C2=A UNION C1=WEDNESDAY and C2=A) 415. The index lead key self-join method thus avoids accessing regions of the index that do not satisfy the query predicates, and thereby avoids utilizing additional memory and/or time to search through items in the index that do not satisfy the query predicates.
Although the above example includes a single column (column C1) as an index lead key column, in an alternate embodiment multiple index key columns may be utilized as index lead key columns.
It is understood that the use herein of specific names are for example only and not meant to imply any limitations on the invention. The invention may thus be implemented with different nomenclature/terminology utilized to describe the above devices/utility, etc., without limitation.
The present invention thus presents a method, system, and computer-readable medium for searching for information within a database by identifying a set of leading index key columns, which has a limited number of unique values, and utilizing the set of leading index key columns to perform an index lead key self-join operation. The index lead key self-join operation may be performed such that the desired range of the database search may be logically converted into a plurality of smaller ranges within the index combined via a plurality of UNION operations. The index lead key self-join operation thereby avoids accessing the ranges of the index that are not relevant to the query, and only access a plurality of smaller ranges that satisfy the query predicates, thus improving query performance and avoiding excess memory/time requirements.
Note that while an illustrative embodiment of the present invention has been, and will continue to be, described in the context of a fully functional computer system with installed software, those skilled in the art will appreciate that the software aspects of an illustrative embodiment of the present invention are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the present invention applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of signal bearing media include recordable type media such as thumb drives, floppy disks, hard drives, CD ROMs, DVDs, and transmission type media such as digital and analogue communication links.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.