The present invention generally relates to the field relational databases and, more specifically, to systems and methods pertaining to an “SQL Algebrizer” (or, more simply, and “Algebrizer”) for transforming SQL syntax tree representations (“SQL Tree”) of relational database queries (“SQL Text”) into relational algebra representations (“QP Algebra”).
Relational algebra is a formal mathematical notation that allows for expressing requests to relational databases in a strict and unambiguous way. Certain SQL server components (such as a query optimizer, for example) require the relational database requests be expressed in relational algebra and, more specifically, in QP Algebra (defined later herein).
Queries are made to a database in the form of SQL Text which, when parsed, is converted into a SQL Tree that must be transformed into QP Algebra for processing by as SQL Query Processor (“QP”). As known and appreciated by those of skill in the art, the task of transforming a SQL Tree to QP Algebra is performed by an algebrizer.
In general, an algebrizer determines if a relational database query—SQL Text that has been parsed/converted into a SQL Tree—is semantically correct and if so, transforms the SQL Tree into QP Algebra (a specific form of relational algebra understandable to a QP). One approach is for an algebrizer to process a SQL Tree recursively in a depth-first fashion by making one pass per “algebrizing” operation. However, while an algebrizer typically performs more than one distinct operation to “algebrize” a syntax tree representation of a relational database query into a relational algrebra representation, a typical algebrizer does not typically do any constant folding, which is an operation that is usually performed by the QP (and discussed in more detail later herein).
What is needed is an algebrizer that not only processes a SQL Tree using a reduced number of passes, but also one that performs constant folding so that it is no longer necessary for the QP to perform this task.
The SQL Algebrizer of the present invention comprises a plurality of algorithms to “algebrize” SQL Trees to QP Algebra using an approach that is more consistent and much more efficient than typical algebrizers. More specifically, the Algebrizer of the present invention processes a SQL Tree using a reduced number of recursive depth-first passes by performing multiple operations in a single pass. Furthermore, the Algebrizer of the present invention also performs the operation of constant folding in this single, multi-operation pass so that the QP, upon receiving the QP Algebra, needs not perform this operation at all.
More specifically, the following steps are performed in a single pass through the SQL Tree for various embodiments of the present invention:
The foregoing summary, as well as the following detailed description of preferred embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there is shown in the drawings exemplary constructions of the invention; however, the invention is not limited to the specific methods and instrumentalities disclosed. In the drawings:
The inventive subject matter is described with specificity to meet statutory requirements. However, the description itself is not intended to limit the scope of this patent. Rather, the inventor has contemplated that the claimed subject matter might also be embodied in other ways, to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the term “step” may be used herein to connote different elements of methods employed, the term should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
Computer Environment
Numerous embodiments of the present invention may execute on a computer.
As shown in
A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37 and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite disk, scanner or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor 47, personal computers typically include other peripheral output devices (not shown), such as speakers and printers. The exemplary system of
The personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in
When used in a LAN networking environment, the personal computer 20 is connected to the LAN 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used. Moreover, while it is envisioned that numerous embodiments of the present invention are particularly well-suited for computerized systems, nothing in this document is intended to limit the invention to such embodiments.
Relational Databases
Many modern database systems, and specifically those based on the relational model, store data in the form of tables. A table is a collection of data organized into rows and columns.
Relational algebra is a formal mathematical notation that allows for expressing requests to relational databases in a strict and unambiguous way. Certain SQL server components (such as a query optimizer, for example) may require the relational database requests be expressed in relational algebra.
Relational algebra comprises a plurality of operations that can be performed on tables and, for example, includes a set of operators that take one or more tables as operands and produce a new table as a result. One important operation in the relational algebra is the “Cartesian product.” Cartesian product is a binary operation that takes two tables as operands and produces a third table as a result. The Cartesian product of two tables R and S (written as R×S) is formed by pairing each row of R with all the rows of S.
Usually the information sought from a database system is not the entire Cartesian product of two tables, but rather selected rows of the Cartesian product. In this case, a “join” of the two tables may be performed. A join is the Cartesian product of two tables, where particular rows of the resulting Cartesian product are selected according to a predicate. Specifically, the join of two tables RpS is those rows of R×S that satisfy the predicate P.
The join operation demonstrated in
For general information on relational databases, see J. D. Ullman, Principles of Database and Knowledge-Base Systems, vol. 1 (W. H. Freeman & Co., 1988).
Typical Algebrizing
In general, a typical algebrizer receives a syntax tree (hereinafter, a “SQL Tree”) from the parser, checks and transforms the tree in a series of steps, and then generates (via “tree translation”) a QP Algebra tree as its output. (QP Algebra is a specific form of relational algebra in tree form understandable by a SQL Query Processor or “QP”.) An algebrizer is necessary because the input syntax tree (the SQL Tree) not only lacks many adornments (such as type information) that are expected by the Query Processor, but the SQL Tree also utilizes certain operators that are not understandable by (nor intended for) the QP (e.g., operators that are specific to the algebrizer). A typical algebrizer operates by making several distinct passes through the SQL Tree where each such pass walks the entire statement tree in a depth-first fashion.
Table and Column Combining 416 is where every table and column name in the query is decorated with a reference to the corresponding column definition object (CValRef), and names representing the same object get the same reference. In this manner, the names themselves can be replaced with identical ValRef pointers when forming the QP Algebra (reflecting an optimized tree structure). The column definition objects are then initialized from the catalogs, and a check is made to ensure that every name in the query actually refers to a valid table or column that exists in the system catalogs and is visible within the particular query scope.
Aggregate Binding 418 is a necessary step because, in QP Algebra, an aggregate may only be a child of a GbAgg relational operator, whereas in SQL (e.g., SQL Text) aggregates are freely used in many contexts where a scalar expression is allowed. GbAgg operators, on the other hand, correspond to QuerySpecs in the input tree. Consider the following example:
SELECT c1 FROM t1 GROUP BY c1 HAVING EXISTS (SELECT*FROM t2 WHERE t2.x>MAX(t1.c2))
In this example, the MAX aggregate is in fact computed in the outer QuerySpec, although it is syntactically located in the inner one. However, in SQL, there is no explicit notion for this—the decision is made implicitly based on the aggregate's argument. Consequently, every aggregate needs to be bound to its hosting QuerySpec in QP Algebra, and the process of doing so is what is referred to herein as “aggregate binding” is the process undertaken by the step of Aggregate Binding 418.
Type Derivation 420 is the step where the types of any scalar nodes and full metadata for all relational nodes are determined, a necessary step since TSQL is statically typed. (TSQL, a.k.a. “Transact-SQL,” is a set of programming extensions that add several features to the SQL including transaction control, exception and error handling, row processing, and declared variables.) This step is done in a bottom-up fashion, starting from leaf nodes: columns (whose type information is read from the catalogs) and constants. Then, for non-leaf nodes, the type information is derived based on the particular node type and the types of its children nodes.
Property Derivation 422 is where the Algebrizer 410 determines whether the individual statements being processed are deterministic (that is, where repeated executions always net the same resultant value), precise (that is, where the same execution of a processor-influenced operation, e.g., the precision of a floating-point operation, nets the same resultant value on different processors), accesses database data (as opposed to, e.g., processing only scalar values), and other properties of the statement that are necessary for the correct and maximally efficient execution of the statement. The final step for the Algebrizer 410 is Tree Translation 424, where the resultant in-process SQL Tree—enhanced and modified by the previous steps—is finally translated directly into QP Algebra.
To complete each of the aforementioned steps, the typical algebrizer of FIG. XX requires six passes through the SQL Tree. However, none of these passes include a step of Constant Folding 432 which, as illustrated in
Constant Folding 432 is a process whereby queries are simplified by removing statements that inevitably resolve to a scalar value. For example, consider the following SQL statement:
SELECT c1 FROM t WHERE c1>3 AND (SIN(30)*SIN(30)+COS(30)*COS(30)=1)
In this example, it is clear to see that the subpart of the second condition of the AND statement, “sin2(30)+cos2(30)=1”, always returns a value of “TRUE” because sin2(x)+cos2(x)=1, and thus this second condition can be reduced to the single value of “TRUE.” Further, since the result of an AND statement is “TRUE” if either condition is “TRUE,” and since the second condition is always “TRUE,” the AND statement itself is also always “TRUE.” Thus, the original example statement can be rewritten as follows:
SELECT c1 FROM t WHERE c1>3
In this regard, Constant Folding 432 is the process that identifies and replaces (“folds”) these kinds of scalar values (“constants”).
Improved Algebrizer
The SQL Algebrizer of the present invention comprises a plurality of algorithms to “algebrize” SQL Trees to QP Algebra using an approach that is more consistent and much more efficient than typical algebrizers. More specifically, the Algebrizer of the present invention processes a SQL Tree using a reduced number of recursive depth-first passes by performing multiple operations in a single pass. Furthermore, the Algebrizer of the present invention also performs the operation of constant folding in this single, multi-operation pass so that the QP, upon receiving the QP Algebra, needs not perform this operation at all.
Conclusion
The various systems, methods, and techniques described herein may be implemented with hardware or software or, where appropriate, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. In the case of program code execution on programmable computers, the computer will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. One or more programs are preferably implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations.
The methods and apparatus of the present invention may also be embodied in the form of program code that is transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via any other form of transmission, wherein, when the program code is received and loaded into and executed by a machine, such as an EPROM, a gate array, a programmable logic device (PLD), a client computer, a video recorder or the like, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code combines with the processor to provide a unique apparatus that operates to perform the indexing functionality of the present invention.
While the present invention has been described in connection with the preferred embodiments of the various figures, it is to be understood that other similar embodiments may be used or modifications and additions may be made to the described embodiment for performing the same function of the present invention without deviating there from. For example, while exemplary embodiments of the invention are described in the context of digital devices emulating the functionality of personal computers, one skilled in the art will recognize that the present invention is not limited to such digital devices, as described in the present application may apply to any number of existing or emerging computing devices or environments, such as a gaming console, handheld computer, portable computer, etc. whether wired or wireless, and may be applied to any number of such computing devices connected via a communications network, and interacting across the network. Furthermore, it should be emphasized that a variety of computer platforms, including handheld device operating systems and other application specific hardware/software interface systems, are herein contemplated, especially as the number of wireless networked devices continues to proliferate. Therefore, the present invention should not be limited to any single embodiment, but rather construed in breadth and scope in accordance with the appended claims.
Finally, the disclosed embodiments described herein may be adapted for use in other processor architectures, computer-based systems, or system virtualizations, and such embodiments are expressly anticipated by the disclosures made herein and, thus, the present invention should not be limited to specific embodiments described herein but instead construed most broadly. Likewise, the use of synthetic instructions for purposes other than processor virtualization are also anticipated by the disclosures made herein, and any such utilization of synthetic instructions in contexts other than processor virtualization should be most broadly read into the disclosures made herein.