The present disclosure relates to a device, system and method for answering relational database queries, object-relational database queries, and graph database queries.
Relational model is historically and practically the most common way of expressing all kinds of data which are stored and accessed by software applications. A number of query languages exist, most notably different versions and dialects of SQL, and LINQ. Key-value stores and RDF stores can be regarded as particular cases of the relational model, and can be queried with certain useful (albeit not necessary) extensions of these query languages. General-purpose object stores and graph databases also exhibit many features of the relational model, and can be mapped to it.
The theoretical foundation of the relational model is the Relational Algebra. It was originally formulated by Edgar F. Codd in 1970-es, defining relations as sets of tuples, and employing a number of basic operators, such as unions, joins, projections, etc. An important problem, however, is that most software applications prefer to deal with multisets—often because it better reflects the application semantics (duplicate tuples do matter), and anyway because removing duplicates is expensive, and so better avoided unless really needed.
Embodiments presented herein offer the advantage of representing, transforming, and answering all kinds of queries against relational databases (such as those expressed in LINQ, SQL, and their object/graph oriented extensions) as algebraic expressions, by providing formal infrastructure for representing relational data model and the operations over it, and representing different query optimization techniques as sets of transformation rules applicable to the said algebraic expressions.
According to the first aspect, there is provided a system (100) for representing, optimizing, and answering queries against relational databases, or object-oriented databases, or graph databases.
The system comprises specification of tuples and k-relations (111), where rows, object instances, or graph nodes and edges are represented by tuples, and stored collections of rows, object instances, or graph nodes and edges are represented by k-relations, wherein the collections represented are either sets or multisets. Besides being the basis of the system (100), this additional allows common data representation for multi-model databases.
The system further comprises specification of two basic algebra operators: natural join (113) and natural union (114), which allow deriving one or more other database algebra operators as a composition of these two basic operators over the stored k-relations (111), empty k-relations, and generated sets (112), which are infinite k-relations finitely computable for the purpose of the natural join in the context of the query being processed. This allows reasoning about the algebraic properties of the derived operators in a uniform way, defining transformations which are known to produce equivalent forms, and optionally defining the said transformations only in terms of the two basic operators.
The system further comprises means for representing database queries as natural algebra expressions, which are expressions comprising one or more of a natural join operator (113), a natural union operator (114), and other derived operators (120) over stored or generated k-relations. This algebraic representation is general enough to express queries translated from a number of different query languages, including relational, object-relational, and graph query languages, and further allows for different forms of query interoperability and multi-query optimization across languages.
The system further comprises means for transforming the said expressions by systematically applying transformation rules, including specification of these rules, and sets of rules for separate query processing tasks such as simplification, normalization, logical and physical optimization. This constitutes a framework for more general and powerful query optimization than what is available in database management systems based on algebras involving greater number of independently defined operators, and in systems which do not employ algebraic representations of database queries.
According to a further aspect, there is provided a method (200) for relational, object-relational, or graph query answering based on a sequence of algebraic representations (221, 241, 242, 243).
The method comprises translating (211-219) a database query (201-209), to a Natural Algebra representation (221), thus obtaining a general and language-independent equivalent of the said query, which allows reasoning about its properties, and serves as a basis for the next step.
The method further comprises systematically applying (230) transformation rules (231-233) in order to transform the Natural Algebra representation (241) of the query being processed to a logically optimized representation (242), and eventually to a physically optimized executable representation (243). This unifies a number of query optimization techniques within a single framework, where each step or approach is expressed declaratively as a set of transformation rules.
The method further comprises evaluating (280) the physically optimized executable representations against a database instance (270) in order to obtain a query answer (290). Evaluating an executable Natural Algebra expression, independent of the query language, further allows having a single execution mechanism for a number of high-level data models and query languages, which are translatable to Natural Algebra. It further allows cross-query runtime optimization across data models.
The method (200) assumes that at least the final algebraic query representation (243) is executable in each particular embodiment. For this purpose, an embodiment should include an execution engine (280) which accepts an algebraic query representation where all operators are physical operators. The execution engine is still free to perform certain transformations (such as cost-based join ordering). Though the particular design of the execution engine (280) is out of the scope of this disclosure, the method (200) dictates that the database data (270) can only be accessed at the query answering stage (280) when the executable NA translation (243) is already obtained. This allows decoupling the query execution part of query answering from query optimization, the latter being facilitated by algebraization (210) and transformations (230). In contrast, database schema (250), index catalog and statistics (260) are available during the query optimization stages.
According to a further aspect there is provided a computer program loadable into a memory communicatively connected or coupled to at least one data processor, comprising software for executing the method according to any of the embodiments presented herein when the program is run on the at least one data processor.
According to yet another aspect there is provided processor-readable medium, having a program recorded thereon, where the program is to make at least one data processor execute the method according to of any of the embodiments presented herein when the program is loaded into the at least one data processor.
Embodiments presented herein offer the advantage of representing, transforming, and answering all kinds of queries against relational databases (such as those expressed in LINQ, SQL, and their object/graph oriented extensions) as algebraic expressions, by providing formal infrastructure for representing relational data model and the operations over it. All such operations are derived from the two basic ones: natural join and natural union, so their algebraic properties are also derived. This allows a uniform representation of database queries, and more general and powerful query optimization, among other things.
The invention is now to be explained more closely by means of preferred embodiments, which are disclosed as examples, and with reference to the attached drawings.
The present disclosure describes a system and method for answering all kinds of queries against relational databases, or any databases which expose (are mapped to) the relational data model. This is done by specifying algorithms data structures for representing relational data model and the operations over it, and by translating any SQL, LINQ, and other relational queries to algebraic expressions, which are transformable and executable within this infrastructure.
We start by describing natural algebra in terms of its types (data structures) and operators, and then cover the related practical aspects of query processing, such as algebraization and execution. We then continue with an extended set of examples in Section 2 each of these examples is designed to demonstrate one of the basic features of the practical database queries, and their handling within our system (100), by applying the steps of the method (200).
A more step-by-step overview of the algebraization method is given in Section 3 using SQL as an example language. The framework for rule-based query optimization, is presented in Section 4. Section 5 provides more examples in application to the object/graph query language Starcounter QL. Section 6 discusses different options for the execution environment, where Natural Algebra expressions are used to obtain query results.
In this formal infrastructure, the queriable data is represented by k-relations, which are multisets of tuples of named values. Tuples map named attributes to the values of the respective domains, for example, t=name→‘John’, age→32 (despite the angular brackets notation, the order of the mappings does not matter), and k-relations map a set of tuples to numeric-valued multiplicities, for example, A={t*1, u*3}. The multiplicity of a tuple t in a k-relation A is denoted as kA (t) We say that a k-relation ‘contains’ a tuple, t∈A if and only if its multiplicity is non-zero: kA (t)≠0.
An equivalent way of defining k-relations would be to attribute multiplicities k(t) to tuples per se, thus defining k-tuples, a→a(t), b→b(t), . . . , k(t), and k-relations would be sets of such k-tuples. All the definitions given below can be re-reformulated based on such formalism as well, but theorem proofs would become more complex, since they will have to deal with multiple equivalent set representations of the same k-relation.
The header H(t) of a tuple is the set of all attributes which it maps to values, but the header H(A) of a k-relation is always explicitly given as part of its definition. For example, if a k-relation represents a table in a relational database, the header of such k-relation is the set off its column names and the associated data types. This also allows empty k-relations ∅X with a specified header X. A tuple may also be empty, as denoted by and a k-relation containing only an empty tuple with multiplicity n is called a scalar k-relation: {*n}.
In some embodiments the ‘stored’ or ‘basic’ k-relations may be required to contain only non-negative integral mutiplicities of tuples, which covers the needs of all the mainstream relational databases, and all datasets which can be mapped to such databases (relational databases, object stores, RDF, graph databases, etc.) However, this is not a requirement for the Natural Algebra, it merely makes certain traditional query optimization techniques readily applicable. Other embodiments might represent probabilistic data, fuzzy sets, or other continuous measures of existence by utilizing real-valued multiplicities.
Also, as demonstrated in Example 11 (set difference) and Example 15 (antijoin), negative multiplicities are useful, and k-relations containing negative multiplicities of tuples may be created as intermediate results, or used to represent updates to the database.
One basic operation on k-relations is natural join which produces a k-relation consisting of joined pairs of the compatible tuples from the argument k-relations. Two tuples are compatible t↔u if and only if they map the same attributes to the same values, for example, t=name→‘John’, age→32 and v=name→‘John’, ssn→1234. By joining compatible tuples we obtain a tuple containing union of the mappings: t·v=name→‘John’, age→32, ssn→1234. In particular, two tuples with non-overlapping headers are always compatible, and their join is always defined. Also, since mappings in the tuple are not ordered, tuple join is commutative: t·v=v·t.
Natural join operation of two k-relations joins all pairs of the compatible tuples from its left and right arguments, and multiplies their respective multiplicities:
The header of the result of a natural join is defined as union of headers of its operands:
Natural join operation is commutative and associative, and uses scalar k-relation {*1} as the identity element. Therefore Natural Algebra is a commutative monoid with respect to the natural join operation.
Different embodiments may include different implementations of the natural join operation, depending on the internal representation of k-relations, available access paths to tuples with specified values (i.e. database indexes), and other technical choices. The most common examples are nested-loop join (either using scans or index lookups), hash join, and sort-merge join,
More optimized implementations of join operation may be included for the specialized cases, such as joining under many-to-one guarantee, which is typically provided by non-nullable foreign keys in the relational databases, or for joining sets of nodes/objects which directly link one to another, as is usually the case in graph databases/object stores.
Another basic operation is natural union + producing a k-relation containing the ‘common’ parts of all tuples in the argument k-relations, with the multiplicities of the matching tuples summed up.
where X is the common set of attributes of the operand k-relations, which is also the header of the result of the natural union:
For example, the natural union of k-relations each containing a single instance of tuple t and v from the previous example, would contain two instances of tuple mapping the only common attribute: {t*1}+{v*1}={name→‘John’*2}
Natural union operator is commutative and associative, and uses empty k-relation Øs with the universal header Ω as the identity element. Therefore Natural Algebra is a commutative monoid with respect to the natural union operation. We do not claim the inversibility property of the natural union operator (which would make Natural Algebra an abelian group w.r.t. this operator). Even though for each k-relation A there exists an ‘inverse’ k-relation
the result of natural union A+(−A)=∅H(A) depends on the header of A, and does not produce the aforementioned identity element ∅Ω.
The domain Dom(X) is the (potentially infinite) set of all possible tuples with header X. Though the definition above implies running t through an infinite domain, any embodiment will have to include a finite-time and finite-space implementation of the natural union. This is certainly possible, since parts of only those tuples which are contained in at least one of the operands will get a non-zero multiplicity in the result of the natural union.
The above notation assumes that attributes are identified by name and domain, i.e. there cannot be attributes with same names but different domains in headers of k-relations.
However, this is not the case in the classical relational databases, as one can define tables with homonymic columns of different types, and then try to compute a relational union of such tables. Embodiments which are designed to be compatible with existing database management systems supporting SQL need a specific implementation of natural union, which would emulate the behavior of SQL union operation in this aspect -either throwing an error if the domains are different, or trying to construct common wider domain for the respective attribute in X.
Natural join and natural union are symmetrical in a certain way. Natural join collects only matching tuples, while widening header, and natural union collects all the tuples, but narrows the header.
The distributivity of natural join across natural union requires that the set of attributes participating in the natural join remains the same:
The conditions for the distributivity of natural union across natural join are much more restrictive—it is required that all headers are the same, and the k-relation A is flat (i.e. contains each distinct tuple with multiplicity 1) and disjoint from the others (i.e. produces empty natural join with either of them):
Since the distributivity laws do not hold in general, we do not claim Natural Algebra being any sort of a ring-like algebraic structure.
The examples 0-21 below show practical usage of Natural Algebra for representing queries in standard SQL. Further examples, featuring a query language extended for querying graph databases and object stores are provided in Section 5. Every Natural Algebra expression is executable in any embodiment, implying that every operator is either defined in terms of other operators (ultimately, in terms of natural join and natural union), or has at least one implementation in that embodiment.
Also, any Natural Algebra expression is easily transformable to equivalent expressions using transformation rules. Some embodiments may define physical (executable) operators which correspond to complex patterns of Natural Algebra subexpressions, such as Range(x
In the preferred embodiment any Natural Algebra expression is stored and processed in the machine memory as an executable S-expression consisting of operators and their arguments. Such representations are provided in every example below.
In order to accommodate different kinds of equi-joins, theta-joins, and subqueries in SQL, every table/class listed in the FROM clause of an SQL query is translated to Natural algebra along with its alias in the query (or the table name itself acting as the default alias). A simplest SQL query
where ρP. is the prefixing operator, which transforms the header of its argument relation, by prefixing each attribute name with the specified prefix P. We can also express the header of the result of this query, by prefixing the header expression:
meaning that for every attribute ai∈H(A) there will be a corresponding attribute A.ai in the result.
The formal definition of ρP., which ultimately builds on natural join and natural union is provided in the Example 4 below. However, since this operator does not touch the actual data, any embodiment is likely to have a simple implementation for it. An expression ρA.(A) is also a good target for partial computation, since it only depends on the table alias (coming from the query) and table columns (coming from database metadata), which are available before query execution.
A preferred embodiment, representing Natural Algebra expressions as executable Lisp S-expression, thus providing full compositionality, also defines PREFIX operator for this purpose. The S-expression equivalent of the above translation would be
As we will see later, theta-joins, equi-joins and Cartesian products, which always need to distinguish between the attributes coming from left and right operands, provide a notational facility/‘syntactic sugar’ for prefixing.
Most of the work on the relational algebra so far is only concerned with so-called SPJ queries, i.e. queries which contain only select, project, and join operators of the relational algebra. The following SQL query is an isolated example of the selection operation:
Here a predicate P(t)=(x(t)>0), is applied to every tuple t∈A. We represent such query with a Natural Algebra expression
or an executable S-expression in the preferred embodiment:
In this representation, {A.x>0} is a generated set-a possibly infinite k-relation with header {A.x} and a tuple with each possible value for x such that x>0 is true.
In the preferred embodiment, the constructor macro (FILTER (NP.>A.x 0)) creates such a generated k-relation which, receiving the known values of x through the natural join, either produces a compatible tuple A.x or not. The scalar null-propagatating function NP.> will not raise an error if x is Null, returning Null as a result instead, which in this case would be effectively same as false.
Another typical SQL query is a selection of some of the columns from each tuple in a relation, known as the projection operation:
The projection operator Π in Natural Algebra can be expressed as a natural union with an empty k-relation with the specified header, hence two options of the representation
In the preferred embodiment there are both ‘operator-rich’ and ‘operator-homogenic’ equivalent configurations of the same expressions, so the equivalent representations are
Any embodiment supporting alternative configurations of the equivalent Natural Algebra expressions can be equipped with transformation rules to perform such rewrites in at least one of the directions, for example
which is a context-free rule with symbols A and X. A reverse transformation,
is also context-free, however, it needs to handle associative cases of the n-ary natural union+operator.
The whole rule-based transformation framework is part of the system and method for optimizing Natural Algebra representations of the relational database queries, described in Section 4 below and is part of the patent claims.
SQL allows giving output columns specific names. The simplest case is renaming a projected column:
A direct Natural Algebra representation would include an extension with a generated set:
or, in the preferred embodiment, as an executable S-expression
Here the constructor macro (EXTEND z A.x) creates such a generated k-relation which, receiving the known values of x through the natural join, always produces a compatible tuple x, z such that the value for z is provided by the scalar expression A.x.
Any valid scalar expression that would appear before AS in the SELECT clause of an SQL query can be translated to an S-expression which comes as the last argument to EXTEND.
In the most general case, a generated set G={P(Z)}, having the header Z, based on predicate P(Z) which is dependent on the set of attributes Z, consists of a number of computable directions di=Zi, pi, gi. Here each Zi ⊆Z is a set of bound variables required for computation, predicate pi(Zi) defines whether a tuple with given values for Zi should be present in the generated set, and if so, function gi(Zi) generates a whole tuple with header Zi directly copying the values for the attributes in Zi and computing the values for the remaining attributes in Z\Zi.
By default, a generated set constructed with (FILTER P(X)) has single direction d=X, P(X), tX and a generated set constructed with (EXTEND y f(X) has a single direction d=X true, tX·(y→f(X)). In this notation, input tuple tX with header X is extended with an extra mapping (y→f(x)).
However, in certain important cases more directions can be included. For example, if a filter predicate has a form P(X)=(y=f(X′), where X′=X\{y}, an additional direction d′=X′, true, tX′·(y→f(X′)) is included into the generated set, which effectively means that a ‘filter’ is always true. By transforming the scalar expressions inside the predicate, more equivalent forms of P(X) can be derived, with different variables singled out on one side of the equality—hence more computable directions.
A natural join AG of a finite k-relation A with a generated set G={P(Z)} is finitely computable if and only if there exists a direction Zi, pi, gi∈=G such that Zi⊆H(A).
For example, consider a Natural Algebra expression AB{x=y}, where x∈H(A)\H(B) and y∈H(B)\H(A). If the generated set were produced by a FILTER constructor
In English, the following query could be phrased as “return all columns from tables A and B and all rows where value in column x in A matches the value in column y in B”. This query involves an equi-join operation, which is very common in queries of the mainstream relational database management systems, but is not fully supported by the existing algebraic approaches so far.
This query translates to Natural Algebra using the corresponding equi-join operator:
or, in the preferred embodiment, as a computable S-expression with E-JOIN:
As was mentioned earlier, we have only two basic operators in Natural Algebra: natural join and natural union +. An equi-join, in the general case, is defined for the aligned vectors {dot over (X)}=a1, . . . , an and {dot over (Y)}=b1, . . . , bn of the names of those attributes which participate in the equality requirement:
and is only defined if H(A)∩H(B)=∅, so that the result of an equi-join contains all attributes from both of its arguments. This condition is typically provided by prefixing operators inserted into the translation.
The above query can be rewritten, according to the definitions of equi-join, as:
or, as an executable S-expression in the preferred embodiment:
Here {A.x=B.y} is a generated set with columns A.x and B.y and all possible values for A.x and B.y from the respective domain(s), such that the equality holds.
The renaming operator ρ (corresponding to the PREFIX function in the S-expression) is also ultimately defined via natural join, natural union (as projection), and a filtering generated set:
The query from this example can be expressed in terms of basic operators and generated sets as:
Such verbose representation, however, is not required—neither for internal representation nor for computation of the query as Natural Algebra expression.
Not the most common, but the most basic (i.e. without any explicit conditions) way to join two tables in SQL is to join them on all matching attributes-the so-called natural join:
We do not map this to the natural join operator in the Natural Algebra, since we still want to track the provenance of the columns in the result of this query. Instead, we translate SQL natural join as a particular case of an equi-join:
where X=H(A)∩H(A) is the set of common attributes of the tables joined, and vector {dot over (X)} contains these attributes in a certain arbitrary order.
A corresponding executable S-expression in the preferred embodiment is, naturally:
It should be noted that the natural join operation is rarely directly useful in queries. The main reason is that the result of a query containing natural join operation is sensitive to schema changes, so the same query would return different sets of tuples (albeit the same header) if a column with a matching name is added to or removed from one of the operand tables. In some SQL dialects, such as Microsoft T-SQL, natural join operation is not available at all.
Though discouraged in practice, Cartesian product operation is allowed by most query languages, as an extreme form of a join, where no join predicate is supplied or equivalently, if the join predicate is constant true:
This query translates to Natural Algebra using the Cartesian product operator:
or, in the preferred embodiment, as a computable S-expression with E-JOIN, and empty vectors of matching attributes:
As can be seen from the latter representation, Cartesian product is the particular case of an equi-join, where vectors {dot over (X)} and {dot over (Y)} are empty, i.e. no equalities are required:
which is only defined if H(A)∩H(B)=∅, which is typically provided by inserting rename operators into the translation of an SQL query.
In SQL, two tables can be joined on any logical condition, not only equality. Consider a slight modification of the query from Example 4:
In the preferred embodiment, this corresponds to an executable S-expression joining a Cartesian product with a filtering generated set:
In Natural Algebra, theta-join with condition θ(P.X,Q.Y), dependent on attributes X from the left operand and attributes Y from the right operand, is defined as:
where {θ(X,Y)} is a generated set corresponding to the join predicate. This definition inherits the Cartesian product's requirement of disjoint headers H(A)∩H(B)=∅, which is typically provided by inserting rename operators into the translation of an SQL query. Technically
In the preferred embodiment, equi-join which is not a Cartesian product has its own implementation, since it can avoid the exhaustive iteration across its both operands: by utilizing the indexes: using the equality filter, it can ‘bind’ the variables for the generated sets, as shown in note to Example 3.
This is not the case for theta-join that, in general, relies on the Cartesian product with join predicate applied to filter its results. However, different embodiments might involve a range of rewrites and other optimizations, in order for theta-join to utilize, for example, range indexes.
Query from this example, represented in Natural Algebra as a natural join of two tables with disjoint headers, and one generated set:
cannot join ρA.(A) with {A.x>B.y} first, since the generated set has a single direction where both A.x and B.y need to be bound. However, some embodiments may implement a physical operator Range(−∞,x )y(B), based on a range index for attribute y in k-relation B, so that for any given value of x it selects all tuples from B such that y<x. Algebraically, this query would be rewritten to
Formally, the expression Range(−∞,A.x )y(B)is a generated set which only requires {A.x} as a set of bound attributes. Such a rewrite avoids the computation of the Cartesian product ρA.(A)ρB.(B), and, if the range index lookup happens to be selective, drastically reduces the time complexity of answering the query.
A simple example of grouping and aggregation query in SQL
In the preferred embodiment this corresponds to an executable S-expression
In Natural Algebra we define grouping and aggregation as a natural union over tuples across the domain of the grouping sub-header:
where the scalar function I(At) returns 0 if the argument k-relation is empty, and 1 otherwise. The dot operator · inside the braces denotes tuple multiplication by a tuple with non-overlapping header. The same operator outside the braces is scaling of a k-relation with 0 or 1 multiplier. Aggregate function FXZ produces a tuple with header Z for each group At and depends only on the values for attributes X in the group.
While the domain Dom(G) is typically infinite, for any finite k-relation A there is a finite number of non-empty groups At, so the operator is always computable. In some of the embodiments, the implementation of AGGREGATE maintains a hash-table, where sub-tuples t serve as keys, and incrementally computed values of aggregate function FXZ as values. Tuple t is iterated across the distinct set of ΠG(A), thus resulting in only non-empty groups At.
Query from this example can be expressed in as
so that for a k-relation
the domain Dom({y}) might be infinite, depending on the type of the column y, but the only tuples from it, which would produce non-empty groups are y→2 and y→3.
Aggregate function Sum{x}{z} is defined as:
which would produce resulting tuples y→2, z→3 and y→3, z→6.
An important particular case is aggregation without grouping. In SQL such query would be expressed as
However, it's not the same thing as grouping on no attributes Ξ∅, since the above SQL query is guaranteed to produce at least one tuple, whereas Ξ∅ operator would return an empty result on empty input.
Instead, the latter query would be translated to Natural Algebra as a direct call to the aggregate function:
where the definition of Sum{x}{z} takes care of the case where input is empty, emulating the standard SQL behavior.
Another example of queries which typically lack formalization in the relational database theory is aggregation applied to multiple columns, also expressions under the aggregate function calls and across the results of aggregate functions. The following query has all these features:
It translates to the Natural Algebra as
where the first argument to Ξ is a vertical composition of the standard single-column aggregate functions. The combined aggregate function Sum{_aa1}{u}·Max{A.x}{_a2}·Min{A.x}{_a3} produces a tuple with header {u,_a2,_a3}, using the attributes {_aa1, A.x} from a k-relation which is the second argument to Ξ.
Any embodiment performing the translation of such query to Natural Algebra will need to introduce the intermediate computed attributes, both inside the argument k-relation, and outside the Ξ operator. The algebraizer in the preferred embodiment uses generated identifiers starting with underscore and ‘a’ for the results of the aggregate functions or with ‘aa’ for the arguments to aggregate functions. The projection to the set of columns explicitly mentioned in SELECT clause is added on top of the expression—otherwise _a2 and _a3 would also appear in the result.
The executable S-expression in the preferred embodiment demonstrates the usage of COMBINE-AGG macro, utilizing the built-in aggregate functions such as SUM, MAX, and MIN by supplying them with argument and result attributes, and combining them into a single multi-column aggregate function:
Removal of duplicates (the distinct operator) is common in relational databases, especially in situations where multisets might be produced, e.g. due to a projection:
This translates to Natural Algebra as
which is represented as an executable S-expression in the preferred embodiment:
The distinct operator ∥⋅∥ is a particular case of grouping and aggregation, where grouping is performed on all attributes of the argument k-relation (as specified by the projection in our example), and the aggregate function is the constant empty tuple:
Depending on the internal implementation of k-relation, the implementation of distinct operator might be different in different embodiments. If, for example, k-relations are always internally represented as sets of unique tuples with multiplicity value attached to each of them, the distinct operator would simply clone a k-relation and set multiplicities to 1. Other embodiments might use hash tables, or hash-table based stream windows to accommodate distinct over streaming k-relations, etc.
Most query languages incorporate both set and multiset semantics, switching between them in a number of explicit and implicit ways. One explicit way to switch to the set semantics is the distinct operator in SQL. Some operations, such as UNION, INTERSECT, and EXCEPT are designed to operate on sets of tuples with matching headers, and produce sets of tuples. The translation of such operations to Natural algebra involves (a) inserting distinct operator around the arguments or result, and (b) checking that the headers of the argument relations are union-compatible.
The notion of union-compatible does not depend on the column names, but depends on the implicit order of the attributes in each header, and their data types. We discuss the order in the following notes, so far we assume that we try to ‘cast’ one k-relation to the header of another.
Operator Cast{dot over (H)}(A)(B) would assume there are ordered headers {dot over (H)}(A)=a1, . . . ,an and {dot over (H)}(B)=b1, . . . ,bn with the same length n, and the domains of the respective attributes either agree
or (optionally, in some embodiments), there exists a wider domain {tilde over (D)}i containing the original ones:
If these conditions hold, the result of Cast{dot over (H)}(A)(B) operator is k-relation B with header substituted to {dot over (H)}(A):
If wider domains {tilde over (D)}i were derived for some of the attributes ai∈{dot over (H)}(A), the domain of such attribute in the resulting header would be set to {tilde over (D)}i.
Now, consider an SQL query
or to the following executable S-expression in the preferred embodiment:
The multiset-based SQL counterpart of it would use UNION ALL instead:
or to the following executable S-expression in the preferred embodiment:
Similarly, set-based INTERSECT operator in SQL, as in
or, to an executable S-expression in the preferred embodiment:
Finally, EXCEPT operator in SQL denotes the set difference of the results of two queries with matching headers:
The set difference is Natural algebra is defined on the distinct views as a subtraction (i.e. natural union with the negated second argument) of the left k-relation and the set intersection of both:
This corresponds to an executable S-expression in the preferred embodiment
where N* operator denotes a scalar multiplication of a k-relation.
So far we were regarding headers of k-relations as unordered sets. However, the following simple rules provide an order of attributes, which is compatible with common SQL ordering for the queries translated so far:
where | is a vector concatenation operator, and set difference \ of a vector and a set produces a sub-vector.
Note that these rules effectively break the commutativity of tuple join, natural join, and natural union operations, that's why in Natural Algebra we treat headers as unordered sets of attributes. All transformation rules for Natural Algebra expressions and any cost-based query optimizer (whose task is to re-order the join operands) should still treat these operations as commutative.
In order for set operations from Example 11 to be both SQL-compliant and independent on the join order chosen by optimizer, the header of each operand to a set operation needs to be computed first based on the ‘most direct’ translation, preserving the user specified join and projection order.
For example, an SQL query
Before making any further transformations, we memorize the user-specified ordered header of the right operand to UNION ALL:
If any expression transformations change the join order, we will still use the original order of attributes specified by {dot over (X)} for the purpose of Cast operator.
Most query languages provide some degree of composability, allowing a query to refer to the result of another query. In SQL, this is done either by view definitions and usage
Both cases are equivalent, and would translate to Natural Algebra as
or, as an executable S-expression in the preferred embedding:
Here the expression ρB.(B){B.y>0} is the translation of the subquery, which in the host query receives alias C (translated to prefix C.), and participates in a theta-join.
Natural Algebra is ultimately composable, allowing any kind of k-relation expressions in each place where a k-relation is expected. For this reason, the attribute prefixes are allowed to ‘stack’ one before the other.
Relational query languages, such as SQL, allow greater interconnectedness among the query parts than simple context-free composability. Correlated subqueries rely on the variable bindings of the host query-much like the generated sets in Natural Algebra rely on the variable bindings of the finite k-relations they are joined with. Correlated subqueries are common in filters (as shown in the following examples), but some dialects of SQL, such as PostgreSQL, allow correlated table-valued subqueries directly in the FROM clause-so called LATERAL subqueries:
Translating this query to Natural Algebra involves a predicate lifting technique. The Cartesian product from in the host query effectively gets joined with the predicate moved from the subquery, and becomes a theta-join:
or, as an executable S-expression in the preferred embodiment
which is quite similar to the theta-join from the Example 7.
A different technique is needed to translate the following correlated aggregate subquery
If the condition in the subquery were dependent on a grouping variable, we could have used the same predicate lifting technique as in the previous example. However, here variable y cannot be accessible outside the subquery. So in order to evaluate the condition dependent both on A.x and y, we need to instead bind A.x inside the subquery. This is achieved by the scaffolding technique:
Here the subquery is joined with the so-called finite scaffold ∥Π{A.x}(ρA.(A))∥ of the remaining part of the outer query that produces a superset of all the bindings which A.x would receive. The scaffold variable A.x also becomes a grouping variable, and a separate group is generated for each such binding. Then, after the Cartesian product is formed, only those bindings which correspond to the actual bindings of A.x in the host query are retained.
The above is the ‘compositional’ role of the finite scaffold, which makes it possible to compute every finite Natural Algebra subexpression in those embodiments which allow only context-free compositionality.
In different embodiments, the implementations of the grouping and aggregation Ξ operator might be different. In particular, if a grouping variable is bound by a join where the result of Ξ participates, the grouping may be limited to only those bound values, effectively binding the grouping variable inside the under-aggregate expression. Since the finite scaffold ∥Π{A.x}(ρA.(A))∥ guarantees a superset of such bindings, it can be safely omitted from the under-aggregate expression. The simplified equivalent
This, however, is not a compositionally computable Natural Algebra expression, and would only work in those embodiments which offer the said optimization for Ξ. Moreover, they should formally allow a variable which is not bound by any finite k-relation under the aggregate operator to be used as a grouping attribute.
Subqueries can also be found in conditions, used either with quantifiers, or returning a single value (Example 19 below). The simplest case is the existence quantifier, acting as a logical expression:
The check of whether the result of the correlated subquery is empty acts as a filter for the tuples from A:
The logical predicate ∃ is true, if the Natural Algebra expression under it is non-empty, and false otherwise. The whole term {∃(ρB.(B){A.x>B.y})} is a generated set with header {A.x}, computable only if {A.x} is bound. The preferred embodiment defines the SQ-EXISTS form to directly reflect the existence quantifier and its dependencies:
with the first argument to SQ-EXISTS being the list of the attributes which need to be bound for evaluating it, and the second argument is the Natural Algebra expression under the quantifier.
This is, however, just the most direct translation. Taken separately, a filter with existence quantifier corresponds to the semijoin operator in relational algebra, and a filter with negated existence quantifier corresponds to antijoin. Natural Algebra defines semijoin and antijoin operators:
and the query from this example can be rewritten as
Similarly to the previous example, such Natural Algebra expression requires that the implementation of semijoin effectively passes the bindings for A.x from the left operand into the right, just like the natural join does. Otherwise, if the implementation of semijoin requires context-free computability of both arguments, or if semijoin is rewritten according to its definition to be executed via projection and distinct, the right argument (ρB,B{A.x>B.y}) is effectively infinite and cannot be computed.
To resolve this problem, a similar scaffolding technique can be applied:
The finite scaffold ∥Π{A.x}(ρA.(A))∥ produces all possible bindings directly for A.x. As this is the only common attribute among the left and right parts of semijoin, those bindings that pass the test A.x>B.y with at least one of the tuples from B appear in the result of the semijoin. Hence the equivalence to the original translation.
Example 16. Subqueries with ANY Quantifier
Let's consider a subquery with ANY quantifier, as in
We do not need another quantifier to represent ANY in the Natural Algebra, as this query literally means “select those rows from table A where A.x is greater than at least one y from table B”. Hence the translation, based on existence (non-emptiness) of a certain selection from B:
which repeats the Natural Algebra translation from the previous example. Those queries are equivalent indeed, and the rewrites from the previous example are applicable. The key feature here is that the otherwise ‘uncorrelated’ subquery Π{B.y}(ρB.(B)) becomes correlated, once extended by the filter {A.x>B.y} translating the condition around the ANY quantifier. We also do not keep projection under the existence quantifier—the only role of the projection in the SQL query was to point to the particular column to compare with.
The full list of Natural Algebra operators which can be safely dropped directly under existence quantifiers includes projection, extension, any join with many-to-at-least-one guarantee, renaming and prefixing, Cartesian product if the other argument is known to be non-empty, also left, right and full outer join (as defined below in Example 20), distinct, aggregation and grouping.
SQL also supports a negated version of ANY:
or, as an S-expression in the preferred embodiment,
This translation can be rewritten to antijoin, as shown in the previous example:
or, more briefly:
if the finite scaffold ∥Π{A.x}(ρA.(A))∥ is not required by the particular embodiment.
Example 17. Subqueries with IN Quantifier
Most query languages, including SQL, allow providing explicit bindings for a certain variable, for example
or to an S-expression in the preferred embodiment, using COLUMN constructor:
However IN can be also used to introduce a single-column subquery:
This is equivalent to using ANY quantifier with equality:
or, to an executable S-expression in the preferred embodiment:
This translation can also be rewritten to semijoin.
Example 18. Subqueries with ALL Quantifier
The following SQL query can be phrased in English as “select rows from A where A.x is not greater than all y values in B”:
Though given English phrasing may sound ambiguous, the SQL standard defines this as equivalent to “select rows from A where A.x is not greater than at least one y value in B”:
which can be further rewritten to semijoin.
An equivalent S-expression in the preferred embodiment would be
Similarly, a non-negated case of ALL quantifier, “select rows from A where A.x is greater than all y values from B”:
and can be accordingly rewritten to antijoin.
An equivalent S-expression in the preferred embodiment would be
Sometimes the sole purpose of a subquery is to return a single value (i.e. exactly one result tuple mapping a single attribute). For example, a query “select all rows from A where x is greater than the maximum y value in B” can be formulated in SQL using a non-correlated single value subquery:
Here the single-tuple result is guaranteed by a direct call to the aggregate function. The Natural Algebra translation would have to construct a relation from a single tuple, hence the braces around the call to Max:
In SQL there is no concept of −∞, so MAX aggregate function would return Null if table B is empty. This behavior is also emulated by the definition of Max{z}{z} in Natural Algebra (compare with definition of Sum{x}{z} in Example 8):
A different situation is when single-tuple result is not guaranteed by the subquery expression, for example:
Here the existence and uniqueness of the result from the subquery Π{y}(B{w>10}) is a kind of external knowledge or promise which needs to be verified at run time. All relational databases which support SQL would throw a run-time error if the subquery in this example returns more than one tuple, and would effectively compare A.x to Null if the subquery result is empty.
The Natural Algebra translation of the above SQL query, in the most general case, is:
The key feature in this translation is a natural join with many-to-at-most-one guarantee. The same guarantee takes place when, for example, joining on a foreign key.
Embodiments which aim to fully simulate SQL semantics with respect to single-value subqueries will need to validate this guarantee on the actual data, and throw a run-time error if this guarantee is not fulfilled.
The preferred embodiment implements this run-time check with a function ASSERT-01, which returns the argument relation if its count is 0 or 1, and throws an error otherwise:
A stronger version of the above is many-to-one guarantee, a relationship between two k-relations which we denote using a graphical primitive from ‘crow foot’ notation in the relational database modeling. With respect to join this takes place, for example, when joining on a non-nullable foreign key. Other cases are proven, including
Naturally, this guarantee includes the weaker ‘many-to-at-most-one’ guarantee used in the above example.
Different embodiments may use these kinds of guarantees for a number of purposes. For example, the corresponding physical implementation of the join operator might use such guarantee to avoid iterating on the respective operand past the first result tuple. At the same time, cost-based optimizer might find such guarantee useful for more accurate estimation of cardinalities of intermediate results (the stronger many-to-one guarantee offers more accurate estimates).
The many-to-one guarantee also offers a number of simplification opportunities for Natural Algebra expressions, such as
which are useful for optimization and for proving query equivalences.
Standard relational algebra left, right, and full outer join operators are directly expressed in SQL, for example:
The result of the query would contain all columns from A and B, and all tuples from A listed at least once. If for certain tuple t∈A there are join-compatible tuples u∈B, all such tuple joins t·u will be listed, otherwise, tuple t will be extended with null values for attributes from the header of B.
There are simple ‘outer join’ analogs to natural join in Natural Algebra.
where NullX is a single-tuple k-relation mapping all attributes in X to Null value.
However, these operators cannot be directly used to translate an SQL query, since there is no such thing as a ‘natural outer join’ in SQL. In practice, outer join analogs to equi-join and theta-join are useful, for example:
with other kinds of outer joins being defined via the inner versions of theta-and equi-join in the same manner, under the requirement of disjoint headers: H(A)∩H(B)=∅.
The query from this example would then simply translate to Natural Algebra as:
or, as an executable S-expression in the preferred embodiment:
Ordered k-relations can be modeled in Natural Algebra as a k-relation extended with order attribute i. An SQL query returning a sorted relation
Translates to natural algebra using a sort operator Ψ:
Here A.x∧, A.y∨ is a sort-vector-a vector of attributes, where each attribute is annotated with either ascending ∧ or descending ∨ flag. The sort operator Ψ produces an ordered k-relation as specified by this sort-vector.
An ordered k-relation {dot over (A)} is a k-relation A where every tuple t∈A contains a special order attribute i(t). A sort-vector, formally, is either an empty vector, or any sort-vector Ÿ concatenated with an annotated attribute:
Header of sort-vector H({umlaut over (X)}) is a set of all attributes in {umlaut over (X)}:
The sort operator Ψ can be defined as a join:
where H({umlaut over (X)})={s1, . . . ,sk}, and the function i(s1, . . . ,sk) is any such function that satisfies the precedence condition defined by {umlaut over (X)} on the set of tuples:
Tuple t precedes tuple u according to the sort-vector {umlaut over (X)}, given H(t)=H(u)⊇H({umlaut over (X)}):
this requires ≤ relationship to be defined within every domain Dom(ai), ai∈H({umlaut over (X)}).
For an empty sort-vector, thee precedence relationship holds for any pair of tuples, and order function i( ) can be any constant. However empty sort-vector does not allow eliminating the sort operator: Ψ(A,)≠A, a special order attribute is still attached to A.
As the above examples have demonstrated, SQL functionality is totally covered by Natural Algebra, and every feature in SQL query has at least one Natural Algebra representation. For many features we have defined corresponding derived operators, based on natural join and/or natural union, hence the equivalent forms using only basic operators, or some intermediate derived ones. In this section we outline the process of obtaining a most direct Natural Algebra translation of an SQL query, which can be further transformed, as described in Section 4.
Though the translation methodology is presented here using SQL as an input query language, a person skilled in the art should be able to define a similar algebraization method for any other query language based on relational, object-relational, or graph semantics. Section 3.9 below gives an example of translating LINQ queries in a similar way.
We start building a Natural Algebra expression with translation QFROM of the FROM clause, treating different kinds of FROM-expressions, according to the following rules:
Transforming a theta-join into equi-join is best expressed as a rule-based transformation:
where {dot over (X)}={dot over (H)}(A){dot over (∩)}{dot over (H)}(B) is a vector of common attributes, built from the ordered headers using a non-commutative vector-based version of set intersection:
In a schema-aware translation, header expression {dot over (X)} is a good candidate for partial evaluation during the query translation phase.
If a query makes use of parameters, they are enumerated as p1, . . . , pm attributes distinct from any other attributes in the query, and are represented by a special k-relation Par, H(Par)={p1, . . . , pm}, which is guaranteed to contain single tuple and its values will be available at run-time just as values of any other k-relations. Par is joined with QFROM in order to obtain the full set of attributes available in the query:
If there are no parameters used in the query, QPAR=QFROM
In SQL queries WHERE clause contains a single Boolean expression P(X), dependent on attributes of the expression in FROM clause and query parameters. We can always directly map it to a filtering generated set joined with the translation of the FROM clause.
and if there is no WHERE clause in the query, QWHERE=QPAR
However, conjunctive WHERE conditions are quite common, and it is sometimes useful to rewrite them into a set join of translations of conjunctive parts:
For example, a query containing
and using also the transformation rule in section 3.1 this would be equivalent to a nested equi-join:
In SQL queries SELECT clause corresponds to the projection operator in Natural Algebra. However, any scalar expression effectively produces new attributes, so such expressions are translated to the generated sets. The following translation rules apply, in order to obtain the translation QSELECT:
If the Natural Algebra expression has no other attributes than those listed in the projection, the projection can removed, as specified by the following simplifying transformation rule:
This also applies to aliased expressions which consist of a single attribute.
In the SQL examples so far we always used attribute names qualified with table aliases or names (which are their implicit aliases), such as A.x, B.y. In a query where only a single table is referred in the FROM clause, this is obviously redundant, and forming an equivalent translation of a query with unqualified attributes is a trivial task. We would still translate a query
Example 11 from the previous section covers the technique that covers the set operations. At this point we have translation of at least two operand queries of a set operation, Q′SELECT and Q″SELECT, each having a projection operator on top, and we obtain Q′SETOP by header-casting and combining these projections. If there are no set operations, QSETOP=QSELECT.
Example 21 from the previous section illustrates the translation of the ORDER BY clause, by inserting the sort operator Ψ and forming the order-vector {umlaut over (X)} in all elementary cases. Here we show how does it compose with other SQL query features.
In a query with no set operations, the sort operator is inserted just under the projection which is introduced by SELECT clause, but above any joins with generated sets, which are introduced there also. This way sort operator Ψ has access to both aliases and any aliased attributes, including those not listed in SELECT clause. For example, the following two queries are equivalent:
However, in a query with set operations, projections are needed to form the compatible headers. For example,
This fact agrees with our technique for translating set operations. The translation of the above query would simply put the sort operator on top of the translation QSETOP:
An aggregate query in SQL is a query which contains at least one call to an aggregate function, either in SELECT, HAVING, or ORDER BY clauses. The set of aggregate functions is either limited to the built-in ones, and includes SUM, COUNT, MIN, MAX, AVG, or relies on an extensibility mechanism. An aggregate query may or may not have GROUP BY and/or HAVING clauses.
To illustrate the translation steps, we will follow an example, similar to Example 9 in the previous section, but featuring HAVING and ORDER BY clauses:
In order to determine if a given SQL query is an aggregate query, the following technique is used in our system: each scalar expression in SELECT, HAVING, and ORDER BY clauses is traversed in search for an aggregate function call. If such call is found, the expression under it is put in a list of aggregate calls, where each record F, e, a consists of such an expression e, the aggregate function F being called, and an explicit or generated alias a for this call. Our example query would produce the following list:
The aggregate expressions in the query expressions are rewritten to the aliases from this list, so here is our example query after rewrite:
If the list of aggregates is empty, given SQL query is not an aggregate query, and is translated normally, as described in parts 3.1-3.6 of this section.
The translation of an aggregate query includes inner and outer Natural Algebra expressions, each having their own set of available attributes. The inner expression is formed by translating the query with original FROM and WHERE clauses, and where SELECT clause lists all the distinct under-aggregate expressions listed in the first step, with generated aliases added whenever the expression is not a single attribute. The inner query corresponds to our example:
With generated aliases for under-aggregate expressions are introduced in this step, the expressions in the list of aggregates are rewritten to these aliases, so the first record in the list changes:
No matter how many aggregate functions are involved in an SQL query, their computation needs to be done in one pass over the results of the inner query. The computation itself can thus be a compound aggregate function. It is easily built from the list of aggregates, which is rewritten during the previous step.
The results of single-column aggregate functions are single attribute tuples with unique headers, so a tuple join effectively concatenates them. Each aggregate function has a specification of its input and output attribute in the list of aggregates, so our example we obtain Sum{_aa1}{u}·Min{A.x}{_a2}·Max{A.z}{_a3}.
The outer query consists of the original SELECT clause, rewritten at step 1, with HAVING condition in place of the WHERE clause, and any original ORDER BY clause. As a replacement for the FROM clause we use either a grouping and aggregation operator Ξ if there is a GROUP BY clause, or a direct application of our compound aggregate function to the result of the inner query. Such query is translated normally, as described in sections 3.1-3.6, and the complete translation is
If our example query had no GROUP BY clause (and consequently could not use y attribute in SELECT clause),
Note that the call to aggregate function produces exactly one tuple, and the expression under the sort operator Ψ is either single-tuple or empty k-relation, so the sorting can be safely removed. This query is equivalent to one without ORDER BY clause.
The spectrum of applicable techniques is illustrated in much detail in the examples 12-19 above, so here we provide a short summary.
1) In SQL subqueries may appear
2) There are correlated and uncorrelated subqueries. While the uncorrelated subqueries are straightforward to translate, thanks to the compositionality of Natural Algebra, correlated subqueries depend on the bindings of attributes in the host query, and require either certain rewrites during the translation (predicate lifting, finite scaffolding techniques shown in the examples), or an execution framework will allows passing the context down into the subexpressions being evaluated. We describe such an execution framework in Section 6.
The steps described above are easily generalized to other query languages, for example, SPARQL and LINQ. The following example of LINQ query has obvious counterparts of FROM, WHERE, ORDER BY, and SELECT clauses, which are described above:
Using the same approach, this query is translated to
Section 6 provides more examples of translating queries with object and graph semantics, featuring yet another input language-Starcounter QL, an object-oriented extension of SQL.
Natural Algebra is a common representation of database queries. Once a query is translated to Natural Algebra, it can be validated, optimized, and executed using the system and method defined in this disclosure.
Most of the transformations of the Natural Algebra expressions including simplification, normalization, transformations between operator-rich and operator-homogenic forms, logical and physical optimization, can be expressed in a context-free way as transformation rules, and are the preferred way of transforming Natural Algebra expressions.
A transformation rule consists for input pattern, output pattern, applicability function, and value transformer function. If omitted, applicability function is constant true, and value transformer function is identity function.
with input pattern ΠX(A) and output pattern ∅X+A corresponding to the definition of the projection operator. Any derived operator defined in this disclosure has at least one such rule, and the exhaustive application of this class of rules to any Natural Algebra expression results in operator-homogenic form of that expression, consisting only of natural join and natural union operators.
A reverse rule
would transform an expression towards an operator-rich form, by changing some of the natural union occurrences to projections. The star next to the symbol A means ‘match all the remaining arguments of the commutative and associative operator’. Otherwise, a combinatorial number of matches are possible, while only the fullest one makes practical sense.
Applying R2 against an expression A1+A2+∩{x}) would transform it into Π{x}((A1+A2), and how the input pattern is matched is summarized by the following table:
An example of a simplification rule
has the same feature, since natural join is also commutative and associative, and any single empty operand would make the result of the whole natural join empty.
Note that this kind of transformations can be performed both in schema-aware and schema-ignorant setting, i.e. whether header expressions are immediately evaluated to the sets of attributes, or retained in the symbolic form.
The note on join predicate rewrites after Example 7 given an example of logical-to-physical transformation, which can be expressed as a rule
where the applicability condition checks that the attributes used in inequality condition belong to the respective headers. Additionally, while A* pattern can match any ‘remaining’ part of the n-ary natural join, this rule only makes sense if B exposes a range index on attribute y, i.e. it is either a stored k-relation with such index defined, or is a projection/rename/union of k-relations having this property.
The set of physical indexes Idx(A) available for any subexpression A is propagated through these operators, according to the following principles:
These principles can be formulated as transformation rules, in order to bring the physical access operator Rangey closer to the stored k-relation, while transforming the query in Example 7:
where rule R5 is formulated after the principle (2) above w.r.t. the Rangey index and physical access operator:
The set of exposed physical indexes Idx(A) discussed above is an example of the physical trait which can be recursively computed from leaf to root. Other such examples would include expression's header H(A), positiveness A>0 (i.e. all tuple multiplicities are positive), distinctness guarantee (A=∥A∥), single-tuple guarantee (|A|=1), etc. In our framework this is an open set of attributes, which can be extended in conjunction to adding new transformation rules into the system.
Another class of traits are the contextual traits which needs to be propagated from root to leaves. Examples are a set of bound attributes ↓ Bound , or whether distinct operator is applied on top ↓∥⋅∥.
For example, the 75 ∥⋅∥ contextual trait is useful for transforming a disjunction of filters into a natural union. Under normal circumstances, where multiplicities of tuples do matter, such transformation includes a second-order term, which impacts the rule's usability, since it provides little simplification:
However, under the contextual trait of 75 ∥⋅∥, we have a true normalization/simplification rule:
The same effect can be achieved by without introducing the 75 ∥⋅∥ contextual trait. An alternative set of rules would be
and a number of bi-directional rules for propagating the distinct operator down and up, so that the input pattern of R8 can be eventually matched:
Having the condition in R7 formulated as a contextual trait not only reduces the number or rules which need to be added into the system, but more importantly, the amount of transformations needed to achieve the simplification expressed by the rule.
Transforming a query
which would transform the Natural Algebra expression into
which offers the potential for betted index utilization.
Previously in this section only few examples of different transformation rules were presented, together with particular features of the rule application framework. This disclosure does not aim to present a full and exhaustive list of transformation rules, instead, it presents a framework were rules can be easily formulated, managed, and executed. A transformation rule is a machine-readable and executable interpretation of a mathematical fact about Natural Algebra, the properties of its basic and derived operators.
In any embodiment of the present invention a system (framework) for applying transformation rules would include the following components:
As an optimization, each trait may have its own invalidation logic, while headers H (A) are invariant under all valid Natural Algebra transformations, the inherent sorting order {umlaut over (s)}(A) depends on the physical access operator (index) already chosen during the logical-to-physical transformation of the subexpressions. Similarly with contextual traits, while ↓∥⋅∥ is stable, the set of bound attributes ↓ Bound depends on the position of the current subtree in each n-ary join root-wise, and the attributes bound by the previous arguments to that n-ary join.
The rules framework is where most of the query optimization takes place. Certain algebraization tasks also make use of this framework, as illustrated by the examples in the next section.
Starcounter is multi-paradigm database, which allows querying stored classes using an extension of SQL language, here referred to as Starcounter QL.
In the view of Starcounter QL tables are classes, i.e. collections of instances (objects) with the same schema (set of fields). While one can select all field values from a certain class, using the SQL star syntax:
While the first query selects tuples mapping each attribute of class A to its instance values, the second query returns tuples of width 1, containing objects.
In order to accommodate this dichotomy, we assume that every stored table (class) has special attribute with a reserved name oid, besides all other attributes (fields) defined by the user.
For convenience of translating SQL star-syntax queries, we define a start-header function H*(A). For a particular stored k-relation A, it would contain all its stored attributes except for the reserved oid attribute:
If an argument k-relation is the result of a Natural Algebra operation, it might contain no oid attributes, or might contain multiple oid attributes from different tables. The header H*(AB) would filter out all such attributes from the result of a natural join:
This approach allows translating the first query from this example as
and the second query as
Similarly, pairs of objects form A and B can be returned from a query with a natural join:
The domain of the oid attributes introduced in the previous example is formally defined as a set of objects in the respective class:
While A as a k-relation may grow or shrink during the database lifetime, the respective domain will be changing accordingly. Other attributes in the same or other k-relations might have the same domain, for example a class B might have a field y, referring to objects of class A:
A Starcounter QL query may be formulated to retrieve all objects of class B, together with objects of class A which they reference with y field:
and the domain of the result of such query would be a Cartesian product of two classes as domains:
However, a query may instead retrieve values of x field of objects which are referred by y field of objects of class B:
Note that a prefix for the table A, representing the domain Dom(B.y) is chosen as the fully-prefixed name of that attribute.
In order to perform this translation, one needs to know the column type definition Dom(B.y)=A, which is part of the database schema. In Starcounter QL, one can define a table using a name of another table as a column type:
A star syntax can be used to extract all non-oid fields of y objects:
Which would translate to Natural Algebra as
Longer property chains, as in
There is an ongoing discussion regarding the semantics of property chain with respect to the NULL values in object-referencing attributes. The queries Q3-Q5 from the above example, given the Natural Algebra translations provided, will return no rows for those B.y values which are NULL.
Some applications might instead prefer as many result rows as there are bindings for B.y, whether NULL or not. This alternative semantics can be easily accommodated, by using left equi-joins instead of the default (inner) equi-joins:
According to the definition of the left join, these Natural Algebra expressions with return NULL values in those result rows where there is no object of class A with oid matching B.y attribute (or no object of class C with oid attribute matching the respective B.y.x, for the case of Q5).
For the embodiments that perform a translation to the Natural Algebra in a schema-ignorant way (which was possible in all the examples so far), translating a property chain directly to a join is not possible. An intermediate translation can be formed instead:
The last two queries from the Example 23 can be first translated as
These translations use a property dereference [⋅] operator applied to a certain property in a projection. Now we can formulate a transformation rule translating every occurrence of [⋅] to an equi-join, thus making [⋅] a derived operator:
Where B is any Natural Algebra expression, X is any header expression, p is a property and a is an attribute expression which may involve another instance of [⋅] operator. In case a is not a simple property, but has a form b[g], the condition Dom(a)∈A is defined as follows:
which effectively means that attribute q has A as its domain, and also belongs to the domain of b. This definition works recursively, since b is not required to be a single attribute either.
The first application of the rule R2 to Q5 with the following match:
would transform it to:
Indeed, the dereference operator can appear as part of any property used in a scalar expression defining a generated set, so we will need another rewrite rule for this case:
where a and b are property expressions, p is a single property, and X is the set of other attributes in the projection, B is a Natural Algebra expression with header containing attribute denoted by a, and C is the rest of the expression, which will be bound to the join identity element {*1} if there are no more terms in the join.
The value transformation function C(a[p]c→a.p.c)D obtains a value for rule variable D by exhaustively applying the given attribute expression transformation as a sub-rule to the binding of the variable C in the input pattern. Note that c bindings might be different in each application of this sub-rule, whereas a and p are bound by the host rule.
Applying rule R3 to the latest form of Q5 with the following match:
we obtain Q5(2) which is the translation of Q5 in the Example 23:
In case of the left join based translation of the property chains, as discussed in the previous note, rules R1-R5 from this section would instead introduce left equi-join operator in the output pattern.
Starcounter QL supports a full set of path expressions, borrowed from W3C SPARQL 1.1 Standard. The key semantic feature is the ability to return zero, one, or many values for a property. In Example 23 we were assuming that B.y attribute acts as a foreign key, i.e. it points to at most one existing object in Dom(y). However, such constraint is not required for the property chain syntax to work.
One example of multi-value property is a reverse property. A Starcounter QL query
The Natural Algebra translation of this query, given the presented schema, is
which corresponds to an executable S-expression in the preferred embodiment
Table (class) Company does not participate in the expression, although it is the domain of the employer attribute used in the equi-join.
As in the previous case, this translation can be obtained in two steps, (1) a straightforward schema-ignorant translation using the reverse property dereference {circumflex over ( )}[⋅] operator:
and (2) rule-based transformation for the {circumflex over ( )}[⋅] found among the projection attributes:
with the following match
resulting in the translation of the query based on the equi-join, as presented above.
Another case of a path expression producing 0-to-many results for each object instance is the syntax of alternative paths. Consider a slightly modified Person table
And a query returning all people who have company named ‘Starcounter’ as a current or previous (last) employer:
Is translated to a theta-join, having a disjunction in the join predicate:
which corresponds to an executable S-expression in the preferred embodiment (note the theta-join is ‘flattened’ to a natural join and a filter):
Again, in order to arrive to this translation, the type of the properties employer and lastEmployer needs to be known. An intermediate translation can be constructed in a schema-ignorant way first, using the alternative ∨ operator inside the property reference:
A transformation rule, aware of the types of alternative property paths would then simply translate the occurrences of the property references with ∨ operator inside to the theta-joins (or, depending on the chosen property chain semantics, to the left theta-joins) with disjunction-based join criterion.
Future versions of Starcounter QL may allow different object types of the alternative property paths, in which case the type of the property chain, or the handling of further properties along the property chain (as name in this case), if they only belong to some but not all alternative types, needs to be defined at that point.
The Natural Algebra expressions are capable of fully encoding database queries translated from a wide range of query languages. They can also be transformed to different forms using the transformation rules framework, including the introduction of physical operators into the expression.
Any embodiment should include an executable form of each valid Natural Algebra expression, and this form should be reachable with transformation rules available in the system. For example, the preferred embodiment based on S-expressions, as shown in examples in Section 2, provides a fully composable context-free execution of each Natural Algebra operator, hence any expression form is executable there.
A different embodiment might define a dedicated set of physical operators, partially or completely disjoint with the set of basic and derived Natural Algebra operators introduced so far. A Natural Algebra expression, once fully translated to use this set of operators, may be set to a compiler or interpreter subsystem, where, given access to the data in the database, computation of the result will take place.
Any such execution environment will either feature context-free execution (and thus require correlated subquery rewriting techniques such as predicate lifting and finite scaffolds) or facilitate passing of execution context to subexpressions in the form of bound attributes/variables.
For example, let's return to the query from Example 7, which was partially translated to use physical operator Rangey in Section 4.3:
More decisions on its execution may be taken by choosing an access method for A, for example Scan(A), and a particular implementation of the natural join, for example nested-loop join NLJ:
Even in context-free setting, a nested-loop join operator allows local passing of the attributes bound by the left operand into the right operand, and a the renaming operator ρ, which deals only with headers, is transparent to this context passing, hence A.x is ‘visible’ when executing Rangey.
6.2 Execution Model with Full Context Passing
In Example 14 we translated an SQL query with correlated aggregate subquery into a Natural Algebra expression without any finite scaffolds or lifted predicates:
A number of logical-to-physical transformations would further produce a query using range indexes on both stored k-relations, and a nested-loop join:
Executing this query will require passing the A.x bindings all the way into the subexpressions in the right operand of the nested-loop join. In particular, executing the grouping and aggregate operator Ξ with grouping variable bound would effectively produce zero or one group (result tuple) per binding.
One of the more straightforward ways to implement such execution model would be to
Number | Date | Country | Kind |
---|---|---|---|
1930357-7 | Nov 2019 | SE | national |
1951327-4 | Nov 2019 | SE | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/SE2020/051049 | 10/29/2020 | WO |