a) is a graph representation of a representative DTD.
b) is a simplified representation of
FIGS. 3(a) through 3(c) are exemplary
a) is an exemplary
b) is a four-cycle
FIGS. 5(a) through 5(h) are graphs depicting the processing time for cross cycles using a translation algorithm in accordance with the invention.
FIGS. 6(a) and 6(b) are graphs comparing the translation algorithm in accordance with the invention with existing algorithms.
FIGS. 7(a) through 7(d) are graphs depicting various
The present invention arises in the context of
A
α::=ε|B|α,α|(α|α)|α*,
where ε is the empty word or null set, B is a type in Ele (referred to as a subelement or child type of A), and ‘|’, ‘,’ and ‘*’ denote disjunction, concatenation and the Kleene star, respectively. The A→Rg(A) may be referred to as the production of A. For simplicity, attributes need not be considered here, and it is assumed that an element v may possibly carry a text value (
A
A
A dept
Consider a fragment of an
p ::= ε | A | * | p/p | p//p | p∪p | p[q]
q ::= p | text( )=c | ¬q | q∧q | q∨q
where ε, A and * denote the self-axis, a label and a wildcard, respectively; ‘∪’, ‘/’ and ‘//’ are union, child-axis and descendants-or-self-axis, respectively; and q is called a qualifier, in which c is a constant, and p is the
The q] only appear in the form of p[text( )=c] and p[
q] where p is an
This class of
Consider Two
second one is to find courses that (1) have a prerequisite cs66, (2) have no project related to them or to their prerequisites, but (3) also have a student who registered for the course but did not take cs66. □
The present invention focuses on
To simplify the discussion it may be assumed that τd maps each element of type A to a relation RA in R, which has three columns F (from, i.e., parentId), T (to, i.e., ID) and V (value of all other attributes). Intuitively, in a database τd(Tr) representing an
With the shared-inlining technique, the
The query translation problem from
This section reviews the approach proposed by Krishnamurthy et al. in a paper entitled “Recursive XML Schemas, Recursive XML Queries, and Relational Storage: XML-to-SQL Query Translation” published in ICDE 2004—the only existing solution for the query translation problem in the presence of recursive
The algorithm of Krishnamurthy et al., referred to as SQLGen-R, handles recursive path queries over recursive
If a component ci is cyclic, the
R_0 ← R
R_i ← R_{i−1} ∪ (R_{i−1} ⋈_{C_1} R_1) ∪ . . . ∪ (R_{i−1} ⋈_{C_k} R_k)   (1)
where R0 corresponds to the initialization part, Rj corresponds to an
Recall the mapping from the dept
Observe the following about the query of Table 2. First, it actually requires a fixpoint operator that takes 4 relations as input. As remarked in Section 3, the functionality of φ(R, R1, R2, . . . Rk) is a high-end feature that few
To this end, the present invention provides a new approach to translating
Regular
E ::= ε | A | E/E | E∪E | E* | E[q],
q ::= E | text( )=c | ¬q | q∧q | q∨q.
where A is an element type in D. The semantics of evaluating a regular
Regular
In short, the regular
The simple
R_0 ← R
R_i ← R_{i−1} ∪ (R_{i−1} ⋈_C R_0)   (2)
where C is a Boolean expression on the join. The
To illustrate how the
R←ΠRR3
. . .
Rn
R1) (3)
Here, the projected attributes are taken from the attributes F (from) and T (to) in relations R2 and R1, respectively. The join between Ri/Rj is expressed as RiR
In contrast to the
R_0 ← R
R_i ← R_{i−1} ∪ (R_{i−1} ⋈ R′)
where R′ = ∪_{j=1}^{k} R_j. But this is incorrect, because different conditions are associated with different joins in Eq. (1).
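The single-input fixpoint of Eq. (2) can be illustrated with a minimal Python sketch. It assumes that relations are in-memory sets of (F, T) pairs and that the join condition C is R_{i−1}.T = R_0.F, with the result projected on (R_{i−1}.F, R_0.T). The function name `lfp` is hypothetical, and a relational engine would evaluate the recursion with its own facilities rather than in application code.

```python
def lfp(r0):
    """Iterate R_i = R_{i-1} ∪ (R_{i-1} ⋈ R_0) until no new tuples appear.

    r0 is a set of (F, T) pairs. The join condition assumed here is
    R_{i-1}.T = R_0.F, and each joined pair is projected on (F, T).
    """
    # Index R_0 by its F column so each join step is a dictionary lookup.
    by_from = {}
    for f, t in r0:
        by_from.setdefault(f, set()).add(t)

    result = set(r0)                      # R_0 <- R
    while True:
        joined = {(f, t2) for f, t in result for t2 in by_from.get(t, ())}
        if joined <= result:              # fixpoint reached
            return result
        result |= joined                  # R_i <- R_{i-1} ∪ (R_{i-1} ⋈ R_0)


edges = {(1, 2), (2, 3), (3, 4)}
closure = lfp(edges)
```

Only one input relation and one join condition are involved, which is what distinguishes this operator from the multi-relation fixpoint of Eq. (1).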
Based on the
Suitable translation algorithms are provided below in Sections 6.3 and 6.4. These algorithms produce the equivalent regular
Consider again evaluating the
Rcc←Rc
Rcsc←ΠRR
ΠRR
Rcc∪Rcsc∪Rcpc
Φ(R)∪ΠT,T(Rc)
ΠRR
R
R
Contrast Example 6.5 with the
This section describes an embodiment of the first step of the invention—rewriting an
The algorithm, XPathToReg, exemplifying the first step described above of the method in accordance with the present invention, is based on dynamic programming. For each
In computing each local translation x2r(p, A), the algorithm evaluates sub-query p over the sub-graph of the
To conduct the dynamic-programming computation, the XPathToReg algorithm uses the following variables. First, it constructs a list L that is a postorder enumeration of the nodes in the parse tree of sub-query p, such that all of the sub-queries of sub-query p (i.e., its descendants in sub-query p's parse tree) precede sub-query p in enumerated list L. Second, it puts all the element types of the DTD D in an element list N. Third, for each sub-query p in enumerated list L and each node A in element list N, the expression x2r(p, A) denotes the translated regular sub-query (or local translation) of sub-query p at each node A, which is a regular
q2]: if reach(q1, A) ≠ and reach(q2, A) ≠
x2r([q2], A)];
q2]: if reach(q1, A) ≠ and reach(q2, A) ≠
x2r([q2], A)];
q]: if reach(q, B) = for all B ∈ reach(p′, A)
x2r([q], A)];
denote the regular expression representing all the paths from node A to node B in graph GD, such that the expression rec(A, B) is preferably equivalent to the
In one embodiment, the expressions rec(A, B) and reach(ε//, A) over a recursive
Tarjan's fast algorithm takes O(|D| log |D|) time, which also bounds the size of rec(A, B). Note that rec(A, B) is determined by the DTD D regardless of the input query Q; thus it can be precomputed once for each pair A, B and made available to XPathToReg.
Section 6.3.2 below presents an alternative algorithm for computing the expression rec(A, B).
Also of note is the special query ∅, which returns an empty set over any XML tree, as described in Section 6.1. In the present translation algorithm, the query ∅ is used for optimization purposes. Further, unnecessary occurrences of ε in the input query Q are eliminated by means of the rules p/ε=ε/p=p and p[ε]=p.
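The two elimination rules for ε can be sketched as a small bottom-up rewrite. The tuple encoding of queries used here (`("child", p1, p2)` for p1/p2, `("qual", p, q)` for p[q], `("label", A)` for a label) and the function name `simplify` are assumptions made for illustration only; they are not part of the algorithm as described.

```python
EPS = ("eps",)  # hypothetical encoding of the query ε


def simplify(q):
    """Apply p/ε = ε/p = p and p[ε] = p bottom-up on a tuple-encoded query."""
    if q[0] == "child":                   # ("child", p1, p2) encodes p1/p2
        left, right = simplify(q[1]), simplify(q[2])
        if left == EPS:
            return right                  # ε/p = p
        if right == EPS:
            return left                   # p/ε = p
        return ("child", left, right)
    if q[0] == "qual":                    # ("qual", p, q) encodes p[q]
        path, cond = simplify(q[1]), simplify(q[2])
        return path if cond == EPS else ("qual", path, cond)  # p[ε] = p
    # Any other node: keep the operator, simplify tuple-valued children.
    return tuple(simplify(c) if isinstance(c, tuple) else c for c in q)


# dept/ε/course[ε] reduces to dept/course
query = ("child",
         ("child", ("label", "dept"), EPS),
         ("qual", ("label", "course"), EPS))
reduced = simplify(query)
```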
Algorithm XPathToReg is given in Table 4. It computes EQ=x2r(Q, r) as follows. It first enumerates (a) the list L of sub-queries p in input query Q and (b) the list N of element types in D, and initializes the values of function x2r(p, A) to the special query ∅ and reach(p, A) to the empty set for each p ∈ Q and each element type A ∈ N (lines 1-6). Then, for each sub-query p in list L in topological order and each element type A in list N, it computes the local translation x2r(p, A) (lines 7-63), bottom-up starting from the inner-most sub-query of Q. To do so, it first computes local translation elements x2r(pi, Bj) for each immediate sub-query pi of p at each possible
As seen from the algorithm itself, the details of this combination are determined based on the formation of sub-query p from its immediate sub-queries pi, if any (cases 1-12). In particular, in the case p=ε//p1 (case 5), the algorithm ranges over the children C of A to compute rec(C, _) instead of rec(A, _) since the context node A is already in the latter, where ‘_’ denotes an arbitrary type.
The special case that arises when the immediate sub-query p1 is of the form B/p′ is handled by using rec(C, B)/x2r(p′, B). Note that when sub-query p is a qualifier [q] (cases 7-12), the algorithm may evaluate the qualifier [q] to a truth value (ε for true and ∅ for false) in certain cases based on the structure of the
At the end of the iteration, the algorithm obtains the regular equivalent
Recall the
E
Q
=dept/course[EcourseEcourse
takenBy/student/Equalified
where the following is computed by Tarjan's fast algorithm:
Ecourse,course = rec(course, course) = course/E1* ∪ E2+/E1*,
Ecourse,project = rec(course, project) = (course/E1* ∪ E2+/course/E1*)/project,
Equalified,course = rec(qualified, course) = qualified/course/E1* ∪ (qualified/E2)+/course/E1*,
E1 = prereq/course ∪ takenBy/student/qualified/course,
E2 = course/E1*/project/required.
Algorithm XPathToReg takes at most O(|Q|·|D|^3) time, since each step in the iteration takes at most O(|D|) time, except that Case 5 may take O(|D|^2) time. The size of the list L is linear in the size of Q, and the expression rec(A, B) may be precomputed as soon as the
Algorithm XPathToReg has a number of highly advantageous characteristics. First, regular
A preferred criterion for computing a regular
In response to this, the inventors have developed a new algorithm for computing the regular expression rec(A, B), referred to as Algorithm Cycle-C, which is a heuristic for reducing, and preferably minimizing, the number of Kleene closures in a resulting regular
Algorithm Cycle-C is based on the idea of graph contraction: given a
As an example, assume that all the AB-paths are L1, . . . , Ln, where each Li is of the form A1→ . . . →Ak, with A=A1 and B=Ak. Algorithm Cycle-C encodes each path Li with a regular expression Ei, which has an initial value A1/ . . . /Ak. Then, for each simple cycle Cj “connected” to Ai, the algorithm encodes the cycle Cj with a simple regular expression EC
Below are discussed the various cases dealt with by the Cycle-C algorithm, starting from simple ones.
First, assume that Ai ∈ GD is the only node shared by L and C=Ai→A′1→ . . . →A′m→Ai. Then, the regular expression E=Ea/Eγ/Eb captures all the paths between A and B, where Ea=A1/ . . . /Ai, Eb=Ai+1/ . . . /Ak, and Eγ is EC* with EC=A′1/ . . . /A′m/Ai.
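The construction E=Ea/Eγ/Eb for a single path and a single cycle can be sketched as plain string assembly. The function name `contract_single_cycle`, the list-of-labels encoding and the 0-based index are assumptions made for illustration; the sketch covers only this simplest case and is not the Cycle-C algorithm itself.

```python
def contract_single_cycle(path, cycle, i):
    """Build E = Ea/Eγ/Eb for an AB-path and one simple cycle through path[i].

    path  : labels A1..Ak of the AB-path
    cycle : labels A'1..A'm of the cycle Ai -> A'1 -> ... -> A'm -> Ai
    i     : 0-based index of the shared node Ai on the path
    """
    e_a = "/".join(path[: i + 1])              # Ea = A1/.../Ai
    e_b = "/".join(path[i + 1:])               # Eb = Ai+1/.../Ak (may be empty)
    e_c = "/".join(list(cycle) + [path[i]])    # EC = A'1/.../A'm/Ai
    parts = [e_a, f"({e_c})*"]                 # Eγ = EC*
    if e_b:
        parts.append(e_b)
    return "/".join(parts)


# Path a -> b -> c with the cycle b -> x -> y -> b attached at b.
expr = contract_single_cycle(["a", "b", "c"], ["x", "y"], 1)
```

When the shared node is the last node of the path, Eb is empty and the expression ends with the starred cycle.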
Second, suppose that L and cycle C share more than one node, say, nodes Ai and Aj. In this case, cycle C only needs to be incorporated into E at one of those nodes, either at node Ai or node Aj, because Eγ has already covered the connections between nodes Ai and Aj. Thus regular expression E is the same as the one given above. This property allows us to find Eγ using an arbitrary node Ai shared by multiple simple cycles.
Case-2. There exist a single AB-path L and multiple simple cycles C1, . . . , Cn, while all these cycles share a single node Ai on L. Here the regular expression E is a mild extension of case-1: E is Ea/Eγ/Eb while Eγ=(EC
A case similar to Case 2 was given in Example 6.5. Consider the expression Rd//Rp over the
Case-3. There exist a single AB-path L and multiple simple cycles C1, . . . , Cn, but not all the cycles share a node on L. For example,
Observe the following. First, Eγ2 covers all possible paths that traverse Eγ1 since Eγ2 includes Eγ1 by replacing a with Eγ1, and E covers all possible paths between a and c. Second, the result does not depend on the order in which the cycles are processed. One may first process C2 and C3 and obtain Eγ2, and then let Eγ1 include Eγ2 by replacing c with Eγ2.
Case-4. There are multiple AB-paths.
Case-5. There are a single AB-path L and multiple simple cycles, but not all cycles are directly connected to path L. For example,
Putting these cases together, the Cycle-C algorithm is presented in Table 5. It takes as inputs a
More specifically, the Cycle-C algorithm first identifies all the AB-paths L1, . . . , Ln in GD and for each path Li, finds the subgraph Gi that consists of that path Li along with all the simple cycles that are connected to that path Li, directly or indirectly (lines 1-2). The simple cycles Ci connected to each path Li are preferably determined using a known algorithm such as that described by H. Weinblatt in his article entitled “A New Search Algorithm for Finding the Simple Cycles of a Finite Directed Graph,” JACM 19(1):43-56, 1972. Second, after determining the simple cycles Ci connected to a given path Li, the Cycle-C algorithm then topologically sorts these cycles based on their shortest distance to any node on the path Li (line 6). Third, for each of these cycles starting from the one with the longest distance to Li, it contracts the cycle based on case-5 above (lines 4-12). Fourth, it identifies
all Aj nodes shared by some simple cycles (line 13) with path Li, and contracts those simple cycles to a single node based on cases 1-3 above (lines 14-16). Finally, it produces and returns the resulting regular expression based on case 4 above (line 17). Advantageously, the resulting regular expression rec(A, B) returned by algorithm Cycle-C captures all and only the paths between nodes A and B in
Recall the regular
Ecourse,course = course/Ecc,
Ecourse,project = course/Ecc/project,
Equalified,course = qualified/course/Ecc,
Ecc = (E1 ∪ project/required/course)*,
E1 is the same as the one given in Example 6.6.
This section describes an algorithm embodying the second step of the present invention as described above, namely, rewriting regular
An algorithm for rewriting regular
An issue that arises with this approach is that the Rid=Rid
R=R for any relation R. With this assumption, the expression (E)* may be translated to Φ(R)∪Rid, where R codes E and Rid tuples will be eliminated at a later stage. To simplify the presentation of the translation algorithm, null set ε is re-written here into Rid. In practice, other more efficient translations may be used in accordance with known techniques.
The translation algorithm RegToSQL for rewriting regular
, where τ : D →
.
R
R
q2]: let R1 = r2s(q1); R2 = r2s(q2);
q2]: let R1 = r2s(q1); R2 = r2s(q2);
q]: let Rq = r2s(q), R1 = r2s(e1);
R
in Table 6. The algorithm receives a regular
The algorithm is based on dynamic programming: for each sub-expression e of regular
More specifically, the algorithm first finds the list L of all sub-expressions of regular
(1) A label A in terms of the relation RA (case 2).
(2) Concatenation ‘/’ with projection Π and join (case 3).
(3) Union and disjunction with union ∪ in relational algebra (cases 4, 10).
(4) Kleene closure (E)* with the LFP operator Φ (case 5).
(5) e1[q] is converted to a relational algebra query r2s(e) that returns only those r2s(e1) tuples t1 for which there exists an r2s(q) tuple t2 with t1.T=t2.F, i.e., when the qualifier q is satisfied at the node represented by t1.T (case 6). Conversely, the algorithm rewrites e1[¬q] to a relational algebra query r2s(e) that returns only those r2s(e1) tuples t1 for which there exists no r2s(q) tuple t2 such that t1.T=t2.F, i.e., when the qualifier q is not satisfied at the node t1.T (and hence [¬q] is satisfied at t1.T; case 11); this captures the semantics of negation in XPath (recall the assumptions about [¬q] and [text( )=c] set forth in Section 6.1 above).
(6) [e1] is rewritten into r2s(e1) (case 7).
(7) e1[text( )=c] in terms of selection σ that returns all tuples of r2s(e1) that have the text value c (case 8).
(8) Conjunction q1∧q2 in terms of set intersection implemented with union ∪ and set difference \ in relational algebra (case 9).
In each of the cases above, the list Q′ is incremented by adding Re←r2s(e) to Q′ as the head of Q′ (line 24).
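The case analysis above can be illustrated with a small in-memory evaluator. This is a sketch under stated assumptions, not the RegToSQL algorithm: it evaluates expressions directly instead of emitting a query list Q′, it encodes expressions as hypothetical tuples such as `("seq", e1, e2)` and `("star", e)`, and it evaluates qualifiers to sets of nodes (via the helper `sat`) rather than to relations. Each relation RA is a set of (F, T, V) triples as in the storage mapping described earlier.

```python
def nodes_of(db):
    """All node ids that occur in an F or T column."""
    return {n for rel in db.values() for f, t, _ in rel for n in (f, t)}


def closure(r):
    """Simple LFP: repeatedly join the result with r on T = F."""
    result = set(r)
    while True:
        new = {(f, t2) for f, t in result for f2, t2 in r if t == f2} - result
        if not new:
            return result
        result |= new


def r2s(e, db):
    """Evaluate a tuple-encoded regular expression to a set of (F, T) pairs."""
    kind = e[0]
    if kind == "eps":                      # ε as the identity relation Rid
        return {(n, n) for n in nodes_of(db)}
    if kind == "label":                    # label A as relation RA
        return {(f, t) for f, t, _ in db.get(e[1], set())}
    if kind == "seq":                      # e1/e2 as join plus projection
        r1, r2 = r2s(e[1], db), r2s(e[2], db)
        return {(f, t2) for f, t in r1 for f2, t2 in r2 if t == f2}
    if kind == "union":                    # e1 ∪ e2
        return r2s(e[1], db) | r2s(e[2], db)
    if kind == "star":                     # (e)* as Φ(R) ∪ Rid
        return closure(r2s(e[1], db)) | r2s(("eps",), db)
    if kind == "qual":                     # e1[q]: keep tuples whose T satisfies q
        ok = sat(e[2], db)
        return {(f, t) for f, t in r2s(e[1], db) if t in ok}
    raise ValueError(f"unknown expression kind: {kind}")


def sat(q, db):
    """Nodes at which qualifier q holds (a simplification made in this sketch)."""
    kind = q[0]
    if kind == "text":                     # text()=c as a selection on V
        return {t for rel in db.values() for _, t, v in rel if v == q[1]}
    if kind == "not":                      # ¬q as set difference
        return nodes_of(db) - sat(q[1], db)
    if kind == "and":
        return sat(q[1], db) & sat(q[2], db)
    if kind == "or":
        return sat(q[1], db) | sat(q[2], db)
    return {f for f, _ in r2s(q, db)}      # path qualifier: some tuple starts here


# Hypothetical data: node 0 stands for the parent of the root dept element.
db = {
    "dept": {(0, 1, None)},
    "course": {(1, 2, "cs10"), (1, 3, "cs20"), (4, 5, "cs66")},
    "prereq": {(2, 4, None)},
    "project": {(3, 6, None)},
}
dept, course = ("label", "dept"), ("label", "course")
prereq, project = ("label", "prereq"), ("label", "project")

# dept/course/(prereq/course)*
all_courses = r2s(("seq", dept, ("seq", course, ("star", ("seq", prereq, course)))), db)
# dept/course[¬project]
no_project = r2s(("seq", dept, ("qual", course, ("not", project))), db)
```

Evaluating qualifiers to node sets keeps conjunction and negation as plain set operations, at the cost of departing from the relation-valued treatment in the cases listed above.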
Finally, after the iteration, the algorithm yields πTσF=‘
Recall the
Ecc: Rγ with
EcourseRγ,
EcourseRγ
Rp,
Equalified
EcourseRc)
takenBy/student/EqualifiedRqc)
Note that Q2 is of the form (with a complex qualifier) dept/course[q1q2
q3], which is handled by our algorithms by treating it as Q21=dept/course[q1], Q22=Q21[
q2] and Q2=Q22[
q3]. Thus Q21←Rd
Rc
R1, Q22←Q21\(Q21
Rcp), and EQ
R2) where projections are omitted. In contrast, the algorithm of Krishnamurthy et al. cannot translate
It can be verified that algorithm RegToSQL takes at most O(|EQ|) time. As such, it will be understood that the present invention, comprising the steps set out in algorithms XPathToReg and RegToSQL, provides a method for rewriting each
Observe the following. First, algorithm RegToSQL shows that the simple
Algorithms XPathToReg and RegToSQL show that
In particular, selections may be pushed into R2 yields the right answer, the performance may be improved by pushing the selection into the
Similarly, the selection in Rd//Rc/Rp[id=c] can be pushed into
To verify the effectiveness of the rewriting and optimization algorithms presented above, the inventors evaluated
The present inventors experimented with these algorithms using (a) a simple yet representative
Implementation. The inventors implemented, in Visual C++, a prototype system supporting SQLGen-R, Cycle-E and Cycle-C, denoted by R, E and C in the figures, respectively. Rewritten SQL queries were executed in a batch. This prototype system included only certain basic optimizations, e.g., common sub-expressions were executed only once. Experiments were conducted using IBM DB2 (UDB 7) on a single 2 GHz CPU with 1 GB main memory. The queries output ancestor-descendant pairs.
Testing Data: Testing data was generated using IBM
Relational Database. Once generated, the
Query Evaluation. (1) Four
For the simple cross-cycle
Qa=a/b//c/d (with //),
Qb=a[ε//c]//d (a twig join query),
Qc=a[ε//c] (with
and //), and
Qd=a[ε//c
(b
ε//d)] (with
, 77 ,
and //).
The XPathToReg algorithm rewrites these queries into four Ea,b/c], and Q′d=α[
Ea,b/c
(b
Ea,c/d)], while the Cycle-E algorithm generates:
Eb,c=rec(b,c)=(Ebb∪(Ebb/c/a/(Ebb/c/a)*/Ebb))/c
Ea,b=rec(a,b)=a/(Ebb/c/a)*/Ebb
Ea,c=rec(a,c)=a/(Ebb/c/a)*/Ebb/c
Ebb=b/(c/d/b)*
In contrast, Cycle-C generates the following:
Eb,c=rec(b,c)=b/(c/a/b∪c/d/b)*/c,
Ea,b=rec(a,b)=a/b/(c/a/b∪c/d/b)*,
Ea,c=rec(a,c)=a/b/(c/a/b∪c/d/b)*/c.
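The Cycle-C expressions above can be sanity-checked by brute force against a small graph. The edge set used below (a→b, b→c, c→a, c→d, d→b) is an assumption inferred from the cycles c/a/b and c/d/b that appear in the expressions; the helper names are hypothetical. Because every label is a single letter, a path expression can be turned into an ordinary regular expression by dropping ‘/’ and replacing ‘∪’ with ‘|’.

```python
import re
from itertools import product

# Assumed cross-cycle graph: a->b, b->c, c->a, c->d, d->b.
EDGES = {"a": "b", "b": "c", "c": "ad", "d": "b"}


def to_regex(expr):
    """Translate a path expression over one-letter labels into a Python regex."""
    return re.compile(expr.replace("/", "").replace("∪", "|"))


def is_path(word, start, end):
    """True if word spells a walk in EDGES from start to end."""
    if not word or word[0] != start or word[-1] != end:
        return False
    return all(y in EDGES[x] for x, y in zip(word, word[1:]))


def agrees(expr, start, end, max_len=7):
    """Check that expr matches exactly the walks from start to end, up to max_len labels."""
    pattern = to_regex(expr)
    for n in range(1, max_len + 1):
        for letters in product("abcd", repeat=n):
            word = "".join(letters)
            if bool(pattern.fullmatch(word)) != is_path(word, start, end):
                return False
    return True


ok_bc = agrees("b/(c/a/b∪c/d/b)*/c", "b", "c")
ok_ab = agrees("a/b/(c/a/b∪c/d/b)*", "a", "b")
ok_ac = agrees("a/b/(c/a/b∪c/d/b)*/c", "a", "c")
```

The check is bounded by `max_len`, so it can expose a wrong expression but cannot prove equivalence for paths of arbitrary length.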
For each expression rec(A,B), the Cycle-C algorithm uses one
These tests used an
Two
b) demonstrates the scalability of the algorithms described herein by increasing the dataset sizes, for an
All these
and 236,260, respectively.
As shown in
There has been provided a new approach to translating a practical class of
Although the invention has been described in language specific to