The present invention discloses a method and system for generating nested mapping specifications in a schema mapping formalism and for generating transformation queries based on the nested mapping specifications.
Declarative schema mapping formalisms have been used to provide formal semantics for data exchange, data integration, peer data management, and model management operators such as composition and inversion. For relational schemas, widely used known formalisms for schema mappings are based on source-to-target tuple-generating dependencies (source-to-target tgds) or, equivalently, global-and-local-as-view (GLAV) assertions. Known direct extensions to schema mapping formalisms exist for schemas (e.g., eXtensible Markup Language (XML) schemas) containing nested data. These conventional formalisms provide inaccurate or underspecified mappings. Further, conventional mapping specifications generated under these known formalisms are fragmented into many small, overlapping formulas where the overlap may lead to redundant computation, hinder human understanding of the mappings, and/or limit the effectiveness of mapping tools. Thus, there exists a need to overcome at least one of the preceding deficiencies and limitations of the related art.
In first embodiments, the present invention provides a computer-implemented method of generating nested mapping specifications, the method comprising:
receiving, by a computing system, one or more source schemas, a target schema, and one or more correspondences between one or more elements of each source schema of the one or more source schemas and one or more elements of the target schema;
generating, by the computing system, a set of basic mappings based on the one or more source schemas, the target schema, and the one or more correspondences;
constructing, by the computing system, a directed acyclic graph (DAG) whose edges represent all possible ways in which each basic mapping of the set of basic mappings is nestable under any other basic mapping of the set of basic mappings;
removing, by the computing system, any transitively implied edges from the DAG;
identifying, by the computing system and subsequent to the removing, one or more root mappings of the DAG; and
extracting, automatically by the computing system, one or more trees of mappings from the DAG, each tree of mappings being rooted at a root mapping of the one or more root mappings and each tree of mappings expressing a nested mapping specification.
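By way of a non-limiting illustration, the graph-related steps recited above (removing transitively implied edges, identifying root mappings, and extracting trees of mappings) may be sketched as follows. The function names, the edge convention (an edge from a mapping to each mapping that is nestable under it), and the four-mapping example graph are assumptions made for this sketch only and are not taken from the described embodiments.

```python
def transitive_reduction(nodes, edges):
    """Drop each edge (u, v) for which another path from u to v exists."""
    succ = {n: set() for n in nodes}
    for parent, child in edges:
        succ[parent].add(child)

    def reachable_without(start, skipped):
        # Depth-first search from start that never traverses the skipped edge.
        seen, stack = set(), [start]
        while stack:
            u = stack.pop()
            for v in succ[u]:
                if (u, v) != skipped and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    return {(u, v) for (u, v) in edges
            if v not in reachable_without(u, (u, v))}


def find_roots(nodes, edges):
    """Root mappings are the nodes that are not nestable under any other node."""
    has_parent = {child for _, child in edges}
    return sorted(n for n in nodes if n not in has_parent)


def extract_tree(root, edges):
    """Follow the remaining edges downward to build one tree per root."""
    children = sorted(child for parent, child in edges if parent == root)
    return {root: [extract_tree(c, edges) for c in children]}


# Hypothetical nestability edges (parent, child) over four basic mappings.
nodes = ["m1", "m2", "m3", "m4"]
edges = {("m1", "m2"), ("m1", "m3"), ("m1", "m4"), ("m2", "m3"), ("m2", "m4")}

reduced = transitive_reduction(nodes, edges)
trees = [extract_tree(r, reduced) for r in find_roots(nodes, reduced)]
```

In this sketch a mapping that remains nestable under several parents after the reduction would appear in several extracted trees; how such cases are handled is left open here.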
In second embodiments, the present invention provides a computer-implemented method of generating a transformation query from a nested mapping specification based on a source schema and a target schema, the method comprising:
generating, by a computing system, a first-phase query for transforming source data into a set of flat views of the target schema; and
generating, by the computing system, a second-phase query as a wrapping query for a nesting of data of the flat views according to a format of the target schema,
wherein the generating the first-phase query includes:
Systems and computer program products corresponding to the above-summarized methods are also described herein.
Advantageously, the present invention provides a nested mapping formalism and technique for generating nested mapping specifications and transformation queries based thereon that permit the expression of powerful grouping and data merging semantics declaratively within the mapping. Further, the nested mapping formalism described herein yields more accurate specifications, and when used in data exchange, improves the quality of exchanged data (e.g., reduces redundancy in the target data) and drastically reduces the execution cost of producing a target instance. The nested mappings described herein naturally preserve correlations among data that existing mapping formalisms cannot. Still further, the nested mapping formalism provides an ability to express, in a declarative way, grouping and data merging semantics that are easily changed and customized to any particular integration task. Further yet, the transformation query generation technique described herein scales well to large, highly nested schemas.
Many problems in information integration rely on specifications that model relationships between schemas. These specifications, called schema mappings, play a central role in both data integration and data exchange. Considered herein are schema mappings over pairs of schemas that express a relation on the sets of instances of two schemas. Presented herein is a new formalism for schema mapping that extends existing formalisms in two significant ways. First, nested mappings allow for nesting and correlation of mappings. Second, the extension to the mapping formalism includes an ability to express, in a declarative way, grouping and data merging semantics. Further, the present invention includes a new algorithm for the automatic generation of nested mapping specifications from schema matchings (i.e., simple element-to-element correspondences between schemas). Still further, the present invention includes the implementation of this algorithm, along with algorithms for the generation of transformation queries (e.g., XQuery) based on nested mapping specifications.
Source-to-target tgds and GLAV assertions are constraints between relational schemas. They are expressive enough to represent, in a declarative way, many of the relational schema mappings of interest. This section examines an extension of source-to-target tgds that is designed for schemas with nested data, that is based on path-conjunctive constraints, and that has been used in systems for data exchange, data integration, and schema evolution. Such mappings are referred to herein as basic mappings. These mappings form the basic building blocks for the nested mappings discussed below. In related literature, these basic mappings have sometimes been referred to as nested constraints or dependencies, since they are constraints on nested data. The mappings themselves, however, have no structure or nesting. Hence, the present application uses the term “basic” to distinguish these mappings from the more structured nested mappings that are discussed below. The basic mappings referred to herein are the logical mappings described in U.S. Patent Application Publication No. 2004/0199905 A1 (Fagin et al., “System and method for translating data from a source schema to a target schema”), which is hereby incorporated herein by reference in its entirety. The basic mappings referred to herein are also the mappings described in U.S. patent application Ser. No. 11/343,503.
To illustrate the use of basic mappings, consider mapping example 100 shown in
The formulas that are presented below the schemas in example 100 are examples of basic mappings. The formulas are constraints that describe, in a declarative way, the mapping requirements. These formulas may be generated by a tool from the correspondences between schema elements, or may be written by a human expert and interpreted by a model management tool or other integration tools. Section 2 provides a precise semantics for the schema and basic mapping notation.
Each formula (i.e., each mi) in example 100 addresses one possible “case” in the source data, where each case is expressed by a conjunction of navigation paths joined in certain ways. In order to cover all possible cases of interest, many such formulas are needed. However, many of the cases overlap (i.e., have common navigation paths). Hence, common mapping behavior must be repeated in many formulas. For example, the formula m2 must repeat the mapping behavior that m1 already specifies for department data, although m2 includes the mapping behavior for department data in a more specialized context. Otherwise, if only the mapping behavior for employees is specified in m2, the association that exists in the source between employees and their departments is lost in the target since there is no correlation between m1 and m2. At the same time, m1 cannot be eliminated from the specification, since m1 deals with departments in general (i.e., departments that are not required to have employees). Also, in example 100, m3 and m4 include a common mapping behavior for employees and departments, but m3 and m4 differ in that they map different components of employees: dependents and projects.
Such formulas are relatively easy to generate and reason about. This is partly why they have been widely used in research. However, the number of formulas quickly increases with large schemas, leading to an explosion in the size of the specification. This explosion as well as the overlap in behavior causes significant usability problems for human experts and for tools using these specifications in practice.
Inefficiency in execution: In a naive use of basic mappings, each mapping formula may be interpreted separately. Optimization of these mappings requires sophisticated techniques that deduce the correlations and common subexpressions within the mappings.
Redundancy in the specification: When using basic mappings in data exchange, the same piece of data may be generated multiple times in the target due to the multiple formulas. In addition to possible run-time inefficiency, this multiple generation of the same piece of data puts additional burden on methods for duplicate elimination or data merging. In example 100, an employee may be generated three times in the target: once for m2 with an empty set of dependents and an empty set of projects, once for m3 with a non-empty set of dependents and once for m4 with a nonempty set of projects. Merging of the three employee records into one is more than just duplicate elimination: it requires merging of two nested sets as well. Furthermore, this raises the question of when to merge in general since this is not expressed in any way by the mapping formulas of
Underspecified grouping semantics: The formula m2 requires that for every department and for every employee record in the source there must exist, in the target, a “copy” of the department record with a “copy” of the employee record nested underneath. However, it is left unspecified whether to group multiple employees who are common for a given department name (dname), or whether to group by other fields, or whether not to group at all. Again, one of the reasons for this lack of expressive power is the simplicity of these basic mapping formulas. A known default grouping behavior is based on partitioned normal form (PNF) which always groups nested sets of elements by all the atomic elements at the upper levels. Under PNF semantics in example 100, employees are grouped by dname and location, assuming that budget is not mapped and its value is null. In effect, the semantics of the transformation is specified in two parts: first the mapping formulas, and then the implicit PNF-based grouping semantics. An important limitation of this approach is that the default grouping semantics is not specified declaratively, and it cannot be easily changed or customized when it is not the desired semantics.
In order to address the issues described in Section 1.1, the present invention includes an extension to basic mappings that is based on arbitrary nesting of mapping formulas within other mapping formulas. This formalism is referred to herein as the language of nested mappings. Nested mappings offer a more natural programming paradigm for mapping tasks, since human users tend to design a mapping from top to bottom, component-wise: define first how the top components of a schema relate, then define, recursively, via nested submappings, how the subcomponents relate, and so on. The nested mapping corresponding to example 100 is illustrated by nested mapping 200 in
Advantages of nested mappings: To a large extent, nested mappings overcome the aforementioned shortcomings of basic mappings. First, fewer formulas are needed and overall a more natural and accurate specification is produced. For the corresponding examples shown in
Nested mappings also have a natural, built-in, grouping behavior that follows the grouping of data in the source. For example, the nested mapping in
Summary of Contributions: The present invention includes a nested mapping formalism for representing the relationship between schemas for relational or nested data (see Section 2). Further, an algorithm for generating nested mappings from matchings (i.e., correspondences) between schema elements is described herein. The nested nature of the mappings makes this generation task more challenging than in the case of basic mappings (see Section 3). Still further, the present invention includes an algorithm for the generation of data transformation queries that implement data exchange based on nested mapping specifications. Notably this algorithm for generating transformation queries can handle all nested mappings, including those generated by the mapping algorithm described herein as well as arbitrary customizations of these mappings. Such customizations of mappings are made, for example, by a user to capture specialized grouping semantics (see Section 4). Further yet, the description that follows illustrates experimentally that the use of nested mappings in data exchange drastically reduces the execution cost of producing a target instance, and also dramatically improves the quality of the generated data. Examples of important grouping semantics that cannot be captured by basic mappings and an empirical showing that underspecified basic mappings may lead to significant redundancy in data exchange are shown below (see Section 5).
This section describes the notation and terminology for schemas and mappings. Further, qualitative differences between basic mappings and nested mappings are described in detail.
Consider the mapping scenario 300 illustrated in
Formally, a schema is a set of labels (a.k.a. the roots of the schema or schema roots), each with an associated type τ, defined by τ::=Str|Int|SetOf τ|:[l1:τ1, . . . , ln:τn], where l1, . . . , ln are labels.
Logic-based notation: Alternatively, a “logic-based” notation is used for mappings that quantify each individual component in a record as a variable. In particular, nested sets are explicitly identified by variables. Each mapping is an implication between a set of atomic formulas over the source schema and a set of atomic formulas over the target schema. Each atomic formula is of the form e(x1, . . . , xn) where e denotes a set, and x1, . . . , xn are variables. For simplicity of presentation, a strict alternation of set and record types in a schema is assumed herein. The main difference from the traditional relational atomic formulas is that e may be a top-level set (e.g., proj), or it may be a variable in order to denote sets that are nested inside other sets. As presented in formulas herein, the atomic variables are written in lower-case and the set variables in upper-case. The formulas corresponding to the mappings m1 and m2 of
m1: proj(d,p,Es) → dept(d,?b,?E,?P) ∧ P(?x,p) (1)
m2: proj(d,p,Es) ∧ Es(e,s) → dept(d,?b,?E,?P) ∧ E(e,s,?P′) ∧ P′(?x) ∧ P(x,p) (2)
For each of the formulas (1) and (2) presented above, the variables on the left of the implication are assumed to be universally quantified. In formulas (1) and (2), the variables on the right of the implication that do not appear on the left of the implication are assumed to be existentially quantified. For clarity, the quantifiers are omitted and a question mark is used in front of the first occurrence of an existentially-quantified variable.
For example, in m2 (i.e., formula (2) presented above), the variable Es denotes the nested set of employee records inside a tuple in the top-level set proj. The variables E, P, and P′ are also set variables, but existentially quantified. The variables b (i.e., denoting budget) and x (i.e., denoting project id) are existentially quantified as well, but are atomic. The meaning of m2 is: for every source tuple (d, p, Es) in proj, and for every tuple (e, s) in the set Es, there must exist four tuples in the target as follows. First, there must be a tuple (d, b, E, P) in dept, where b is some “unknown” budget, E identifies a set of employee records, and P identifies a set of project records. Then, there must exist a tuple (e, s, P′) in E, where P′ identifies a set of project ids. Furthermore, there must exist a tuple (x) in P′, where x is an “unknown” project id. Finally, there must exist a tuple (x, p) in the previously mentioned set P, where x is the same project id used in P′. Notice that all data required to be in the target by the mapping satisfies the foreign key for the projects.
In this section, actual data is presented in order to provide an understanding of the semantics of basic mappings, and to see why such specification language is not entirely satisfactory. In example 400 in
In general, for a given source instance, there may be several target instances satisfying the constraints imposed by the mapping specification. Given the specification {m1,m2}, the target instance shown in
A target instance 500 that is more “desirable” is shown in
The present invention provides a specification that “enforces” correlations such as the ones that appear in the more “desirable” target instance (e.g., that the two source employees appear in the same set in the target). In particular, it would be advantageous to correlate the mapping m2 with m1 so that it reuses the set id E for employees that is already asserted by m1 along with other existentially-quantified elements in m1, without repeating the common part, which is m1 itself. This correlating of the mapping m2 with m1 is done using the following nested mapping:
n: proj(d,p,Es) →
[dept(d,?b,?E,?P) ∧ P(?x,p) ∧
[Es(e,s) → E(e,s,?P′) ∧ P′(x)]] (3)
The inner implication in n (i.e., the third line of the nested mapping (3) shown above) is a submapping. The rest of n is referred to as the outer mapping. The submapping is correlated to the outer mapping because it reuses the existential variables E and x. In particular, the submapping requires that for every employee tuple in the set Es, where Es is bound by the outer mapping, there must exist an employee tuple in the set E, which is also bound by the outer mapping. Also, there must exist a project tuple in the set P′ associated to this employee, and the project id must be precisely the one (i.e., x) already required by the outer mapping. Note that P′ is now existentially quantified and bound in the inner mapping.
A fundamental observation about the nested mapping n shown above is that the “undesirable” target instance of
Another important observation is that there is no set of basic mappings that is equivalent to the nested mapping (3) shown above. Thus, the language of nested mappings is strictly more expressive than that of basic mappings.
Finally, a query-like notation (4) for the nested mapping (3) is presented below. Notice that the variables p, d′ and p′ from the outer level are being reused in the inner level.
As seen in the example presented in Section 2.2, nested mappings can take advantage of the grouping that exists in the source, and require the target data to have a similar grouping. In the example of Section 2.2, all the employees that are nested inside one source tuple are required to be nested inside the corresponding target tuple. This section shows how a restricted form of Skolem functions can be used to model groupings of data that may not be present in the source.
To illustrate, consider again the source schema in
Suppose now that all projects of a department are to be grouped into one set. Similarly, all the projects for each employee in a department are to be grouped into one set. Also, all the employees for a given department are to be merged. To generate such new groupings of data, an addition to the specification is required, since nesting of mappings alone is not flexible enough to describe such groupings. The mechanism added to the specification is that of Skolem functions for set elements. Intuitively, such functions express that certain sets in the target must be functions of certain values from the source. For the example presented above, to express the desired grouping, the nested mapping is enriched with three Skolem functions for the three nested set types in the target, as follows:
The new mapping constrains the target set of projects to be a function of only department name: P[p.dname]. Also, there must be only one set of employees per department name, E[p.dname], meaning that multiple sets of employees for different source tuples with the same department name must be merged into one set. Similarly, all projects of an employee in a department must be merged into one set.
More concretely, for the source tuple proj(CS, usearch, E0) of
The following natural restriction should be noted: The for clause of a submapping can use a correlation variable (i.e., bound in an upper-level mapping) only if that variable is bound in a for clause of the upper-level mapping. A similar restriction holds for the usage of correlation variables in exists clauses.
Using the logic-based notation, every nested mapping having no explicit Skolem functions is equivalent to one in which default Skolem functions are assigned to all the existentially-quantified set variables. The default arguments to such Skolem functions are all the universally quantified variables that appear before the set variable.
As an example, the aforementioned nested mapping n is equivalent to one in which the target set of projects nested under each dept tuple is determined by a Skolem function of all three components of the input proj tuple (i.e., dname, pname, and emps). In other words, there must be a set of target projects for each input proj tuple. Of course, this set of target projects needed for each input proj tuple does not require any grouping of projects by departments. However, once exposed to a user, the Skolem functions can be customized in order to achieve different grouping behavior, such as the one seen with the earlier mapping n′. The approach followed by the present invention is: first generate nested mappings with no Skolem functions, and then apply default Skolemization, which can then be altered in a GUI by a user.
Skolem functions and data merging: The example presented in Section 2.3 illustrates how one occurrence of a Skolem function permits data to be accumulated into the same set. Furthermore, the same Skolem function may be used in multiple places of a mapping or even across multiple mappings. Thus, different mappings correlated via Skolem functions may contribute to the same target sets, effectively achieving data merging. This is a typical requirement in data integration. Hence, Skolem functions are a declarative representation of a powerful array of data merging semantics.
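As a hedged illustration of the grouping and merging behavior described above, the following sketch treats each Skolem term as the identifier of a target set, so that equal terms denote the same set and data contributed under equal terms accumulates in one place. The flattened source rows, the function name exchange, and the lambda-based choice of Skolem arguments are assumptions made for this sketch; the source of the running example is nested rather than flat, and the default Skolem arguments described above would also include the nested set of employees.

```python
from collections import defaultdict

# Hypothetical flattened source rows: (dname, pname, ename, salary).
source = [
    ("CS", "uSearch", "Alice", "120K"),
    ("CS", "iQuery", "Alice", "120K"),
    ("CS", "iQuery", "Bob", "90K"),
]


def exchange(rows, emp_set_args):
    """Place employees into target sets identified by Skolem terms.

    emp_set_args selects the source values that the employee-set Skolem
    function depends on; rows yielding equal terms share one target set.
    """
    target_sets = defaultdict(set)
    for dname, pname, ename, salary in rows:
        set_id = ("E",) + emp_set_args(dname, pname)
        target_sets[set_id].add((ename, salary))
    return dict(target_sets)


# One employee set per source project tuple (a simplified default).
per_tuple = exchange(source, lambda dname, pname: (dname, pname))

# Customized grouping E[dname]: one merged employee set per department.
per_dept = exchange(source, lambda dname, pname: (dname,))
```

Changing only the arguments of the Skolem function changes the grouping, which is the sense in which the grouping semantics is stated declaratively and can be customized without touching the rest of the mapping.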
As an interesting example of a set being shared from multiple places, consider the case when “Alice” has different salaries (i.e., 120K and 130K) in the two tuples in the source of
This section describes an algorithm for the generation of nested mappings. Given two schemas, a source and a target, and a set of correspondences between atomic elements in the schemas, the algorithm generates a set of nested mappings that best reflects the given schemas and correspondences. Section 3.1 includes the first two steps in an algorithm for generating basic mappings. Section 3.2 describes an additional step in which unlikely basic mappings are pruned. This pruning significantly reduces the number of basic mappings. Section 3.3 defines when a basic mapping can be nested under another basic mapping. The pruned basic mappings are then input to the final step in the algorithm to generate nested mappings (see Section 3.4).
In step 728, the nested mapping generator constructs a directed acyclic graph (DAG) that represents all possible ways in which the basic mappings remaining after the step 726 pruning can be nested under other basic mappings. In step 730, the nested mapping generator identifies root mappings of the DAG constructed in step 728. In step 732, the nested mapping generator extracts a tree of mappings from the DAG for each root identified in step 730. Each extracted tree becomes a separate nested mapping in an outputted nested mapping specification 710. The process of
Steps 728, 730 and 732 are further described in the subsection presented below that is entitled “Step 4. Generation of nested mappings” in Section 3.4. Details of steps 728, 730 and 732 are also included in the nested mapping generation process of
This section reviews the generation algorithm for basic mappings. The main concept is that of a tableau. Tableaux are a way of describing all the basic concepts and relationships that exist in a schema. As used herein, a concept is defined as a category of data that can exist in a schema. There is a set of tableaux for the source schema and a set of tableaux for the target schema. Each tableau is primarily an encoding of one concept of a schema. In addition, each tableau includes all related concepts; that is, concepts that must exist together according to the referential constraints of the schema or the parent-child relationships in the schema. This inclusion of all related concepts allows the subsequent generation of mappings that preserve the basic relationships between concepts. Such preservation is one of the main properties of the basic mapping generation algorithm, and continues to apply to the new algorithm for generating nested mappings.
Step 1. Computation of tableaux: This step is also referred to herein as step 722 of
In addition to the structural constraints (i.e., parent-child) that are part of the primary paths, the computation of tableaux also takes into account the integrity constraints that may exist in schemas. For the example in Section 3, the target schema includes the following constraint, which is similar to a keyref in an XML Schema: every project id of an employee within a department must appear as the id of a project listed under the department. This constraint is explicitly enforced in the tableau B3 in
The tableau B3 encodes that the concept of a project-of-an-employee-of-a-department requires the following concepts to exist: the concept of an employee-of-a-department, the concept of a department, and the concept of a project-of-a-department.
For each schema, the set of its tableaux is obtained by replacing each primary path with the result of its chase, with all the applicable integrity constraints. For the example in Section 3, only one primary path is changed by the chase (i.e., changed into B3). The rest remain unchanged, since no constraints are applicable. For each tableau, for mapping purposes, all the atomic type elements that can be referred to from the variables in the tableau are considered. For example, B3 includes dname, budget, ename, salary, pid, and pname. Such elements are referred to herein as being covered by the tableau. As used herein, generators are the variable bindings that appear in a tableau. Thus, a tableau consists of a sequence of generators and a conjunction of conditions. Note that only one pid is included, since p.pid is equal to p′.pid.
Step 2. Generation of basic mappings: In the second step of the algorithm (i.e., step 724 of
Every triple (A, B, V) encodes a possible basic mapping: the for and the associated where clause are given by the generators and the conditions in A, the exists clause is given by the generators in B, and the subsequent where clause includes all the conditions in B along with conditions that encode the correspondences (i.e., for every v in V, there is an equality between the source element of v and the target element of v). Herein, the basic mapping represented by (A, B, V) is written as ∀A→∃B.V, with the meaning described above. For the example in Section 3, the basic mapping ∀A1→∃B4.{d, p} is precisely the mapping m1 of
Among all the possible triples (A, B, V), not all of them generate actual mappings. A basic mapping is generated only if it is not subsumed and not implied by other basic mappings. This optimization procedure is described in Section 3.2.
The following concept of subtableau plays an important role in reasoning about basic mappings, and in particular in pruning out unlikely mappings during generation (see Step 3 presented below). The same concept also turns out to be very useful in the subsequent generation of nested mappings.
DEFINITION 3.1. A tableau A is a subtableau of a tableau A′, denoted by A≦A′, if (1) the generators in A form a superset of the generators in A′, possibly after some renaming of variables and (2) the conditions in A are a superset of the conditions in A′ or the conditions in A imply the conditions in A′, modulo the renaming of variables. Herein, A is referred to as a strict subtableau of A′ with the notation A<A′ if A≦A′ and the generators in A form a strict superset of the generators in A′.
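A minimal sketch of Definition 3.1 is given below for illustration. It assumes that variables have already been renamed consistently, and it checks only the superset case for conditions, leaving out the case in which the conditions of A merely imply those of A′. The class name Tableau, the string encoding of generators, and the contents of the two example tableaux are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tableau:
    generators: frozenset               # e.g. {"p in proj", "e in p.emps"}
    conditions: frozenset = frozenset() # e.g. {"p.pid = q.pid"}


def is_subtableau(a, a_prime):
    """A <= A': A has at least the generators and conditions of A'."""
    return (a.generators >= a_prime.generators
            and a.conditions >= a_prime.conditions)


def is_strict_subtableau(a, a_prime):
    """A < A': A <= A' and A has strictly more generators than A'."""
    return is_subtableau(a, a_prime) and a.generators > a_prime.generators


# Assumed contents of two source tableaux, modeled on the running example.
A1 = Tableau(frozenset({"p in proj"}))
A2 = Tableau(frozenset({"p in proj", "e in p.emps"}))
```

Note that the tableau with more generators is the subtableau: here A2 is a strict subtableau of A1, and not the other way around.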
For each schema, the subtableau relationship induces a directed acyclic graph of tableaux, with an edge from A to A′ whenever A≦A′. Such a graph can be seen as a hierarchy where the tableaux that are smaller in size are at the top. The tableaux at the top correspond to the more general concepts in the schema, while those at the bottom correspond to the more specific ones. Although the subtableau relationship is reflexive and transitive, most of the time the “direct” subtableau edges are considered. For the example in Section 3, the two hierarchies with no transitive edges are shown in
Step 3. Pruning of basic mappings: Step 3 (a.k.a. step 726 of
A basic mapping ∀A→∃B.V is subsumed by a basic mapping ∀A′→∃B′.V′ if A and B are respective subtableaux of A′ and B′, with at least one being strict, and V=V′. Note that if A and B are respective subtableaux of A′ and B′, then necessarily V includes V′ since A and B cover all the atomic elements that are covered by A′ and B′, and possibly more. The subsumption condition says that (A, B, V) should not be considered since it covers the same set of correspondences that are covered by the smaller and more general tableaux A′ and B′. For the example of
A basic mapping may be logically implied by another basic mapping. Testing logical implication of basic mappings can be done using the chase, since basic mappings are tuple-generating dependencies, albeit extended over a hierarchical model. Although the chase is used for completeness in one embodiment, a simpler test suffices in another embodiment: a basic mapping m is implied by a basic mapping m′ whenever m is of the form ∀A→∃B.V and m′ is of the form ∀A→∃B′.V′ and B′ is a subtableau of B. All the target components, with their equalities, that are asserted by m are asserted by m′ as well, with the same equalities. As an example, ∀A1→∃B1.{d} is implied by ∀A1→∃B4.{d,p}.
Note that subsumption also eliminates some of the implied mappings. In the aforementioned definition of subsumption, in the particular case when B and B′ are the same tableaux, the subsumed mapping is also implied by the other one. For example, ∀A2→∃B1.{d} is subsumed and implied by ∀A1→∃B1.{d}.
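The two pruning tests of Step 3 may be illustrated by the following sketch. It represents each tableau only by its set of generators (conditions and variable renaming are ignored), uses the simpler implication test rather than the chase, and assumes the contents of tableaux A1, A2, B1, and B4; all names are hypothetical.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    A: frozenset  # generators of the source tableau
    B: frozenset  # generators of the target tableau
    V: frozenset  # correspondences covered by the pair (A, B)


def subsumed_by(m, other):
    """m is subsumed if A, B are subtableaux of A', B', one strictly, and V = V'."""
    both_sub = m.A >= other.A and m.B >= other.B
    one_strict = m.A > other.A or m.B > other.B
    return both_sub and one_strict and m.V == other.V


def implied_by(m, other):
    """Simple test: same source tableau, and B' is a subtableau of B."""
    return m != other and m.A == other.A and other.B >= m.B


def prune(mappings):
    """Keep only mappings that are neither subsumed nor implied by another."""
    return [m for m in mappings
            if not any(subsumed_by(m, o) or implied_by(m, o) for o in mappings)]


# Assumed tableau contents, modeled on the running example.
A1 = frozenset({"p in proj"})
A2 = frozenset({"p in proj", "e in p.emps"})
B1 = frozenset({"d2 in dept"})
B4 = frozenset({"d2 in dept", "q2 in d2.projects"})

a2_b1 = Mapping(A2, B1, frozenset({"d"}))
a1_b1 = Mapping(A1, B1, frozenset({"d"}))
a1_b4 = Mapping(A1, B4, frozenset({"d", "p"}))

kept = prune([a2_b1, a1_b1, a1_b4])
```

Under these assumptions only the mapping over (A1, B4) survives, which corresponds to the form given for m1 in the text.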
The generation algorithm for basic mappings stops after eliminating all the subsumed and implied mappings. For the example in Section 3, only the two basic mappings, m1 and m2, remain from
This section provides a formal definition of the notion of a basic mapping being nestable under another basic mapping. This definition follows the intuition given in Section 2.2: m2 is nested inside m1 if m1 is “part” of m2; moreover, the nesting is done by factoring out the common part (i.e., m1) and adding the “remainder” of m2 as a submapping. Based on this definition, a graph (i.e., hierarchy) of basic mappings is constructed that will be used by the actual generation algorithm, which is described in Section 3.4.
DEFINITION 3.2. A basic mapping ∀A2→∃B2.V2 is nestable inside a basic mapping ∀A1→∃B1.V1 if the following conditions hold:
(1) A2 and B2 are strict subtableaux of A1 and B1, respectively,
(2) V2 is a strict superset of V1, and
(3) there is no correspondence v in V2−V1 whose target element is covered by B1.
For the example in Section 3, the basic mapping m2=∀A2→∃B3.{d, p, e, s} is nestable inside m1=∀A1→∃B4.{d, p}. In particular, A2 and B3 are strict subtableaux of A1 and B4; also, there are two correspondences in m2 but not in m1 (i.e., e and s) and their target elements are not covered by B4.
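The three conditions of Definition 3.2 can be sketched as a predicate. This is a minimal sketch under stated assumptions: tableaux are modeled as sets of generators, each correspondence records the target generator under which its target element lies, and an element is treated as covered by a target tableau when that generator belongs to the tableau. The generator strings, the correspondence b, and the names Corr, BasicMapping and nestable_inside are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Corr:
    name: str
    target_gen: str   # target generator whose tuple holds the mapped element

@dataclass(frozen=True)
class BasicMapping:
    source: frozenset   # generators of A
    target: frozenset   # generators of B
    corrs: frozenset    # Corr objects in V

def nestable_inside(m2, m1):
    """Return True when m2 is nestable inside m1 per Definition 3.2."""
    # (1) A2 and B2 are strict subtableaux of A1 and B1 (strictly more generators).
    cond1 = m2.source > m1.source and m2.target > m1.target
    # (2) V2 is a strict superset of V1.
    cond2 = m2.corrs > m1.corrs
    # (3) no correspondence in V2 - V1 has a target element covered by B1.
    cond3 = all(c.target_gen not in m1.target for c in m2.corrs - m1.corrs)
    return cond1 and cond2 and cond3

# Hypothetical generators consistent with the text's statements about m1 and m2.
D, PJ, EM, W = "d in dept", "p2 in d.projects", "e2 in d.emps", "w in e2.worksOn"
A1 = frozenset({"p in proj"})
A2 = A1 | {"e in p.emps"}
B4 = frozenset({D, PJ})
B3 = B4 | {EM, W}
d, p, e, s = Corr("d", D), Corr("p", PJ), Corr("e", EM), Corr("s", EM)

m1 = BasicMapping(A1, B4, frozenset({d, p}))
m2 = BasicMapping(A2, B3, frozenset({d, p, e, s}))

# Hypothetical violation of condition (3): an extra correspondence b whose
# target element lies directly under dept, which B4 already covers.
b = Corr("b", D)
m_bad = BasicMapping(A2, B3, frozenset({d, p, e, s, b}))
```

With these definitions, nestable_inside(m2, m1) is true, while nestable_inside(m1, m2), nestable_inside(m1, m1) and nestable_inside(m_bad, m1) are all false.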
DEFINITION 3.3. Let m2=∀A2→∃B2.V2 be nestable inside m1=∀A1→∃B1.V1.
Without loss of generality, assume that all variable renamings have been applied so that the generators in A1 (B1) are literally a subset of those in A2 (B2). The result of nesting m2 inside m1 is a nested mapping of the form:
∀A1→∃B1.[V1 ∧ ∀(A2−A1)→∃(B2−B1).(V2−V1)]
where ∀(A2−A1)→∃(B2−B1).(V2−V1) is a shorthand for a submapping constructed as follows. The for clause contains the generators in A2 that are not in A1. The subsequent where clause, if needed, contains all the conditions in A2 that are not among and not implied by the conditions of A1. The exists clause and subsequent where clause satisfy similar properties with respect to B2 and B1. Finally, the last where clause also includes the equalities encoding the correspondences in V2−V1.
It can easily be verified that, for the example in Section 3, the result of nesting m2 inside m1 is precisely the nested mapping n. Next, conditions (1) and (3) in Definition 3.2 are explained. Assume that m2 and m1 are as presented in Definition 3.2. The condition that A2 is a strict subtableau of A1 ensures that the for clause in the submapping that appears in the result of nesting m2 inside m1 is non-empty.
Assume now that B2 is not a strict subtableau of B1 and it is equal to B1. Note that the case when there are additional conditions in B2 does not affect this discussion. Then, the submapping that appears in the result of nesting of m2 inside m1 is a formula of the form: ∀(A2−A1)→(V2−V1) (i.e., the equalities on the right-hand side are implied by the left-hand side). There is at least one correspondence v in V2−V1, and its source element is not covered by A1; otherwise it would be in V1. Hence, in the right-hand side of the aforementioned implication, there is at least one equality asserting that a target element covered by B1 is equal to a source element covered by A2−A1. The problem with this is that there are many instances of such a source element for one instance of the target element, since B1 is outside the scope of ∀(A2−A1). This constraint would effectively require that all such instances of the source element be equal, and equal to the one instance of the target element. Such a constraint is unlikely to be desired, even when it is satisfiable. Although condition (3) of Definition 3.2 is a bit more subtle, a careful analysis yields a similar justification.
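The construction of the submapping can be sketched with plain set differences. The following is an illustrative sketch only: it assumes renamings have already been applied, omits the conditions of the tableaux, and uses hypothetical generator strings and a hypothetical function name nest.

```python
def nest(m1, m2):
    """Nest m2 inside m1 following the shape of Definition 3.3 (conditions omitted)."""
    sub = {
        "for": sorted(m2["A"] - m1["A"]),      # generators in A2 that are not in A1
        "exists": sorted(m2["B"] - m1["B"]),   # generators in B2 that are not in B1
        "where": sorted(m2["V"] - m1["V"]),    # correspondences in V2 - V1
    }
    return {
        "for": sorted(m1["A"]),
        "exists": sorted(m1["B"]),
        "where": sorted(m1["V"]),
        "sub": [sub],                          # the factored-out remainder of m2
    }

# Hypothetical stand-ins for m1 and m2 of the example in Section 3.
m1 = {"A": {"p in proj"},
      "B": {"d in dept", "p2 in d.projects"},
      "V": {"d", "p"}}
m2 = {"A": {"p in proj", "e in p.emps"},
      "B": {"d in dept", "p2 in d.projects", "e2 in d.emps", "w in e2.worksOn"},
      "V": {"d", "p", "e", "s"}}

n = nest(m1, m2)
```

Here the outer level keeps the common part (the generators and correspondences of m1), and the submapping carries only the employee-level generators together with the correspondences e and s.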
This discussion is illustrated by considering the reverse of the mapping scenario shown in
There are four basic mappings (i.e., not implied and not subsumed) that are generated by the algorithm described in Section 3.1. These mappings are shown in
This constraint says that if there are multiple projects in one dept tuple, which is possible according to the source schema, then all these projects are required to have the same pname value, which must also equal the pname value in the corresponding target proj tuple. This puts a constraint on the source data that is unlikely to be satisfied. In the nested mapping generation algorithm of the present invention, mappings such as n34 are not generated.
In the next step (i.e., Step 4) of the algorithm, the nestable relation of Definitions 3.2 and 3.3 is used to create a set of nested mappings. The input to Step 4 is the set of basic mappings that result after Step 3 (i.e., the set of basic mappings that remain after the pruning in step 726 of
Step 4. Generation of nested mappings: In this step (a.k.a. steps 728, 730 and 732 of
To understand the shape of G and the issues involved in its construction, the properties of the nestable relation of Definition 3.2 are examined herein. Given two basic mappings mi and mj, let mi⇝mj denote that mi is nestable inside mj. The following properties are noted:
(1) The nestable relation is not reflexive and not symmetric. In fact, stronger statements hold: (a) for all mi, ¬(mi⇝mi), and (b) if mi⇝mj, then ¬(mj⇝mi). This property follows from the strict subtableaux requirement in condition (1) of Definition 3.2.
(2) The nestable relation is transitive: if mi⇝mj and mj⇝mk, then mi⇝mk. This property again follows from condition (1) of Definition 3.2 and, further, from conditions (2) and (3) of Definition 3.2.
Because of the two properties described above, G is necessarily acyclic: if there is a path from mi to mj in G, then no path from mj to mi exists in G. Property (2) indicates that a naive algorithm for creating G might add too many edges and hence form unnecessary nestings. Indeed, suppose that mi⇝mj and mj⇝mk, which also implies that mi⇝mk. Then mi can be nested under mj, which can be nested under mk. At the same time, mi can be nested directly under mk. One embodiment prefers the former, deeper, nesting strategy because that interpretation preserves all source data together with its structure.
To illustrate this point, consider the mapping in
To implement the above nesting strategy, which performs the “deepest” nesting possible, the algorithm for constructing G makes sure not to include any transitively implied edges. More formally, the DAG G=(M, E) of mappings is constructed so that its set of edges satisfies the following:
E={(mi→mj) | mi⇝mj ∧ ¬(∃mk)(mi⇝mk ∧ mk⇝mj)}
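One possible way to compute such an edge set is sketched below: the full nestable relation is computed first, and every pair that can be explained by an intermediate mapping is then dropped. The function name build_dag, the mapping names and the toy relation are hypothetical, and the sketch makes no claim about how the steps of the flow diagrams are actually organized.

```python
def build_dag(mappings, nestable):
    """Return the edges (mi, mj) such that mi is nestable inside mj and no
    intermediate mk exists with mi nestable inside mk and mk nestable inside mj."""
    rel = {(a, b) for a in mappings for b in mappings
           if a != b and nestable(a, b)}
    return {(a, b) for (a, b) in rel
            if not any((a, k) in rel and (k, b) in rel for k in mappings)}

# Toy transitive relation: mi nests inside mj, mj inside mk, hence mi inside mk.
toy_relation = {("mi", "mj"), ("mj", "mk"), ("mi", "mk")}
edges = build_dag(["mi", "mj", "mk"], lambda a, b: (a, b) in toy_relation)
```

In this toy case the transitively implied edge from mi to mk is omitted, so the only nesting offered for mi is the deepest one, under mj.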
The creation of G proceeds in two steps. First, in step 742 of
The next step is to extract trees of mappings from G. Each such tree becomes a nested mapping expression. These trees are computed in two simple steps. First, in step 748 of
Constructing nested mappings from a tree of mappings raises several issues. First, Definition 3.3 explained the meaning of nesting two basic mappings, one under the other. But, in a tree, one mapping can have multiple children that can each be nested inside the parent. Also, the definition must be applied recursively.
The second, more important issue is that, since these trees are extracted from a DAG, it is possible that they share mappings. In other words, a mapping can be nested under more than one mapping.
Consider, for example, a mapping scenario that involves three sets: employees, worksOn, and projects. The worksOn set contains references to employees and projects tuples, capturing an N:M relationship. Assume that me is a basic mapping for employees, mp is a basic mapping for projects, and mw is a basic mapping that maps employees and projects by joining them via worksOn. The resulting graph G of mappings contains two mapping trees (i.e., two nested mappings), which both yield valid interpretations: T1, in which mw is nested under me, and T2, in which mw is nested under mp. Both trees share mw as a leaf. If only one tree is arbitrarily used and the other is ignored, then source data can be lost: the nested mapping based on T1 maps all the employees; however, it maps only projects that are associated with an employee via worksOn. The situation is reversed for T2.
However, the inclusion of the shared subtrees in all their “parent” trees will create nested mappings that lead to redundancy in execution as well as in the generated data. To avoid this, a simple strategy is adopted to keep a shared subtree in only one of the parent trees and prune it from all the others. For the example in Section 3, T1 is kept intact and the common subtree is cut from T2, yielding T′2={mp}. In general, however, the algorithm should not make a choice of which trees to prune and which to keep intact. This is a semantic and application-dependent decision. The various choices lead to inequivalent mappings that do not lose data but give preference to certain correlations in the data (e.g., group projects by employees as opposed to grouping employees by projects). Furthermore, there can be differences in the performance of the subsequent execution of the data transformation.
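The strategy of keeping a shared subtree in only one parent tree can be sketched as follows. This is an illustrative sketch, not the claimed procedure: it assumes edges are given as (child, parent) pairs, and it resolves the choice of which tree keeps a shared subtree simply by visiting roots in the order supplied, which is an arbitrary policy standing in for the semantic, application-dependent decision described above. The function name extract_trees is hypothetical.

```python
def extract_trees(nodes, edges):
    """Extract one tree per root of the DAG; a shared subtree is kept only in
    the first tree that reaches it and is pruned from all the others."""
    children = {n: [] for n in nodes}
    has_parent = set()
    for child, parent in edges:
        children[parent].append(child)
        has_parent.add(child)

    claimed = set()   # mappings already placed in some tree

    def grow(n):
        claimed.add(n)
        kids = []
        for c in children[n]:
            if c not in claimed:      # prune subtrees another tree already holds
                kids.append(grow(c))
        return (n, kids)

    # Roots are the mappings that are not nestable inside any other mapping.
    return [grow(r) for r in nodes if r not in has_parent]

# The employees / projects / worksOn scenario: mw can nest under me or under mp.
trees = extract_trees(["me", "mp", "mw"], [("mw", "me"), ("mw", "mp")])
```

For this scenario the first tree keeps mw under me and the second tree is reduced to mp alone; listing mp before me in the input would reverse the preference.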
Ideally, a human user could suggest which mapping to generate, if exposed to all the possible choices of mappings with shared submappings. One embodiment implements a strategy that selects one of the pruning choices whenever there is such a choice, but another embodiment allows users to explore the space of such choices.
One of the main reasons for creating mappings is to be able to automatically create a query or program that transforms an instance of the source schema into an instance of the target schema. Previous works described how to generate queries from basic mapping specifications. Those works are extended herein to address nested mappings. Because the queries generated by the process described herein start from the more expressive nested mapping specification, these queries often perform better, have more functionality in terms of grouping and restructuring, and at the same time are closer to the mapping specification and therefore easier to understand.
Section 4.1 presents a general query generation algorithm that works for nested mappings with arbitrary Skolem functions for the set elements, and hence for arbitrary regrouping and restructuring of the source data. Section 4.2 presents an optimization that simplifies the query and significantly improves performance in the case of nested mappings with default Skolemization, which are the mappings that are produced by the nested mapping generation algorithm described herein. In particular, the optimization of Section 4.2 greatly impacts the scenarios in which no complex restructuring of the source is needed. Many schema evolution scenarios follow this pattern.
The general algorithm for query generation produces queries that process source data in two phases. This query generation algorithm starts at step 1100 of
First-phase query: This subsection describes the step 1102 construction of the flat views and of the first-phase query. For each target set type for which there is a mapping that asserts some tuple for that set type, there is a view, with an associated schema and a query defining the view. To illustrate, consider an example (a.k.a. the example in Section 4.1) that includes the schemas of
As can be seen, the view for each set type includes the atomic type elements that are directly under the set type. Additionally, setID columns are included for each of the set types that are directly nested under the given set type. Finally, for each set type that is not top-level there is an additional column setID. In the view schema example presented above, dept is the only top-level set type. Using emps to illustrate, the need for the additional setID column is explained as follows: While in the target schema there is only one set type emps, in an actual instance there may be many sets of employee tuples, nested under the various dept tuples. However, the tuples of these nested sets will all be mapped into one single table (i.e., emps). In order to remember the association between employee tuples and the sets they belong to, the setID column is used to record the identity of the set for each employee tuple. This setID column is later used to join with the empsID column under the “parent” table dept, to construct the correct nesting.
This subsection next describes the queries defining the views and how these queries are generated. The query generation algorithm starts by Skolemizing each nested mapping and decoupling it into a set of single-headed constraints, each consisting of one implication and one atom in the right-hand side of the implication. For the example in Section 4.1, the nested mapping n generates the following four constraints (i.e., one constraint for each target atom in n):
Skolemization replaces every existentially-quantified variable by a Skolem function that depends on all the universally-quantified variables that appear before the existential variable in the original mapping. For example, the atomic variable ?x along with all of its occurrences is replaced by X[d,p,E0], where X is a new Skolem function name. Here, E0 denotes the set id and not the contents of the set; thus, the Skolem function does not depend on the actual values under E0. Atomic variables that do not play an important role (e.g., not a key or a foreign key) can be replaced by null (see ?b presented above). Finally, all existential set variables are replaced by Skolem terms if they are not already given by the mapping. Each of the four constraints presented above can be seen as an assertion of “facts” that relate tuples and set ids. For example, r3 shown above asserts a fact relating the tuple (e, s, P′[d, p,E0, e, s]) and the set id E[d, p, E0].
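The behavior of Skolem terms as id generators can be sketched by modeling a term as a function name paired with its argument values. This is a minimal sketch; the helper skolem, the sample values and the opaque set identifiers such as "set#17" are hypothetical.

```python
def skolem(name):
    """Return a Skolem function: equal arguments give equal terms, and
    different arguments (or a different function name) give different terms."""
    def make(*args):
        return (name, args)
    return make

E = skolem("E")          # set id generator for emps
P_prime = skolem("P'")   # set id generator for projects under employees

# Values bound by the universally quantified variables. E0 stands for the
# identity of the nested source set, not for the tuples it contains.
d, p, E0 = "CS", "uSearch", "set#17"
e, s = "Alice", 120

# A fact in the style of r3: the tuple (e, s, P'[d,p,E0,e,s]) belongs to
# the set whose id is E[d,p,E0].
fact = ((e, s, P_prime(d, p, E0, e, s)), E(d, p, E0))
```

Two evaluations of E with the same d, p and E0 denote the same target set, whereas changing only the set identity E0 yields a different target set even when the contents of the two source sets happen to coincide.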
Next, the queries defining the contents of the flat views have the role of storing the facts asserted by the above constraints into the corresponding flat views. For example, all the facts asserted by r3 are stored into emps, where the setID column is used to store the set ID, as explained earlier. The following is the set of query definitions for the aforementioned four views:
Note that if multiple mappings contribute tuples to a target set type, then each such mapping will contribute with a query expression and the corresponding view is defined by the union of all these query expressions. In the case in which the same Skolem function is used from multiple mappings to define the same set instance (e.g., as discussed in Section 2.3), then the union of queries defining the view will effectively accumulate all the tuples of this set instance within the view. Moreover, all these tuples will have the same set id.
Second-phase query: Finally, in step 1104, the previously defined views are used within a query (see q presented below) that combines and nests the data according to the shape of the target schema. Notice that the nesting of data on the target is controlled by the Skolem function values computed for the set id columns in the views.
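The two phases can be sketched on a deliberately reduced version of the example: only the dept and emps set types are kept, the projects sets are omitted, and the position of a source tuple in the input list stands in for the identity of that tuple when forming set ids. The sample data and the function names first_phase and second_phase are hypothetical.

```python
proj = [
    {"dname": "CS", "pname": "uSearch",
     "emps": [{"ename": "Alice", "salary": 120}, {"ename": "John", "salary": 90}]},
    {"dname": "CS", "pname": "iQuest",
     "emps": [{"ename": "Alice", "salary": 120}]},
]

def first_phase(source):
    """Shred the source into flat views; setID records the set each row belongs to."""
    dept_view, emps_view = [], []
    for i, p in enumerate(source):
        set_id = ("E", i)   # Skolem term for the emps set created from tuple p
        dept_view.append({"dname": p["dname"], "empsID": set_id})
        for e in p["emps"]:
            emps_view.append({"ename": e["ename"], "salary": e["salary"],
                              "setID": set_id})
    return dept_view, emps_view

def second_phase(dept_view, emps_view):
    """Re-nest by joining the child view's setID with the parent's empsID column."""
    return [{"dname": d["dname"],
             "emps": [{"ename": e["ename"], "salary": e["salary"]}
                      for e in emps_view if e["setID"] == d["empsID"]]}
            for d in dept_view]

dept_view, emps_view = first_phase(proj)
target = second_phase(dept_view, emps_view)
```

All employee rows land in the single emps view, and only the setID value distinguishes the set under the first dept tuple from the set under the second one.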
The two-phase query generation algorithm of Section 4.1 is general in the sense that it can work for arbitrary restructuring of the data. However, the query generation algorithm of Section 4.1 does require the data to be flattened before being re-nested in the target format. In cases in which the source and target schemas have similar nesting shape and the grouping behavior given by the default Skolem functions is sufficient, the two-phase strategy can be inefficient. In such cases, a query optimization process of
The query optimization process begins at step 1200. In step 1202, query generator 712 (see
For example, the nested mapping n used in Section 4.1 falls in the category of nested mappings with default Skolemization, as determined by step 1202. Under default Skolemization, all the set ids that are created (i.e., created by the first-phase query) depend on entire source tuples rather than individual pieces of these tuples. To illustrate, the default Skolem function E for emps depends on p.dname, p.pname and p.emps, which is equivalent to saying that E is a function of the source tuple p. Similarly, the Skolem function P for projects under departments depends on p. Also, the Skolem function P′ for projects under employees depends on p.dname, p.pname, p.emps and e.ename and e.salary, which means that P′ is a function of the source tuples p and e. Under such a scenario, the views defined by the first-phase query are inlined in step 1204 into the places where the views occur in the second-phase query. Using the example in Section 4.1 and taking care to rename conflicting variable names, the following rewrite q′ of q is obtained:
Since the Skolem functions are one-to-one id generators, the equalities of the function terms are now replaced with the equalities of the arguments in step 1206. Thus E[p]=E[p′] is replaced with p=p′. Also, P′[p′, e]=P′[p″, e′] is replaced with the conjunction of p′=p″ and e=e′. Furthermore, P[p]=P[p′] is replaced with p=p′. Hence, a rewriting of q′ is obtained where some of the inner loops are unnecessary. The redundant parts in q′ presented above include: (1) for p′ in proj, and where E[p]=E[p′] following emps=; (2) for p″ in proj, e′ in p″.emps where P′[p′,e]=P′[p″,e′] following the innermost projects=; and (3) for p′ in proj where P[p]=P[p′] following the outermost projects=. The query q′ is then rewritten by removing the declaration of p′ and the self-join condition p=p′. If this is done at all levels where setID equalities are used, then the above-listed redundant parts (1)-(3) of the query can be removed in step 1208. In some cases, the loops are completely replaced by singleton set expressions; this happens for both projects sets in the example in Section 4.1. The final query (i.e., the result of the rewritten query in step 1206 followed by the removal of redundant parts in step 1208) is shown below as q″, which tightly follows the expressions and optimizations encoded in the nested mapping n.
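The effect of the rewrite can be sketched on the emps level alone. The sketch below assumes, as a simplification, that the list position of a source tuple stands in for the tuple itself, so that the one-to-one Skolem function E is modeled as a function of that position; the projects sets are omitted. The function names q_inlined and q_optimized and the sample data are hypothetical.

```python
proj = [
    {"dname": "CS", "emps": [{"ename": "Alice"}, {"ename": "John"}]},
    {"dname": "EE", "emps": [{"ename": "Mary"}]},
]

def E(i):
    """One-to-one set id generator over source tuple positions."""
    return ("E", i)

def q_inlined(source):
    """Inlined form: the emps set is rebuilt by a self-join on Skolem term equality."""
    return [{"dname": p["dname"],
             "emps": [dict(e)
                      for j, p2 in enumerate(source)   # redundant second loop
                      for e in p2["emps"]
                      if E(i) == E(j)]}                # holds exactly when i == j
            for i, p in enumerate(source)]

def q_optimized(source):
    """After replacing E[p]=E[p'] by p=p' and dropping the redundant loop."""
    return [{"dname": p["dname"], "emps": [dict(e) for e in p["emps"]]}
            for p in source]
```

Both functions return the same nested result, but the inlined form scans proj once per outer tuple, whereas the optimized form visits each source tuple once.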
A number of experiments were conducted to understand the performance of (a) the nested mapping queries described in Section 4 and (b) the nested mapping creation algorithm of Section 3. The nested mapping prototype described herein is implemented in Java. The experiments were performed on a PC-compatible machine, with a single 2.0 GHz P4 CPU and 1 GB RAM, running Windows XP (SP1) and JRE 1.4.2. Each experiment was repeated three times, and the average of the three trials is reported.
First, the performance of queries generated using nested mappings is compared with the performance of queries generated from basic mappings. This comparison focuses on a schema evolution scenario where nested mappings with default Skolemization suffice to express the desired transformation and inlining is applied to optimize the nested mapping query, as described in Section 4.2. A nested schema authorDB was created based on the Digital Bibliography & Library Project (DBLP) structure, but with four levels of nesting. The first level contains an author set. Each author tuple has an attribute name and a nested set of confJournal tuples. Each confJournal tuple has an attribute name and a set of year tuples. Each year tuple contains a yr attribute and a set of pub elements, each with five attributes: pubId, title, pages, cdrom, url.
The basic and nested mapping algorithms were run on four different settings to create four pairs of mappings (i.e., each pair consisting of one basic and one nested mapping). The nested schema authorDB was used as both the source and target schema, and different sets of correspondences were added to create the four different settings. In the first setting, m1, only the top-level author set was mapped (i.e., only one correspondence between the name attributes of author was used). In the second setting, the first and second levels of authorDB (i.e., author and confJournal) were mapped. Since levels 1 and 2 were mapped, this mapping is herein referred to as m12. In the same fashion, correspondences were added for the third and fourth levels of authorDB, creating mappings m123 and m1234, respectively.
For each mapping, two XQuery scripts were created: one generated using the basic mappings, and another generated from the nested mappings, as described in Sections 4.1 and 4.2.
The queries were run using the Saxon XQuery processor with increasingly larger input files.
A cursory inspection of the queries in
The basic mapping query strategy can also create a large number of duplicates in the output instance. To illustrate this problem, a mapping m14 was created that maps the author and pub levels of the schema. The queries for m14 and m1234 were run using an input instance that contains 4173 author elements and a total of 6468 pub elements nested within those authors. The count of resulting author and pub elements in the output instance is shown in this table:
The nested mapping queries do not create duplicates for either of the two mappings and produce a copy of the input instance, which is the expected output instance in all these mappings. The basic mapping queries, on the other hand, create 2295 duplicate author elements. A duplicate is created whenever an author has more than one publication. Each author duplicate then carries the same set of duplicate publications, causing an explosion of duplicate pub elements. The nested mapping query that is automatically generated by the algorithm described herein does not suffer from this common problem.
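The reported duplicate count is consistent with a simple model, offered here only as an assumption and not as a description of the generated XQuery: the basic mapping query emits one author element per (author, pub) pair, every author has at least one pub, and the nested mapping query emits one author element per author. The toy data and function names below are hypothetical.

```python
def flat_author_count(pubs_per_author):
    """Author elements emitted when each (author, pub) pair yields an author."""
    return sum(pubs_per_author.values())

def nested_author_count(pubs_per_author):
    """Author elements emitted when pubs are grouped under their author."""
    return len(pubs_per_author)

# Toy instance: three authors with 3, 1 and 2 publications.
toy = {"a1": 3, "a2": 1, "a3": 2}
toy_duplicates = flat_author_count(toy) - nested_author_count(toy)

# Figures stated in the text for the instance used in the experiment.
authors, pubs, reported_duplicates = 4173, 6468, 2295
```

Under this model the number of duplicate authors equals the number of pub elements minus the number of author elements, and the stated figures satisfy 6468 − 4173 = 2295.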
This section reviews the performance and scalability of the nested mapping generation algorithm.
For the chain scenario, the number of different sources (m) and the number of inter-linked relational tables (depth) were increased (i.e., 1≦m≦20 and 1≦depth≦3). In the worst case, the prototype took 0.2 seconds to compute the nested mapping. For the authority scenario, the number of sources (m) and the branching factor (n) (i.e., the number of child tables) were simultaneously increased such that m=n for each trial.
Finally, the algorithm performance was evaluated with a mapping that uses the Mondial schema, a database of geographical data. Mondial has a relational representation with 28 relations and a maximum branching factor of 9. Its XML Schema counterpart has a maximum depth of 5 and a maximum branching factor of 9. The relational representation was mapped into the XML representation and 26 basic mappings were created in 1.2 seconds. The nesting algorithm then extracted 10 nested mappings in 2.8 seconds.
Described herein is a new, structured mapping formalism called nested mappings that provides a natural way to express correlations between schema mappings. The benefits of this formalism are demonstrated herein, including increased specification accuracy and the ability to specify and customize grouping semantics declaratively. An algorithm is provided herein to generate nested mappings from standard schema matchings. The present application shows how to compile these mappings into transformation queries that can be much more efficient than their counterparts obtained from the earlier basic mappings. The new transformation queries also generate much cleaner data. Certainly nested mappings have important applications in schema evolution where the mapping must be able to ensure that the grouping of much of the data is not changed. Indeed the work herein was largely inspired by the inability of existing mapping formalisms to faithfully represent the “identity mapping” for many schemas.
Local memory elements of memory 1704 are employed during actual execution of the program code for generating nested mapping specifications 1714 and for generating transformation queries 1716. Cache memory elements of memory 1704 provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Further, memory 1704 may include other systems not shown in
Memory 1704 may comprise any known type of data storage and/or transmission media, including bulk storage, magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc. Storage unit 1712 is, for example, a magnetic disk drive or an optical disk drive that stores data. Moreover, similar to CPU 1702, memory 1704 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. Further, memory 1704 can include data distributed across, for example, a LAN, WAN or storage area network (SAN) (not shown).
I/O interface 1706 comprises any system for exchanging information to or from an external source. I/O devices 1710 comprise any known type of external device, including a display monitor, keyboard, mouse, printer, speakers, handheld device, facsimile, etc. Bus 1708 provides a communication link between each of the components in computing unit 1700, and may comprise any type of transmission link, including electrical, optical, wireless, etc.
I/O interface 1706 also allows computing unit 1700 to store and retrieve information (e.g., program instructions or data) from an auxiliary storage device (e.g., storage unit 1712). The auxiliary storage device may be a non-volatile storage device (e.g., a CD-ROM drive which receives a CD-ROM disk). Computing unit 1700 can store and retrieve information from other auxiliary storage devices (not shown), which can include a direct access storage device (DASD) (e.g., hard disk or floppy diskette), a magneto-optical disk drive, a tape drive, or a wireless communication device.
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for generating nested mapping specifications 1714 and for generating transformation queries 1716 for use by or in connection with a computing unit 1700 or any instruction execution system to provide and facilitate the capabilities of the present invention. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, RAM 1704, ROM, a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read-only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
The flow diagrams depicted herein are provided by way of example. There may be variations to these diagrams or the steps (or operations) described herein without departing from the spirit of the invention. For instance, in certain cases, the steps may be performed in differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the present invention as recited in the appended claims.
While embodiments of the present invention have been described herein for purposes of illustration, many modifications and changes will become apparent to those skilled in the art. Accordingly, the appended claims are intended to encompass all such modifications and changes as fall within the true spirit and scope of this invention.
This application is related to the following commonly assigned patent applications, which are hereby incorporated herein by reference in their entirety: (1) U.S. patent application Ser. No. 11/326,969, filed on Jan. 6, 2006, and entitled “Mapping-Based Query Generation with Duplicate Elimination and Minimal Union.” (2) U.S. patent application Ser. No. 11/343,503, filed on Jan. 31, 2006, and entitled “Schema Mapping Specification Framework.”