The present invention generally relates to schema mappings, and more particularly relates to correlating schema mappings.
Schema mappings are essential building blocks for information integration. One of the main steps in the integration or exchange of data is to design the mappings that describe the desired relationships between the various source schemas or source formats and the target schema. Once the mappings are established they can be used either to support query answering on the (virtual) target schema, a process that is traditionally called data integration (See, for example, M. Lenzerini: Data Integration: A Theoretical Perspective, PODS, pages 233-246, 2002, which is hereby incorporated by reference in its entirety), or to physically transform the source data into the target format, a process referred to as data exchange (See, for example, R. Fagin, Ph. G. Kolaitis, R. J. Miller, and L. Popa: Data Exchange: Semantics and Query Answering, TCS, 336(1):89-124, 2005, which is hereby incorporated by reference in its entirety).
In one embodiment a method is disclosed. The method comprises receiving a set of schema mappings over a source schema and a target schema. Each of the schema mappings is decomposed into a basic schema mapping that specifies a mapping behavior for a single target relation and comprises a description of attributes within the single target relation. A plurality of basic schema mappings is generated as a result of the schema mappings being decomposed. A first set of relations is determined for the source schema and a second set of relations is determined for the target schema. The first set of relations and the second set of relations are related by referential constraints within the source schema and the target schema. Each relation in the first set of relations is paired to at least one relation in the second set of relations. The pairing forms a plurality of relation pairs between the first set of relations and the second set of relations in the form of (T, T′), where T is a source portion of a relation pair and T′ is a target portion of the relation pair. A set of basic schema mappings from the plurality of basic schema mappings is identified, for at least one relation pair, that matches the relation pair. Each basic schema mapping in the set of basic schema mappings is merged into a single schema mapping.
In another embodiment a system is disclosed. The system comprises memory and a processor that is communicatively coupled to the memory. A schema mapping merger is communicatively coupled to the memory and the processor. The schema mapping merger is configured to receive a set of schema mappings over a source schema and a target schema. Each of the schema mappings is decomposed into a basic schema mapping that specifies a mapping behavior for a single target relation and comprises a description of attributes within the single target relation. A plurality of basic schema mappings is generated as a result of the schema mappings being decomposed. A first set of relations is determined for the source schema and a second set of relations is determined for the target schema. The first set of relations and the second set of relations are related by referential constraints within the source schema and the target schema. Each relation in the first set of relations is paired to at least one relation in the second set of relations. The pairing forms a plurality of relation pairs between the first set of relations and the second set of relations in the form of (T, T′), where T is a source portion of a relation pair and T′ is a target portion of the relation pair. A set of basic schema mappings from the plurality of basic schema mappings is identified, for at least one relation pair, that matches the relation pair. Each basic schema mapping in the set of basic schema mappings is merged into a single schema mapping.
In yet another embodiment, a computer program product comprising a computer readable storage medium having computer readable program code embodied therewith is disclosed. The computer readable program code comprises computer readable program code configured to receive a set of schema mappings over a source schema and a target schema. Each of the schema mappings is decomposed into a basic schema mapping that specifies a mapping behavior for a single target relation and comprises a description of attributes within the single target relation. A plurality of basic schema mappings is generated as a result of the schema mappings being decomposed. A first set of relations is determined for the source schema and a second set of relations is determined for the target schema. The first set of relations and the second set of relations are related by referential constraints within the source schema and the target schema. Each relation in the first set of relations is paired to at least one relation in the second set of relations. The pairing forms a plurality of relation pairs between the first set of relations and the second set of relations in the form of (T, T′), where T is a source portion of a relation pair and T′ is a target portion of the relation pair. A set of basic schema mappings from the plurality of basic schema mappings is identified, for at least one relation pair, that matches the relation pair. Each basic schema mapping in the set of basic schema mappings is merged into a single schema mapping.
The accompanying figures where like reference numerals refer to identical or functionally similar elements throughout the separate views, and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the present invention, in which:
The schema mapping merger 104 also enables a new “divide-and-merge” paradigm for mapping creation. The design is divided into smaller components that are easier to create and understand. The schema mapping merger 104 uses these smaller components to automatically generate a meaningful overall mapping. The schema mapping merger 104 improves the quality of the schema mappings by significantly increasing the similarity between the input source instance and the generated target instance. The operation(s) performed by the schema mapping merger 104 are herein referred to as the “MapMerge” operation(s).
The schema mapping merger 104, in one embodiment, comprises a decomposer module 106 that decomposes input mapping assertions into basic components that are intuitively easier to merge. The schema mapping merger 104 also comprises an association module 108 that utilizes an algorithm such as, but not limited to, a chase algorithm to compute associations (referred to here as “tableaux”) from source and target schemas, as well as from source and target assertions of the input mappings. By pairing source and target tableaux the schema mapping merger 104 obtains all (or at least a portion of) the possible skeletons of mappings.
The schema mapping merger 104 further comprises a correlated mappings constructor 110. The correlated mappings constructor 110 constructs correlated mappings by taking, for each skeleton, the union of all (or at least a portion of) the basic components generated by the decomposer module 106 that “match” the skeleton. The schema mapping merger 104 further comprises an optimizer 112 that eliminates residual equality constraints and also flags conflicts that may arise and that need to be addressed by the user. These conflicts occur when multiple mappings that map to the same portion of the target schema contribute with different, irreconcilable behaviors. The schema mapping merger 104 and its components are discussed in greater detail below.
Consider a mapping scenario between the schemas S1 201 and S2 203 as shown in
Independent Mappings: assume the existence of the following (independent) schema mappings from S1 201 to S2 203. The first mapping is the constraint t1 302 under the input mappings from S1 201 to S2 203 shown in
In one example, the system (re)generates a “good” overall schema mapping from S1 201 to S2 203 based on its input mappings. It should be noted that the input mappings, when considered in isolation, do not generate an ideal target instance. For example, consider the source instance I 402 in
Since the three mapping constraints are not correlated, the three did values (D1, D2, D3) are distinct (there is no requirement that they must be equal). As a result, the target instance J1 404 exhibits the typical problems that arise when uncorrelated mappings are used to transform data: (1) duplication of data (e.g., multiple Dept tuples for CS with different did values), and (2) loss of associations where tuples are not linked correctly to each other (e.g., the association between project name Web and department name CS that existed in the source is lost).
Correlated Mappings via the Schema Mapping Merger 104: consider now the schema mappings that are shown in
The target instance that is obtained by applying the result of the MapMerge operation is the instance J2 406 shown in
Flows of Mappings: taking the idea of mapping reuse and modularity one step further, an even more compelling use case for the MapMerge operation in conjunction with mapping composition is the flow-of-mappings scenario. With respect to mapping composition see, for example, R. Fagin, P. G. Kolaitis, L. Popa, and W. Tan: Composing Schema Mappings: Second-Order Dependencies to the Rescue, TODS, 30(4):994-1055, 2005; J. Madhavan and A. Y. Halevy: Composing Mappings Among Data Sources, VLDB, pages 572-583, 2003; and A. Nash, P. A. Bernstein, and S. Melnik: Composition of Mappings given by Embedded Dependencies, PODS, pages 172-183, 2005, which are hereby incorporated by reference in their entireties. With respect to the flow-of-mappings-scenario see, for example, A. Nash and P. A. Bernstein and S. Melnik: Composition of Mappings given by Embedded Dependencies, PODS, pages 172-183, 2005, which is hereby incorporated by reference in its entirety.
One embodiment produces a data transformation from the source to the target; the process can be decomposed into several simpler stages, where each stage maps from or into some intermediate, possibly simpler schema. Moreover, the simpler mappings and schemas play the role of reusable components that can be applied to build other flows. Such abstraction is directly motivated by the development of real-life, large-scale ETL flows such as those typically developed with IBM Information Server (Datastage), Oracle Warehouse Builder, and others.
For example, consider transforming data from the schema S1 201 of
Once these individual mappings are established, the same problem of correlating the mappings arises. In particular, one has to correlate mCS ∘m, which is the result of applying mapping composition to mCS and m, with the mappings m1 for Emp and m2 for Proj. This correlation ensures that all (or at least a portion of) employees and projects of computer science departments will be correctly mapped under their correct departments in the target schema.
In this example, composition itself gives another source of mappings to be correlated by the schema mapping merger 104. While the MapMerge operation is similar to composition in that both are operators on schema mappings, it is fundamentally different in that it correlates mappings that share the same source schema and the same target schema. In contrast, composition takes two sequential mappings where the target of the first mapping is the source of the second mapping. Nevertheless, the two operators are complementary and together they can play a fundamental role in building data flows.
Preliminaries
A schema consists of a set of relation symbols, each with an associated set of attributes. Moreover, each schema can have a set of inclusion dependencies modeling foreign key constraints. It should be noted that even though the following discussion is directed to a relational case, one or more embodiments are also applicable to the more general case of a nested relational data model (see, for example, L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernandez, and R. Fagin: Translating Web Data, VLDB, pages 598-609, 2002, which is hereby incorporated by reference in its entirety), where the schemas and mappings can be either relational or XML.
Schema Mappings: a schema mapping is a triple (S,T,Σ) where S is a source schema, T is a target schema, and Σ is a set of second-order tuple generating dependencies (SO tgds). Throughout the following discussion, the following notation is used for expressing SO tgds:
for {right arrow over (x)} in {right arrow over (S)} satisfying B1({right arrow over (x)}) exists {right arrow over (y)} in T where B2({right arrow over (y)}) and C({right arrow over (x)}, {right arrow over (y)})
Examples of SO tgds in this notation are shown in
Note that the SO tgds of one or more embodiments do not allow equalities between or with Skolem terms in the satisfying clause. While such equalities may be needed for more general purposes they do not play a role for data exchange and can be eliminated, as observed in C. Yu and L. Popa. Semantic Adaptation of Schema Mappings when Schemas Evolve. VLDB, pages 1006-1017, 2005, which is hereby incorporated by reference in its entirety.
Chase-Based Semantics: the semantics adopted in at least one embodiment for a schema mapping (S,T,Σ) is the standard data-exchange semantics where, given a source instance I, the result of “executing” the mapping is the target instance J that is obtained by chasing I with the dependencies in Σ. Since the dependencies in Σ are SO tgds, an extension of the chase as defined in Fagin et al.: Composing Schema Mappings: Second-Order Dependencies to the Rescue, is used in one embodiment.
The chase provides a way of populating the target instance J in a minimal way, by adding the tuples that are needed by Σ. For every instantiation of the for clause of a dependency in Σ such that the satisfying clause is satisfied but the exists and where clauses are not, the chase adds corresponding tuples to the target relations. Fresh new values (also called labeled nulls) are used to give values for the target attributes for which the dependency does not provide a source expression. Additionally, Skolem terms are instantiated by nulls in a consistent way: a term F[x1, . . . , xi] is replaced by the same null every time x1, . . . , xi are instantiated with the same source tuples. Finally, to obtain a valid target instance, the schema mapping merger 104 chases (if needed) with the target constraints.
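The consistent instantiation of Skolem terms by labeled nulls can be sketched as follows (a minimal illustrative Python fragment; the class and variable names are hypothetical and do not appear in any embodiment):

```python
class NullFactory:
    """Instantiates Skolem terms F[x1, ..., xi] by labeled nulls,
    reusing the same null whenever the same term is applied to the
    same source tuples, and drawing a fresh null otherwise."""

    def __init__(self):
        self._cache = {}
        self._counter = 0

    def null_for(self, skolem_name, args):
        key = (skolem_name, tuple(args))
        if key not in self._cache:
            self._counter += 1
            self._cache[key] = f"N{self._counter}"
        return self._cache[key]

factory = NullFactory()
# Same Skolem term over the same source tuple: the same null is reused.
n1 = factory.null_for("F", [("Group", "123", "CS")])
n2 = factory.null_for("F", [("Group", "123", "CS")])
# Different source tuple: a fresh labeled null is drawn.
n3 = factory.null_for("F", [("Group", "456", "EE")])
```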
Using the example given above with respect to
Correlating Mappings
The following is a more detailed discussion on the MapMerge operation performed by the schema mapping merger 104. As will be shown, the MapMerge operation generates correlations between mappings that preserve the natural data associations in the source without introducing extra associations.
The schema mapping merger 104 exploits the structure and the constraints in the schemas in order to define what natural associations are. Two data elements are considered associated if they are in the same tuple or in two different tuples that are linked via constraints. See, for example, L. Popa et al.: Translating Web Data. This idea provides the first (conceptual) step towards the MapMerge operation. For the example discussed above with respect to
t′3: for w in Works, g in Group satisfying w·gno=g·gno
Formally, the above rewriting from t3 to t′3 is captured by the chase procedure. See, for example, C. Beeri and M. Y. Vardi: A Proof Procedure for Data Dependencies, JACM, 31(4):718-741, 1984; and D. Maier, A. O. Mendelzon, Y. Sagiv: Testing Implications of Data Dependencies, TODS, 4(4):455-469, 1979, which are hereby incorporated by reference in their entireties. The chase is a convenient tool to group together, syntactically, elements of the schema that are associated. The chase by itself, however, does not change the semantics of the mapping. In particular, the above t′3 does not include any additional mapping behavior from Group to Dept.
The schema mapping merger 104 can also reuse or borrow mapping behavior from a more general mapping to a more specific mapping. The schema mapping merger 104 uses a heuristic that changes the semantics of the entire schema mapping and produces a better one, with consolidated semantics. For example, consider the first mapping constraint 502 in
t″3: for w in Works, g in Group satisfying w·gno=g·gno
The algorithm utilized by the schema mapping merger 104 for performing the MapMerge operation is more complex than intuitively suggested above, and will now be discussed in greater detail.
In addition to being single-relation in the target, each basic SO tgd gives a complete specification of all (or at least a portion of) the attributes of the target relation. More precisely, each basic SO tgd has the form
for {right arrow over (x)} in {right arrow over (S)} satisfying B1({right arrow over (x)})
exists y in T where ∧AεAtts(y) y·A=eA({right arrow over (x)})
where the conjunction in the where clause contains one equality constraint for each attribute of the record y asserted in the target relation T. The expression eA({right arrow over (x)}) is either a Skolem term or a source expression (e.g., x·B). Part of the role of the decomposition phase is to assign a Skolem term to every target expression y·A for which the initial mapping does not equate it to a source expression.
For the example given above with respect to
(b1): for g in Group exists d in Dept
(b2): for w in Works, g in Group
(b′2): for w in Works, g in Group
(b3): for w in Works exists p in Proj
The basic SO tgd b1 is obtained from t1; the main difference is that d·did, whose value was unspecified by t1, is now explicitly assigned the Skolem term F[g]. The only argument to F is g because g is the only record variable that occurs in the for clause of t1. Similarly, the basic SO tgd b3 is obtained from t3, with the difference being that p·budget and p·did are now explicitly assigned the Skolem terms H1[w] and, respectively, H2[w].
In the case of t2, it should be noted that there are two existentially quantified variables, one for Emp and one for Dept. Hence, the decomposition algorithm generates two basic SO tgds: the first one maps into Emp and the second one maps into Dept. Observe that b2 and b2′ are correlated and share a common Skolem term G[w,g] that is assigned to both e·did and d·did. Thus, the association between e·did and d·did in the original schema mapping t2 is maintained in the basic SO tgds b2 and b2′.
In general, the decomposition process ensures that associations between target facts that are asserted by the original schema mapping are not lost. The process is similar to the Skolemization procedure that transforms first order tgds with existentially quantified variables into second order tgds with Skolem functions. After such Skolemization, all (or at least a portion of) the target relations can be separated, since they remain correlated via Skolem functions. Therefore, the set of basic SO tgds that results after decomposition is equivalent to the input set of mappings.
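The completion of unspecified target attributes with Skolem terms during decomposition can be sketched as follows (an illustrative Python fragment under simplifying assumptions; in particular, attributes that the original mapping correlates across target relations, such as e·did and d·did in t2, would share a common Skolem term, which this single-relation sketch does not model):

```python
def to_basic_tgds(for_vars, targets):
    """targets maps each target relation to a dict of
    {attribute: source expression or None}.  Each target relation yields
    one basic SO tgd; every attribute the input mapping leaves
    unspecified (None) is assigned a Skolem term whose arguments are
    the for-clause variables."""
    args = ",".join(for_vars)
    basics = []
    for rel, attrs in sorted(targets.items()):
        completed = {a: (e if e is not None else f"F_{rel}_{a}[{args}]")
                     for a, e in attrs.items()}
        basics.append((rel, completed))
    return basics

# Mirrors t1 from the example: Group maps into Dept; dname is specified
# by the input mapping while did is not and receives a Skolem term.
basics = to_basic_tgds(["g"], {"Dept": {"dname": "g.gname", "did": None}})
```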
The second phase 604 is performed by the association module 108. In this phase the association module 108 applies a chase algorithm to compute associations (tableaux), from the source and target schemas, as well as from the source and target assertions of the input mappings. As discussed above, by pairing source and target tableaux, all (or at least a portion of) the possible skeletons of mappings are obtained. The algorithm 600 of
In more detail, the association module 108 applies the chase algorithm to compute syntactic associations (tableaux), from each of the schemas and from the input mappings. A schema tableau is constructed by taking each relation symbol in the schema and chasing it with all (or at least a portion of) the referential constraints that apply. The result of such a chase is a tableau that incorporates a set of relations that is closed under referential constraints, together with the join conditions that relate those relations. For each relation symbol in the schema, there is one schema tableau. In order to guarantee termination, the chase is stopped whenever cycles are encountered in the referential constraints. See, for example, A. Fuxman et al. and L. Popa et al. In the example given above with respect to
T1={gεGroup}
T2={wεWorks, gεGroup; w·gno=g·gno}
T3={dεDept}
T4={eεEmp, dεDept; e·did=d·did}
T5={pεProj, dεDept; p·did=d·did}
Schema tableaux represent the categories of data that can exist according to the schema. A Group record can exist independently of records in other relations (hence, the tableau T1). However, the existence of a Works record implies that there must exist a corresponding Group record with identical gno (hence, the tableau T2).
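The construction of a schema tableau by chasing a relation with referential constraints can be sketched as follows (a hypothetical Python fragment; the chase is simplified so that a relation is never added twice, which also stops the chase when cycles are encountered):

```python
def schema_tableau(root, fks):
    """fks is a list of referential constraints of the form
    (relation, attribute, referenced relation, referenced attribute).
    Starting from root, relations are added by chasing the constraints
    and the induced join conditions are collected; a relation already
    present is never re-added, which guarantees termination even for
    cyclic constraints."""
    rels, joins, frontier = [root], [], [root]
    while frontier:
        r = frontier.pop()
        for src, a, tgt, b in fks:
            if src == r and tgt not in rels:
                rels.append(tgt)
                joins.append(f"{src}.{a}={tgt}.{b}")
                frontier.append(tgt)
    return rels, joins

# Referential constraints of the example's source and target schemas.
fks = [("Works", "gno", "Group", "gno"),
       ("Emp", "did", "Dept", "did"),
       ("Proj", "did", "Dept", "did")]
t2 = schema_tableau("Works", fks)  # corresponds to tableau T2
t1 = schema_tableau("Group", fks)  # corresponds to tableau T1
```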
Since the MapMerge operation takes as input arbitrary mapping assertions, user-defined mapping tableaux are also generated, which are obtained by chasing the source and target assertions of the input mappings with the referential constraints that are applicable from the schemas. The notion of user-defined tableaux is similar to the notion of user associations in Y. Velegrakis, R. J. Miller, and L. Popa: Mapping Adaptation under Evolving Schemas, VLDB, pages 584-595, 2003, which is hereby incorporated by reference in its entirety. In the example given above with respect to
T2′={wεWorks, gεGroup; w·gno=g·gno, w·addr=“NY”}
The association module 108 then pairs every source tableau with every target tableau to form a skeleton. Each skeleton represents the empty shell of a candidate mapping. For the example in
The third phase 606, which is performed by the correlated mappings constructor 110, constructs correlated mappings. In this phase 606, the correlated mappings constructor 110, for each skeleton, takes the union of all (or at least a portion of) the basic components generated in the first phase 602 that “match” the skeleton. In particular, the algorithm 600 of
With respect to matching, a basic SO tgd σ matches a skeleton (T,T′) if there is a pair (h,g) of homomorphisms that “embed” σ into (T,T′). This translates into two conditions. First, the for and satisfying clause of σ are embedded into T via the homomorphism h. This means that h maps the variables in the for clause of σ to variables of T such that relation symbols are respected and, moreover, the satisfying clause of σ (after applying h) is implied by the conditions of T. Additionally, the exists clause of σ must be embedded into T′ via the homomorphism g. Since σ is a basic SO tgd and there is only one relation in its exists clause, the latter condition essentially states that the target relation in σ must occur in T′.
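The matching test can be sketched as follows (a deliberately simplified Python fragment: implication of the satisfying clause is approximated by set containment of textual conditions, and the homomorphisms are assumed to be the identity on variable names, which suffices for this illustration):

```python
def matches(basic, skeleton):
    """basic: (source relations, satisfying conditions, target relation).
    skeleton: (source tableau relations, source tableau conditions,
               target tableau relations).
    The basic SO tgd matches the skeleton if its source part embeds into
    the source tableau, its satisfying conditions are implied (here:
    simply contained) by the tableau's conditions, and its single target
    relation occurs in the target tableau."""
    src_rels, conds, tgt_rel = basic
    skel_src, skel_conds, skel_tgt = skeleton
    return (set(src_rels) <= set(skel_src)
            and set(conds) <= set(skel_conds)
            and tgt_rel in skel_tgt)

# b1 maps Group into Dept; b3 maps Works into Proj.
b1 = (["Group"], [], "Dept")
b3 = (["Works"], [], "Proj")
# Skeleton (T2, T4): Works joined with Group, paired with Emp and Dept.
skeleton = (["Works", "Group"], ["w.gno=g.gno"], ["Emp", "Dept"])
```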
For the example discussed above with respect to
(T1,T3,b1) (T1,T4,b1) (T1,T5,b1) (T2,T3,b1)
(T2,T4,b1) (T2,T5,b1b3) (T′2,T3,b1b′2)
Note that the basic SO tgds that match a given skeleton may actually come from different input mappings. For example, each of the basic SO tgds that match (T2′,T5) comes from a separate input mapping (from t1, t2, and t3, respectively). In a sense, behaviors from multiple input mappings are aggregated in a given skeleton.
With respect to computing merged SO tgds, the correlated mappings constructor 110, for each skeleton along with the matching basic SO tgds, constructs a “merged” SO tgd. For the example discussed above with respect to
(s8) for w in Works, g in Group
The variable bindings in the source and target tableaux are taken literally and added to the for and, respectively, exists clause of the new SO tgd. The equalities in T2′ and T4 are also taken literally and added to the satisfying and, respectively, where clause of the SO tgd. More interestingly, for every basic SO tgd σ that matches the skeleton (T2′, T4), the correlated mappings constructor 110 takes the where clause of σ (after applying the respective homomorphisms) and adds it to the where clause of the new SO tgd. Note that, by definition of matching, the satisfying clause of σ is automatically implied by the conditions in the source tableau. The last three lines in the above SO tgd incorporate conditions taken from each of the basic SO tgds that match (T2′,T4) (i.e., from b1, b2, and b2′, respectively).
The constructed SO tgd consolidates the semantics of b1, b2, and b2′ under one merged mapping. Intuitively, since all (or at least a portion of) three basic SO tgds are applicable whenever the source pattern is given by T2′ and the target pattern is given by T4, the resulting SO tgd takes the conjunction of the “behaviors” of the individual basic SO tgds.
With respect to correlations, a crucial point about the above construction is that a target expression may now be assigned multiple expressions. For example, in the above SO tgd, the target expression d·did is equated with two expressions: F[g] via b1, and G[w,g] via b2′. In other words, the semantics of the new constraint requires the values of the two Skolem terms to coincide. This is actually what it means to correlate b1 and b2′. Such a correlation can be represented, explicitly, as the following conditional equality (implied by the above SO tgd):
for w in Works, g in Group satisfying w·gno=g·gno and w·addr=“NY”
The term “residual equality constraint” is used for such an equality constraint, where one member of the implied equality is a Skolem term while the other is either a source expression or another Skolem term. Such constraints have to be enforced at runtime when data exchange is performed with the result of the MapMerge operation. In general, Skolem functions are implemented as (independent) lookup tables, where for every different combination of the arguments, the lookup table gives a fresh new null. However, residual constraints require correlation between the lookup tables. For example, the above constraint requires that the two lookup tables (for F and G) give the same value whenever w and g are tuples of Works and Group with the same gno value.
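The runtime enforcement of such a residual constraint over Skolem lookup tables can be sketched as follows (a hypothetical Python fragment; tuples are abbreviated by their gno values, and delegating G to F's table under the precondition is one possible way, among others, to keep the two tables correlated):

```python
from itertools import count

fresh = (f"N{i}" for i in count(1))  # generator of fresh labeled nulls
F, G = {}, {}                        # one lookup table per Skolem function

def lookup(table, key):
    """Returns the null stored for key, drawing a fresh one if absent."""
    if key not in table:
        table[key] = next(fresh)
    return table[key]

def g_correlated(w, g):
    """Implements G[w, g] subject to the residual constraint
    F[g] = G[w, g] whenever w.gno = g.gno: under that precondition,
    G delegates to F's table instead of drawing an independent null."""
    if w["gno"] == g["gno"]:
        return lookup(F, g["gno"])
    return lookup(G, (w["gno"], g["gno"]))

grp = {"gno": 123}
wrk = {"gno": 123}
did_f = lookup(F, grp["gno"])   # value of F[g]
did_g = g_correlated(wrk, grp)  # value of G[w, g]: the same null
```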
The other three merged SO tgds that are generated as a result of the completion of Phase 3 for the example in
(s1) from (T1,T3,b1)
(s6) from (T2,T5,b1b3):
(s9) from (T′2,T5,b1b′2b3):
One aspect to note is that not all skeletons generate merged SO tgds. Although there were six earlier skeletons, only three generate mappings that are neither subsumed nor implied. One embodiment uses the technique for pruning subsumed or implied mappings as discussed in A. Fuxman et al. For an example of a subsumed mapping, consider the triple (T1,T4,b1). A mapping for this is not generated because its behavior is subsumed by s1, which includes the same basic component b1 but maps into a more “general” tableau, namely T3. A mapping into T4, which is a larger (more specific) tableau, is not constructed without actually using the extra part of T4. Implied mappings are those that are logically implied by other mappings. For example, the mapping that would correspond to (T2,T3,b1) is logically implied by s6: they both have the same premise (T2), but s6 asserts facts about a larger tableau (T5, which includes T3) and already covers b1.
Finally, for the example given above with respect to
Since residual equalities cause extra overhead at runtime, it is worthwhile exploring when such constraints can be eliminated without changing the overall semantics. This optimization process is performed by the fourth phase 608 of the MapMerge operation. The fourth phase 608 is performed by the optimizer 112. In this phase the optimizer 112 performs a simplification process and also flags conflicts that may arise and that need to be addressed by the user. These conflicts occur when multiple mappings that map to the same portion of the target schema contribute with different, irreconcilable behaviors. In particular, the algorithm 600 of
Consider the earlier residual constraint stating the equality F[g]=G[w,g] (under the conditions of the for and satisfying clauses). The two Skolem terms F[g] and G[w,g] occur globally in multiple SO tgds. To avoid the explicit maintenance and correlation of two lookup tables (for both F and G), one embodiment attempts the substitution of either F[g] with G[w,g] or G[w,g] with F[g]. Care must be taken since such substitution cannot be arbitrarily applied. First, the substitution can only be applied in SO tgds that satisfy the preconditions of the residual equality constraint. For the current example, neither substitution can be applied to the earlier SO tgd s1, since the precondition requires the existence of a Works tuple that joins with Group.
In general, a check is performed for the existence of a homomorphism that embeds the preconditions of the residual equality constraint into the for and where clauses of the SO tgd. The second issue is that the direction of the substitution matters. For example, suppose F[g] is substituted by G[w,g] in every SO tgd that satisfies the preconditions. There are two such SO tgds: s8 and s9. After the substitution, in each of these SO tgds, the equality d·did=F[g] becomes d·did=G[w,g] and can be dropped, since it is already in the where clause. Note, however, that the replacement of F[g] by G[w,g] did not succeed globally. The SO tgds s1 and s6 still refer to F[g]. Hence, the explicit correlation of the lookup tables for F and G must be maintained. On the other hand, suppose G[w,g] is substituted by F[g] in every SO tgd that satisfies the preconditions. Again, there are two such SO tgds: s8 and s9. The outcome is different now: G[w,g] disappears from both s8 and s9 (in favor of F[g]); moreover, it did not appear in s1 or s6 to start with. It can be stated that the substitution of G[w,g] by F[g] has globally succeeded. Following this substitution, the constraint s9 is implied by s6: they both assert the same target tuples, and the source tableau T2′ for s9 is a restriction of the source tableau T2 for s6. Hence, from this point on, the constraint s9 can be discarded.
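The directional substitution and its global-success test can be sketched as follows (an illustrative Python fragment over string representations of the constraints; the fragments of s1, s6, s8, and s9 below are abbreviations, not the full SO tgds):

```python
def substitute(tgds, old, new, applicable):
    """tgds: {name: constraint text}.  old is replaced by new only in
    the constraints named in applicable (those satisfying the residual
    constraint's precondition); the second result reports whether the
    replacement succeeded globally, i.e. old no longer occurs anywhere."""
    out = {name: (t.replace(old, new) if name in applicable else t)
           for name, t in tgds.items()}
    return out, all(old not in t for t in out.values())

# Abbreviated fragments of the constraints discussed above.
tgds = {
    "s1": "d.did=F[g]",
    "s6": "d.did=F[g] and p.did=H2[w]",
    "s8": "d.did=F[g] and d.did=G[w,g]",
    "s9": "d.did=F[g] and d.did=G[w,g]",
}
# Only s8 and s9 satisfy the precondition of F[g]=G[w,g].
_, ok_f_to_g = substitute(tgds, "F[g]", "G[w,g]", {"s8", "s9"})
rewritten, ok_g_to_f = substitute(tgds, "G[w,g]", "F[g]", {"s8", "s9"})
```

As in the discussion above, replacing F[g] by G[w,g] fails globally (s1 and s6 still mention F[g]), while replacing G[w,g] by F[g] succeeds globally.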
Similarly, based on the other residual equality constraint discussed above, the substitution of H2[w] by F[g] can be applied. This affects only s6 and the outcome is that H2[w] has been successfully replaced globally. The resulting SO tgds, for the example of
(s1) for g in Group
(s′6) for w in Works, g in Group satisfying w·gno=g·gno
(s′8) for w in Works, g in Group
As explained above, both s′6 and s′8 can be simplified by removing the assertions about Dept, since they are implied by s1. The result is then identical to the SO tgds shown
The process shown in
The last issue that arises is the case of conflicts in mapping behavior. Conflicts can also be described via constraints, similar to residual equality constraints, but with the main difference that both members of the equality are source expressions (and not Skolem terms). To illustrate, it might be possible that a merged SO tgd asserts that the target expression d·dname is equal both to g·gname (from some input mapping) and to g·code (from some other input mapping, assuming that code is some other source attribute). Then, conflicting semantics are obtained with two competing source expressions for the same target expression. The optimization process shown in
Evaluation
To evaluate the quality of the data generated based on the MapMerge operation performed by the schema mapping merger 104, a measure can be utilized that captures the similarity between a source and target instance by measuring the amount of data associations that are preserved by the transformation from the source to the target instance. This similarity measure was used in experiments to show that the mappings derived by the MapMerge operation are better than the input mappings.
This similarity measure captures the extent to which the “associations” in a source instance are preserved when transformed into a target instance of a different schema. For each instance, a single relation is computed that incorporates all (or at least a portion of) the natural associations between data elements that exist in the instance. There are two types of associations that are considered. The first type is based on the chase with referential constraints and is naturally captured by tableaux. As discussed above with respect to the second phase 604 of the MapMerge operation, a tableau is a syntactic object that takes the “closure” of each relation under referential constraints. The join query that is encoded in each tableau can then be materialized and all (or at least a portion of) the attributes that appear in the input relations can be selected (without duplicating the foreign key/key attributes). Thus, for each tableau a single relation is obtained, referred to as “tableau relation” that conveniently materializes together data associations that span multiple relations. For example, the tableau relations for the source instance I 402 in
The second type of association that is considered is based on the notion of full disjunction. See for example, C. A. Galindo-Legaria: Outerjoins as Disjunctions, SIGMOD Conference, pages 348-358, 1994; and A. Rajaraman and J. D. Ullman: Integrating Information by Outerjoins and Full Disjunctions, PODS, pages 238-248, 1996, which are hereby incorporated by reference in their entireties. The full disjunction of relations R1, . . . , Rk, denoted as FD(R1, . . . , Rk), captures in a single relation all (or at least a portion of) the associations (via natural join) that exist among tuples of the input relations. The reason for using full disjunction is that tableau relations by themselves do not capture all (or at least a portion of) the associations. For example, consider the association that exists between John and Web in the earlier source instance J2. There, John is an employee in CS, and Web is a project in CS. However, since there is no directed path via foreign keys from John to Web, the two data elements appear in different tableau relations of τ(J2) (namely, DeptEmp and DeptProj). On the other hand, if the natural join between DeptEmp and DeptProj is taken the association between John and Web will appear in the result. Thus, to capture all (or at least a portion of) such associations, an additional step is applied that computes the full disjunction FD(τ(I)) of the tableau relations. This generates a single relation that conveniently captures all (or at least a portion of) the associations in an instance I of schema S. Each tuple in this relation corresponds to one association that exists in the data.
Operationally, full disjunction performs the outer “union” of all (or at least a portion of) the tuples in every input relation, together with all (or at least a portion of) the tuples that arise via all (or at least a portion of) possible natural joins among the input relations. To avoid redundancy, minimal union is used instead of union. This means that in the final relation, tuples that are subsumed by other tuples are pruned. A tuple t is subsumed by a tuple t′ if for all (or at least a portion of) attributes A such that t·A≠null, it is the case that t′·A=t·A. The full details of implementing full disjunction are omitted for simplicity, but such an implementation is part of the experimental evaluation.
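The subsumption test and minimal union just described can be sketched as follows (a hypothetical illustration, not the implementation used in the evaluation; tuples are represented as dictionaries and null as None):

```python
# Sketch of tuple subsumption and minimal union, as described above.

def subsumed(t, t_prime):
    """t is subsumed by t_prime if, for every attribute A with t.A != null,
    t_prime.A = t.A."""
    return all(t_prime.get(a) == v for a, v in t.items() if v is not None)

def minimal_union(tuples):
    """Keep only tuples that are not strictly subsumed by a different tuple."""
    result = []
    for t in tuples:
        strictly_subsumed = any(subsumed(t, u) and u != t for u in tuples)
        if not strictly_subsumed and t not in result:
            result.append(t)
    return result

t1 = {"ename": "John", "dname": "CS", "pname": None}
t2 = {"ename": "John", "dname": "CS", "pname": "Web"}
print(minimal_union([t1, t2]))
# → [{'ename': 'John', 'dname': 'CS', 'pname': 'Web'}]
```

Here t1 is pruned because t2 agrees with it on every non-null attribute while also carrying the project value.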
For the example shown in
Now that all (or at least a portion) of the associations are in a single relation, one on each side (source or target), they can be compared. More precisely, given a source instance I and a target instance J, the similarity between I and J is defined as the similarity between FD(τ(I)) and FD(τ(J)). However, when tuples between FD(τ(I)) and FD(τ(J)) are compared, arbitrary pairs of attributes should not be compared. To avoid capturing “accidental” preservations, tuples are compared based only on their compatible attributes that arise from the mapping. In the following, it is assumed that all (or at least a portion of) the mappings to be evaluated implement the same set V of correspondences between attributes of the source schema S and attributes of the target schema T. This assumption is true for mapping generation algorithms, which start from a set of correspondences and generate a faithful implementation of the correspondences (without introducing new attribute-to-attribute mappings). It is also true for the MapMerge operation and its input, since the MapMerge operation, in one embodiment, does not introduce any new attribute-to-attribute mappings that are not already specified by the input mappings. Given a set V of correspondences between S and T, an attribute A of S is compatible with an attribute B of T if either (1) there is a direct correspondence between A and B in V, or (2) A is related to an attribute A′ via a foreign key constraint of S, B is related to an attribute B′ via a foreign key constraint of T, and A′ is compatible with B′. For the example in
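The recursive compatibility test can be sketched as follows (a hypothetical illustration; attribute names are invented, and acyclic foreign key chains are assumed so the recursion terminates):

```python
# Sketch of the attribute compatibility test described above.
# V is a set of direct correspondences (A, B); src_fk and tgt_fk map an
# attribute to the key attribute it references via a foreign key.

def compatible(a, b, V, src_fk, tgt_fk):
    """A is compatible with B if (1) (A, B) is a direct correspondence in V,
    or (2) A references A' and B references B' via foreign keys and
    A' is compatible with B'. Assumes acyclic foreign key chains."""
    if (a, b) in V:
        return True
    if a in src_fk and b in tgt_fk:
        return compatible(src_fk[a], tgt_fk[b], V, src_fk, tgt_fk)
    return False

# Toy example: the key correspondence Dept.dno -> TDept.dno makes the
# foreign keys Emp.dno and TEmp.dno compatible with each other.
V = {("Dept.dno", "TDept.dno")}
src_fk = {"Emp.dno": "Dept.dno"}
tgt_fk = {"TEmp.dno": "TDept.dno"}
print(compatible("Emp.dno", "TEmp.dno", V, src_fk, tgt_fk))
# → True
```

Unrelated attributes, such as a source name attribute paired with a target key, fail both conditions and are reported as incompatible.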
Definition 1 (Tuple similarity): Let t1 and t2 be two tuples in FD(τ(I)) and, respectively, FD(τ(J)). The similarity of t1 and t2, denoted as Sim(t1,t2), is defined as the ratio between the number of compatible attribute pairs (A, B) such that t1·A=t2·B≠null and the number of compatible attribute pairs (A, B) such that t1·A≠null.
Sim(t1,t2) captures the ratio of the number of values that are actually exported from t1 to t2 versus the number of values that could be exported from t1 according to V. For instance, let t1 be the only tuple in FD(τ(I)) 1210 in
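This ratio can be sketched in a few lines (a hypothetical illustration with invented attribute names; `pairs` stands for the set of compatible attribute pairs derived from V):

```python
# Sketch of Definition 1: the fraction of t1's exportable (non-null)
# values that actually reach t2 through compatible attribute pairs.

def tuple_similarity(t1, t2, pairs):
    exportable = [(a, b) for a, b in pairs if t1.get(a) is not None]
    if not exportable:
        return 0.0
    exported = [(a, b) for a, b in exportable if t2.get(b) == t1[a]]
    return len(exported) / len(exportable)

pairs = {("ename", "EName"), ("dname", "DName")}
t1 = {"ename": "John", "dname": "CS"}
t2 = {"EName": "John", "DName": None}
print(tuple_similarity(t1, t2, pairs))  # → 0.5
```

In the toy example only the employee name is preserved, while the department name is lost, so the similarity is 1/2.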
Definition 2 (Instance similarity): The similarity between FD(τ(I)) and FD(τ(J)) is
Experiments
To evaluate the MapMerge operation, the inventors conducted a series of experiments on a set of synthetic mapping scenarios as well as on two real-life mapping scenarios from the biological domain. The synthetic mapping scenarios transform data from a de-normalized source schema with a single relation to a target schema containing a number of hierarchies, with each hierarchy having at its top an “authority” relation, while the other relations refer to the authority relation through foreign key constraints. The target schema corresponds roughly to ontological schemas, which often contain top-level concepts that are referred to by many sub-concepts. The synthetic scenarios were also designed to scale, so that both the running-time performance of the MapMerge operation and the improvement in target data quality can be measured as the schemas increase in size.
As
Operational Flow Diagrams
Referring now to
The schema mapping merger 104, at step 1408, determines a first set of relations for the source schema and a second set of relations for the target schema. The first set of relations and the second set of relations are related by referential constraints within the source schema and the target schema. The schema mapping merger 104, at step 1410, generates a set of skeleton mappings (relation pairs) based on the first and second set of relations. The schema mapping merger 104, at step 1412, identifies, for each skeleton mapping, a set of basic schema mappings from the plurality of basic schema mappings that matches the skeleton mapping. The schema mapping merger 104, at step 1414, merges each matching basic schema mapping into a single schema mapping. The schema mapping merger 104, at step 1416, then performs a syntactic simplification process on the single schema mapping. The control flow then exits at step 1418.
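The matching and merging steps above (steps 1412 and 1414) can be sketched at a high level as follows; this is a hypothetical toy rendering, with basic mappings reduced to triples and merging reduced to conjunction of mapping bodies, not the actual operation of the schema mapping merger 104:

```python
# Toy sketch of matching basic mappings to skeleton mappings (relation
# pairs) and merging the matches into a single mapping per skeleton.

def map_merge(basic_mappings, skeletons):
    """basic_mappings: list of (source_rel, target_rel, body) triples.
    skeletons: list of (T, T_prime) relation pairs."""
    merged = []
    for (T, T_prime) in skeletons:
        # Step 1412: identify the basic mappings that match the skeleton.
        matching = [body for (s, t, body) in basic_mappings
                    if s == T and t == T_prime]
        if matching:
            # Step 1414: merge the matching mappings into a single mapping.
            merged.append(((T, T_prime), " AND ".join(matching)))
    return merged

basics = [("Emp", "TEmp", "ename->EName"), ("Emp", "TEmp", "dno->DNo")]
print(map_merge(basics, [("Emp", "TEmp")]))
# → [(('Emp', 'TEmp'), 'ename->EName AND dno->DNo')]
```

The two basic mappings over the same relation pair collapse into one merged mapping, which a subsequent simplification step (step 1416) would then clean up.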
Information Processing System
Referring now to
The information processing system 1500 includes a computer 1502. The computer 1502 has a processor(s) 1504 that is connected to a main memory 1506, mass storage interface 1508, and network adapter hardware 1510. A system bus 1512 interconnects these system components. Although only one CPU 1504 is illustrated for computer 1502, computer systems with multiple CPUs can be used equally effectively. The main memory 1506, in this embodiment, comprises the mapping tool 103, the schema mapping merger 104 and its components, and the schema mappings 105.
The mass storage interface 1508 is used to connect mass storage devices, such as mass storage device 1514, to the information processing system 1500. One specific type of data storage device is an optical drive such as a CD/DVD drive, which can be used to store data to and read data from a computer readable medium or storage product such as (but not limited to) a CD/DVD 1516. Another type of data storage device is a data storage device configured to support, for example, NTFS type file system operations.
An operating system included in the main memory is a suitable multitasking operating system such as any of the Linux, UNIX, Windows, and Windows Server based operating systems. Embodiments of the present invention are also able to use any other suitable operating system. Some embodiments of the present invention utilize architectures, such as an object oriented framework mechanism, that allow instructions of the components of the operating system to be executed on any processor located within the information processing system 1500. The network adapter hardware 1510 is used to provide an interface to a network 1518. Embodiments of the present invention are able to be adapted to work with any data communications connections including present day analog and/or digital techniques or via a future networking mechanism.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Aspects of the present invention may be embodied as a system, method, or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention have been discussed above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiments above were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
This invention was made with Government support under Contract No.: FA9550-07-1-0223, FA9550-06-1-0226 Part VI, Article 20 of FA9550-06-1-0226 awarded by U.S. Air Force, Office of Scientific Research. The Government has certain rights in this invention.