The present invention relates to the storing of hierarchically structured data, and more particularly to the establishment of relationships between hierarchically structured schema components and their effects on relations and content of tuples.
eXtensible Markup Language (XML) schemas are becoming increasingly popular as a means to describe XML data. But the XML described by the XML schema is still often stored in relational tables. Some conventional approaches decompose XML documents into relational structures using various mapping schemes. However, these approaches do not take into consideration how the components of the XML schema, as defined by W3C, can be used to determine the structure of the relations and the contents of the tuples that can be generated. They use the XML schema only as a mapping of an element or attribute in the XML document to a particular column of the relational table. They do not consider the various constructs of an XML schema that may affect the cardinality between the attributes of a relation, and therefore the contents of the tuples. As used in this specification, “structure of relations” refers to the cardinality between the attributes of the relation.
Accordingly, there exists a need for a method for determining relationships between the hierarchically structured schema components and their effects on the structure of relations and content of tuples. The present invention addresses such a need.
A method for determining relationships between hierarchically structured schema components and their effects on structure of relations and content of tuples, includes: analyzing the hierarchically structured schema with user-defined mappings and finding elements and/or attributes mapped to a same relational table; determining relationships between the elements or attributes to be either a one-to-one relationship or a one-to-many relationship based on an information set in the hierarchically structured schema; recording the relationships; and processing a hierarchically structured document against the recorded relationships and generating tuples accordingly. The constructs of a hierarchically structured schema that may affect the cardinality between the attributes of a relation, and thus the contents of the tuples, are considered. A relationship between the hierarchically structured schema model and a relational model is established.
The present invention provides a method for determining relationships between hierarchically structured schema components and their effects on the structure of relations and content of tuples. The following description is presented to enable one of ordinary skill in the art to make and use the invention and is provided in the context of a patent application and its requirements. Various modifications to the preferred embodiment will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. Thus, the present invention is not intended to be limited to the embodiment shown but is to be accorded the widest scope consistent with the principles and features described herein.
To more particularly describe the features of the present invention, please refer to
XML Schemas
Below is an example XML schema:
Relation Structure
The structure of a relation is a set of attributes that describes an entity, such as a purchase order or an employee. A relation is conventionally expressed as a set of functional dependencies between sets of attributes of the same relation. Besides the conventional approach, this invention looks at the relationship between the sets of attributes of a relation, that is, the structure of the relation, in terms of the cardinality of the attribute sets, in other words, the one-to-one or one-to-many relationships between them. Any use of the term “structure of a relation” in this specification refers to this approach.
Any relation r(R), where R is the set of attributes, can be divided into subsets, such that they have either a one-to-one relationship or a one-to-many relationship with each other. Furthermore, this invention applies an additional restriction on the structure of a relation. If there exist attribute sets a, b, and c, such that a⊂R, b⊂R, and c⊂R and a ∩ b ∩ c = ∅, the relation r(R) can have a one-to-many relationship between a and b and between a and c, identified as a π b and a π c, if and only if there exists b π c. This implies that a π c must be a transitively deduced relationship. Thus, a set cannot participate in a one-to-many relationship with two other sets without there being a one-to-many relationship between the other two. For this specification, when a relation is in first normal form (1NF) and satisfies the above condition, it is said to be in “shred normalized form”.
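The condition above can be illustrated with a minimal Python sketch. The function name `is_shred_normalized` and the representation of relationships as `(one_side, many_side)` pairs are assumptions made for illustration only. The sketch also assumes that a one-to-many relationship between the two "many" sides in either direction satisfies the condition, and it does not check 1NF.

```python
from itertools import combinations


def is_shred_normalized(one_to_many):
    """Check the one-to-many restriction on a set of attribute-set relationships.

    one_to_many: set of (one_side, many_side) pairs, each naming an attribute set.
    Hypothetical representation; 1NF is assumed to hold already.
    """
    # Collect, for every "one" side, all attribute sets it fans out to.
    many_sides = {}
    for one, many in one_to_many:
        many_sides.setdefault(one, set()).add(many)

    # If a set fans out to two other sets, those two must themselves be
    # related one-to-many (assumed here: in either direction).
    for targets in many_sides.values():
        for b, c in combinations(sorted(targets), 2):
            if (b, c) not in one_to_many and (c, b) not in one_to_many:
                return False
    return True


# a -> b, b -> c, and the transitively deduced a -> c: allowed.
print(is_shred_normalized({("a", "b"), ("b", "c"), ("a", "c")}))
# a -> b and a -> c with no relationship between b and c: rejected.
print(is_shred_normalized({("a", "b"), ("a", "c")}))
```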
To illustrate the cardinality relationship between attribute sets of a relation, consider the following PurchaseOrder relation:
PurchaseOrder (POID, ITEMID, QTY, PRICE)
Note that for the same value of POID, there can be more than one distinct set of ITEMID, QTY and PRICE values. Therefore, there is a one-to-many relationship between the attribute POID and the set {ITEMID, QTY, PRICE}. Since there is only a single one-to-many relationship involving POID, the relation is in shred normalized form.
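As a small illustration of this cardinality, the following Python sketch groups hypothetical PurchaseOrder rows by POID. The sample rows and the helper name `is_one_to_many` are invented for this example. The method described in this specification derives cardinality from the schema rather than from data, so the sketch only shows what the one-to-many relationship looks like in the tuples.

```python
from collections import defaultdict

# Hypothetical PurchaseOrder tuples: (POID, ITEMID, QTY, PRICE).
rows = [
    ("PO1", "I1", 2, 9.99),
    ("PO1", "I2", 1, 24.50),
    ("PO2", "I3", 5, 3.00),
]


def is_one_to_many(rows, left_idx, right_idx):
    """Return True if some left-side value pairs with more than one right-side value."""
    groups = defaultdict(set)
    for row in rows:
        left = tuple(row[i] for i in left_idx)
        right = tuple(row[i] for i in right_idx)
        groups[left].add(right)
    return any(len(values) > 1 for values in groups.values())


# POID versus {ITEMID, QTY, PRICE}: PO1 has two distinct item sets.
print(is_one_to_many(rows, [0], [1, 2, 3]))  # True
# ITEMID versus {QTY, PRICE}: each item has one quantity and price here.
print(is_one_to_many(rows, [1], [2, 3]))     # False
```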
An XML schema inherently contains one-to-one, one-to-many, and many-to-many relationships between elements. Since a relation, as shown above, can also be expressed as a set of one-to-one and one-to-many relationships, the method in accordance with the present invention establishes a relationship between the XML schema model and the relational model, as described below.
Relationships Between XML Schema Components and Their Effects on the Structure of Relations and Content of Tuples
If the maxOccurs properties for the Particles P1 and P00 are equal to 1 and greater than 1, respectively, then a one-to-many relationship between the elements is recorded, via step 403. The resulting relation would look as follows: R={b ∴ {c, d}}. Here, the set {c, d} can occur more than once for one occurrence of element b. Thus, there is a one-to-many relationship between the set {b} and the set {c, d}.
If the maxOccurs properties for the Particles P1 and P00 are greater than 1 and equal to 1, respectively, then a many-to-one relationship between the elements is recorded, via step 405. The resulting relation would look as follows: R={{c, d}πb}. This means that there might be one or more occurrences of the element b for a single occurrence of the set {c, d}. Thus, the one-to-many relationship is reversed, i.e., there is a one-to-many relationship between the set of elements {c, d} and the set {b}.
If the maxOccurs properties for both Particles P1 and P00 are greater than 1, then there is an error, via step 405, because this will not always produce a shred normalized relation.
Steps 402 through 405 are repeated until all elements mapped to the same relational table are found, via step 406. In this embodiment, the relationships are recorded in a data structure.
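The decision made from the two maxOccurs values can be sketched in Python as follows. The function names are hypothetical, and the one-to-one outcome for the case where both values equal 1 is an assumption, since that case is not spelled out in the passage above. The value `"unbounded"` is the XML Schema keyword for an unlimited maxOccurs.

```python
def occurs_many(max_occurs):
    """True if a maxOccurs value (an int or the string 'unbounded') exceeds 1."""
    return max_occurs == "unbounded" or int(max_occurs) > 1


def classify_relationship(max_p1, max_p00):
    """Classify the relationship between element b (under P1) and the set {c, d} (under P00)."""
    many_p1 = occurs_many(max_p1)
    many_p00 = occurs_many(max_p00)
    if many_p1 and many_p00:
        # Both sides repeat: not guaranteed to yield a shred normalized relation.
        raise ValueError("many-to-many mapping is not allowed")
    if many_p00:
        return "one-to-many"   # one b, many {c, d}
    if many_p1:
        return "many-to-one"   # many b, one {c, d}
    return "one-to-one"        # assumed outcome when both maxOccurs are 1


print(classify_relationship(1, "unbounded"))  # one-to-many
print(classify_relationship(5, 1))            # many-to-one
```

The returned label would then be stored in whatever data structure records the relationships for the table being analyzed.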
As illustrated above, Particles affect the structure of a relation. In addition, ModelGroups also have an effect. Unlike Particles, a ModelGroup affects the content of the tuples that are generated. Because ModelGroups in an XML schema describe the layout of the underlying elements that are mapped to the columns of the same relation, they have a direct impact on what is produced as a tuple. For example, while a ModelGroup of type sequence specifies the order in which elements should appear in the XML document, a ModelGroup of type all allows for the elements to appear in any order. This simple change, in combination with the value of maxOccurs, can cause a significant difference in the tuples that are generated. To illustrate this, consider the example XML schema shown in
First, consider the example where P0 has maxOccurs>1 and the ModelGroup is of type sequence. Consider also the two XML Documents 1 and 2, illustrated in
In Document 2, there is only one instance of MG, since the elements of the ModelGroup have appeared in the expected order. Therefore, only one tuple is generated, as follows:
Now, assume that MG is of type all, which means that P0 must have maxOccurs=1 to ensure determinism, according to the W3C specification. Since the order is not important for ModelGroups of type all, both Document 1 and Document 2 contain only one instance of MG. A change of the type to all thus would generate only one tuple from both documents, as follows:
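The difference between the sequence and all cases can be sketched by counting ModelGroup instances in Python. This is only an illustration under stated assumptions: the function name is hypothetical, every Particle inside the ModelGroup is assumed to be optional with maxOccurs=1, and the element orders used below (d, c, b for an out-of-order document and b, c, d for an in-order one) are assumed rather than taken from the figures.

```python
def count_instances(group_type, declared_order, document_order):
    """Count ModelGroup instances implied by the order of elements in a document."""
    if group_type == "all":
        # Order is irrelevant, so the elements form at most one instance.
        return 1 if document_order else 0
    if group_type == "sequence":
        instances = 0
        last_pos = None
        for name in document_order:
            pos = declared_order.index(name)
            # An element at or before the previous position cannot belong to
            # the current instance, so it starts a new one.
            if last_pos is None or pos <= last_pos:
                instances += 1
            last_pos = pos
        return instances
    raise ValueError("unsupported ModelGroup type: " + group_type)


declared = ["b", "c", "d"]
print(count_instances("sequence", declared, ["d", "c", "b"]))  # 3
print(count_instances("sequence", declared, ["b", "c", "d"]))  # 1
print(count_instances("all", declared, ["d", "c", "b"]))       # 1
```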
Now, assume that MG is of type choice. Only one of the elements specified in the ModelGroup can appear for any instance of the ModelGroup. If MG is of type choice and P0 has maxOccurs>1, the resulting tuples for Document 1 and Document 2 would be the same, since each instance of an element under the choice ModelGroup is an instance of the ModelGroup itself. Conceptually, this is equivalent to making three copies of the component model, whereby in each copy the choice ModelGroup is replaced by a sequence ModelGroup with a single Particle, P1, P2, or P3, under it. The appropriate component model is then used during decomposition, depending on which element appeared in the instance document. Therefore, to handle XML schemas that contain choice ModelGroups, the following step is added during the analysis of the XML schema, before the determination of the cardinality of relationships between attribute sets: where there is a choice ModelGroup with N particles in the XML schema, create N copies of the component model, where the choice ModelGroup is replaced by a sequence ModelGroup containing a single particle, each particle being different in each copy. This “cloning” process is repeated for each choice ModelGroup in the set of new copies of the component model until no choice ModelGroup remains. The final set of copies of the component model is used in the step of determining relationship cardinality. Likewise, in determining whether an XML schema with choice ModelGroups satisfies shred normalized form, the final set of clones, rather than the original XML schema, is used.
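The cloning step can be sketched in Python as a recursive expansion. The representation of the component model as nested `(group_type, children)` tuples with element names as strings, and the function name `expand_choices`, are assumptions made for illustration; a real component model would carry Particles and their properties as well.

```python
from itertools import product


def expand_choices(node):
    """Return every choice-free copy of a component model.

    node is either an element name (str) or a (group_type, children) tuple.
    """
    if isinstance(node, str):
        return [node]
    group_type, children = node
    # Expand each child first, so nested choice ModelGroups are handled too.
    child_variants = [expand_choices(child) for child in children]
    if group_type == "choice":
        # One copy per alternative, each wrapped in a single-particle sequence.
        return [("sequence", [variant])
                for variants in child_variants
                for variant in variants]
    # For other ModelGroups, combine every variant of every child.
    return [(group_type, list(combo)) for combo in product(*child_variants)]


model = ("sequence", ["a", ("choice", ["b", "c", "d"])])
for clone in expand_choices(model):
    print(clone)
# ('sequence', ['a', ('sequence', ['b'])])
# ('sequence', ['a', ('sequence', ['c'])])
# ('sequence', ['a', ('sequence', ['d'])])
```

Each resulting copy can then be analyzed for relationship cardinality and checked for shred normalized form on its own.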
The following result would be produced for both documents, as follows:
Note that a mapping where MG is of type choice and Particles P1, P2 and P3 have maxOccurs>1 is not considered an instance of illegal many-to-many mapping. This is because the type of the model group enforces that elements b, c or d can appear only in a mutually exclusive manner for any instance of the choice ModelGroup. The following relation is inferred for such a mapping:
It can be seen that the property of shred normalized form is still retained for the relation R, shown above, due to the content model enforced by the type of the model group. For any instance of the choice ModelGroup there will only be a single one-to-many relationship i.e. id ∴ b or id ∴ c or id ∴ d. It can also be seen that this is an exception, where a seemingly many-to-many relationship is permitted. A legal many-to-many mapping is therefore now defined as follows: a mapping is considered to be a legal many-to-many relationship between two information items if and only if the lowest common ancestor model group of the two items is a choice model group.
While in the above example, with choice model group, elements b, c and d are mapped to different columns of the same table, it would also be desirable, in some customer scenarios, that elements b, c and d be mapped to the same column of the same table.
The semantics implied by this approach for such a mapping are that the information items that appear for a particular instance of the choice ModelGroup are applied to the tuple. For the above example, consider now that the elements b, c and d are mapped to the same table-column pair. For both Document 1 and Document 2, the following set of tuples will be created:
Note that the two items mapped to the same table-column pair need not be direct children of the choice model group. An “effective choice model group” is computed for this purpose. Any two items that are mapped to the same table-column pair are considered to be part of the same effective choice model group if and only if the lowest common ancestor ModelGroup of the two items is a choice ModelGroup. Any pair of items that are mapped to the same table-column and belong to the same effective choice model group will produce tuples with the semantics as shown above.
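The lowest-common-ancestor test used both for the effective choice model group and for the legal many-to-many definition can be sketched in Python. Here each item is described by the path of ModelGroups from the schema root down to the item, as `(group_id, group_type)` pairs; this representation, the identifiers `mg0`, `mg1`, `mg2`, and the function names are hypothetical.

```python
def lowest_common_model_group(path_a, path_b):
    """Return the deepest ModelGroup shared by two root-to-item paths, or None."""
    lca = None
    for group_a, group_b in zip(path_a, path_b):
        if group_a != group_b:
            break
        lca = group_a
    return lca


def in_same_effective_choice_group(path_a, path_b):
    """True if the lowest common ancestor ModelGroup of the two items is a choice."""
    lca = lowest_common_model_group(path_a, path_b)
    return lca is not None and lca[1] == "choice"


# Element b is a direct child of the choice; element d sits in a nested sequence.
path_b = [("mg0", "sequence"), ("mg1", "choice")]
path_d = [("mg0", "sequence"), ("mg1", "choice"), ("mg2", "sequence")]
# An item outside the choice shares only the outer sequence with b.
path_outer = [("mg0", "sequence")]

print(in_same_effective_choice_group(path_b, path_d))      # True
print(in_same_effective_choice_group(path_b, path_outer))  # False
```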
Now consider for the above example that elements b, c and d are mapped to different table-column pairs, tab1.col2, tab2.col2 and tab3.col2 respectively. Also the attribute id is mapped to tab1.col1, tab2.col1 and tab3.col1. As explained above, for Document1 there are three instances of the choice ModelGroup. However, for the first instance of choice ModelGroup, the elements b and c are absent, for the second instance of the choice ModelGroup elements b and d are absent and for the third instance elements c and d are absent. For absent items, nulls are written in the cells of the tuples that they are mapped to. Therefore, this would produce the following tuples for each of the tables
Clearly, this is not a desirable result, since extraneous rows are produced that contain no information. To make matters worse, suppose that elements c and d never appeared in an instance document, but there were 100 occurrences of element b. This would produce 100 rows in each table. In tab1, column col2 would hold information related to each occurrence of element b, but in tables tab2 and tab3, column col2 would contain null for all 100 rows.
To overcome the problem of extraneous rows, the following existential condition is applied to choice ModelGroups: a tuple is created for an item that is directly or indirectly contained in a choice ModelGroup, if and only if, the choice ModelGroup has occurred in response to the occurrence of an element, in the instance document, that is a descendant of the choice ModelGroup, and is either the mapped item itself or an ancestor of the mapped item.
The implication of this rule on the above example would be the following set of tuples for each of the tables:
Note that now the tuples are produced only when the instance of choice model group occurs for the items mapped in that tuple.
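The effect of the existential condition can be sketched in Python for the three-table mapping discussed above. The function name, the id value, the element values, and the order of elements in the sample document (d, then c, then b) are assumptions for illustration only.

```python
# Element-to-table mapping from the example: b -> tab1, c -> tab2, d -> tab3.
MAPPING = {"b": "tab1", "c": "tab2", "d": "tab3"}


def shred(id_value, occurrences, apply_existential_condition=True):
    """Produce (col1, col2) tuples per table for one parent with the given id.

    occurrences: list of (element_name, value), one per choice ModelGroup instance.
    """
    tables = {table: [] for table in MAPPING.values()}
    for name, value in occurrences:
        for element, table in MAPPING.items():
            if element == name:
                # The choice instance occurred for the item mapped to this table.
                tables[table].append((id_value, value))
            elif not apply_existential_condition:
                # Without the condition, absent items yield null-filled rows.
                tables[table].append((id_value, None))
    return tables


document = [("d", "d1"), ("c", "c1"), ("b", "b1")]
print(shred(1, document, apply_existential_condition=False))
print(shred(1, document))
```

With the condition disabled, each table receives three rows, two of them carrying a null in col2. With the condition applied, each table receives only the single row for the element that actually occurred.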
There is an additional subtlety that occurs for the following instance document:
As illustrated above, the method in accordance with the present invention uses the type of the ModelGroup and the maxOccurs property of the enclosing Particle to determine the content and number of tuples.
Optionally, to simplify implementation, the following rules can be applied:
(1) There can be any number of entities involved in a relation, but only one-to-one or one-to-many relationships are allowed between them, to ensure that the tuples that are generated are in shred normalized form. A pair of sets of attributes can be involved in a one-to-many relationship, such that the set of attributes that has a cardinality of one in the relationship will be a level above the set of attributes that forms the many part of the one-to-many relationship. There can be any number of such levels, since a relation may have any number of entities.
(2) There can be no illegal many-to-many relationships and at most a single one-to-many relationship at any level. Otherwise, it is considered an error. A many-to-many relationship between two elements/attributes is legal only if the lowest common ancestor model group of both elements/attributes is a choice model group. In other words, if there are three entities x, y, and z, such that x has a one-to-many relationship with y and a one-to-many relationship with z, then only one of these relationships can exist at the same level. But if x has a one-to-one relationship with z, then the relationships between x and y, and x and z, can exist at the same level.
(3) The end of the topmost component that identifies the beginning of a repetitive subset, e.g. Particle or ModelGroup, marks the end of all possible tuples. The beginning of any inner repetitive subset triggers initiation of a new tuple if it is not the first repetition within its parent repetitive set.
A method for determining relationships between hierarchically structured schema components and their effects on structure of relations and content of tuples, includes: analyzing the hierarchically structured schema with user-supplied mappings, making copies of the component model in which a choice ModelGroup with N particles is replaced by a sequence ModelGroup with one particle under the ModelGroup, each particle being different in each copy; and in each copy of the component model, finding elements mapped to a same relational table; determining relationships between the elements to be either a one-to-one relationship or a one-to-many relationship based on the information set in the hierarchically structured schema; recording the relationships; and processing a hierarchically structured document against the recorded relationships and generating tuples accordingly. The constructs of a hierarchically structured schema that may affect the cardinality between the attributes of a relation, and thus the contents of the tuples, are considered. A relationship between the hierarchically structured schema model and a relational model is established.
Although the present invention has been described in accordance with the embodiments shown, one of ordinary skill in the art will readily recognize that there could be variations to the embodiments and those variations would be within the spirit and scope of the present invention. Accordingly, many modifications may be made by one of ordinary skill in the art without departing from the spirit and scope of the appended claims.