1. Field of the Invention
The embodiments herein generally relate to data storage and conversion, and, more particularly, to data management and transformation for storing documents into relational databases.
2. Description of the Related Art
In the information technology (IT) industry, efficiently storing eXtensible Markup Language (XML) data in a persistent repository, such as a relational database, is a major technical problem. The reason is that XML is widely used and is emerging as the de facto standard format of message exchange between applications running on different computer systems. An XML schema or Document Type Definition (DTD) is called recursive if it allows an element to contain another element with the same name as a descendant element. The possible sequence of these recursive elements can be represented by an expression in an XPath format, hereinafter referred to as a “recursive XPath.” A recursive XML schema or DTD should preferably have at least one recursive XPath. Hereinafter, an XML document conforming to a recursive XML schema or DTD is called a “recursive XML document.”
There are many business applications that require the use of recursive XML, such as applications in the life sciences, the insurance industry, and manufacturing. In fact, any information object represented in XML which contains at least one child (or descendant) element with the same features as itself should be defined as recursive. For example, a part can contain another part as a sub-part, which itself can contain a sub-part. Therefore, the part information should be described using recursive XML.
A unique feature of recursive XML is that a portion of the document can have the same structure as the whole document. Because of this feature, the depth of a recursive XML document is not predetermined. For a recursive XML schema/DTD structure, an XML document instance conforming to the structure could have arbitrarily many levels of recursion. The level of recursion is defined herein as the number of occurrences of the same XML element name in a path from a root node to a leaf node. In practice, documents usually have only a limited number of levels of recursion. Notwithstanding advances in the industry, there remains a need for a new technique of converting hierarchical data to relational data.
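By way of illustration only, the level of recursion defined above can be computed with a short sketch such as the following. The function name recursion_level and the sample part document are hypothetical, and taking the maximum count over all element names and all root-to-leaf paths is an assumed reading of the definition rather than a requirement of the embodiments.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def recursion_level(root):
    """Return the largest number of times one element name repeats on any root-to-leaf path."""
    best = 0
    # Each stack entry pairs an element with the element-name counts on its path from the root.
    stack = [(root, Counter({root.tag: 1}))]
    while stack:
        node, counts = stack.pop()
        best = max(best, max(counts.values()))
        for child in node:
            child_counts = counts.copy()
            child_counts[child.tag] += 1
            stack.append((child, child_counts))
    return best


# Hypothetical part document: a part containing a sub-part containing a sub-part.
doc = ET.fromstring(
    "<part><name>engine</name>"
    "<part><name>piston</name>"
    "<part><name>ring</name></part></part></part>"
)
print(recursion_level(doc))  # the element name "part" occurs three times on the deepest path
```

A document with no repeated element name on any path yields a level of one under this reading.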
In view of the foregoing, the embodiments herein provide a method of converting a recursive XML document into a relational schema, and a program storage device readable by computer, tangibly embodying a program of instructions executable by the computer to perform a method of converting a recursive XML document into a relational schema, wherein the method comprises providing a recursive XML document; parsing an external mapping script specifying a mapping from the recursive XML document to a relational table format; building a recursive shredding tree based on the external mapping script and the relational table format; and shredding the mapped recursive XML document into a relational table. The method may further comprise detecting whether any of an XML schema and a DTD document is recursive, wherein the detecting comprises building a directed graph with element names as nodes in the directed graph; forming arcs from every element parent node to every element child node of the element parent node; and checking for cycles in the directed graph.
The method may further comprise identifying all recursive cursor nodes and a recursive degree corresponding to the recursive shredding tree. Additionally, the method may further comprise mapping recursive elements of the recursive XML document to shredding tree nodes of the recursive shredding tree. Preferably, the recursive shredding tree comprises a working area hashtable. Moreover, the method may further comprise storing all XPaths of the recursive shredding tree in a global lookup table; performing a depth-first tree traversal of the recursive shredding tree; computing a current XPath for each node in the recursive XML document; comparing the XPath to each of the stored XPaths in the global lookup table; and determining, for all matched XPaths, a corresponding set of arrays comprising tuples of shredded data in the recursive shredding tree.
Another embodiment provides a system of converting a recursive XML document into a relational schema, wherein the system comprises a recursive XML document; a parser adapted to parse an external mapping script specifying a mapping from the recursive XML document to a relational table format; a recursive shredding tree formatted based on the external mapping script and the relational table format; and a relational table comprising the mapped recursive XML document. The system may further comprise a first mechanism adapted to detect whether any of an XML schema and a DTD document is recursive by building a directed graph with element names as nodes in the directed graph; forming arcs from every element parent node to every element child node of the element parent node; and checking for cycles in the directed graph.
Preferably, the parser is adapted to identify all recursive cursor nodes and a recursive degree corresponding to the recursive shredding tree. Also, the system may further comprise a mapping mechanism adapted to map recursive elements of the recursive XML document to shredding tree nodes of the recursive shredding tree. Preferably, the mapping mechanism comprises a global lookup table. Furthermore, the recursive shredding tree preferably comprises a working area hashtable. The system may further comprise a runtime methodology module adapted to store all XPaths of the recursive shredding tree in a global lookup table; perform a depth-first tree traversal of the recursive shredding tree; compute a current XPath for each node in the recursive XML document; compare the XPath to each of the stored XPaths in the global lookup table; and determine, for all matched XPaths, a corresponding set of arrays comprising tuples of shredded data in the recursive shredding tree. Moreover, the system may further comprise a second mechanism adapted to invoke multiple non-recursive shredding processes based on a content of the mapped recursive XML document.
These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments herein and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.
The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:
The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
As mentioned, there remains a need for a new technique of converting hierarchical data to relational data. The embodiments herein achieve this by providing a method of shredding specific types of XML documents, recursive XML documents. Referring now to the drawings, and more particularly to
Hereinafter the term “hierarchical data” refers to data arranged in a hierarchical format, whereby elements, or nodes, of the data structure are organized in a descending or ascending hierarchy. A hierarchical data structure is typically illustrated using a descending tree structure. The term “relational data” refers to data arranged in a relational format, whereby elements of the data structure are arranged in rows having one or more columns. A relational data structure is typically illustrated using a table structure. The term “mapping” refers to a system for translating data from one data structure to another data structure. A mapping can be a one-to-one mapping, a many-to-one mapping, a one-to-many mapping or a many-to-many mapping. The term “shredding tree” refers to a data structure used to represent a mapping for translating data from a hierarchical data structure to a relational data structure. The term “schema” refers to a hierarchical structure used for defining relationships between elements, or nodes, of the hierarchical data structure and a specific table from the relational structure, wherein no instance data is present in the schema tree. The term “instance” refers to hierarchical data conforming to a hierarchical data structure. The instance tree can be viewed as an instance of the hierarchical data structure.
The embodiments herein provide a technique to convert a recursive XML shredding process to multiple non-recursive XML shredding processes and extend the process described in U.S. Patent Application No. 2004/0220954, the complete disclosure of which, in its entirety, is herein incorporated by reference. The following example is used to describe the embodiments. A recursive XML schema defining a family tree includes an element specified using the recursive XPath //children/male. This XPath can be used to specify multiple chains of father-son relationships. In general, the number of generations in the father-son relationship is unknown. However, for a given family tree, there are only a limited number of generations. Suppose that it is desired to shred these XML documents describing family trees into a relational database management system (RDBMS) database with a table (for example, father_son) with column names given as “father” and “son”. For a family with five generations of father-son relationships, a male's name could appear in both the ‘father’ column and the ‘son’ column. A depth-first tree traversal is performed for the XML document when shredding the document. The shredding marks a male either as a father or as a son at a given moment but not both, which is accomplished by creating five shredding processes. Accordingly, in each process, a male member can appear only as ‘father’ or as ‘son’.
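The family tree example may be sketched as follows, for illustration only. The document layout (a male element carrying a name attribute and a children element), the sample names, and the function shred_father_son are all assumptions introduced here. The sketch groups the (father, son) rows by recursion level, so that within any one level a given male appears only in the father position or only in the son position, which loosely mirrors the one-process-per-level idea described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical family tree instance; the layout is assumed, not taken from the source.
FAMILY = """
<male name="Adam">
  <children>
    <male name="Ben">
      <children>
        <male name="Carl"/>
        <female name="Dana"/>
      </children>
    </male>
    <male name="Evan"/>
  </children>
</male>
"""


def shred_father_son(father, level=1, rows_by_level=None):
    """Depth-first traversal; each recursion level collects its own (father, son) rows."""
    if rows_by_level is None:
        rows_by_level = {}
    for son in father.findall("children/male"):
        rows_by_level.setdefault(level, []).append((father.get("name"), son.get("name")))
        shred_father_son(son, level + 1, rows_by_level)
    return rows_by_level


rows = shred_father_son(ET.fromstring(FAMILY))
for level, pairs in sorted(rows.items()):
    print(level, pairs)
```

In this sample, Ben appears as a son at level 1 and as a father at level 2, but never in both roles within the same level.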
As mentioned, an XML schema or DTD 100 is called recursive if it allows an element to contain another element with the same name as a descendant. An XML document instance 200 conforming to the XML schema or DTD 100 is therefore called a recursive XML document. The embodiments herein represent the possible sequences of these recursive elements in an instance 200 of the recursive XML schema or DTD 100 in an XPath format. A recursive shredding tree 300 defines the mapping 400 from the XML schema 100 to a table 450. The relationship is defined by a set of pairs, each comprising an XPath and a column number 455, 457. Two kinds of nodes are defined for the shredding tree 300: (1) the cursor node 410, 430 corresponding to an element XPath (which could be a recursive XPath); and (2) the data node 420, 440 specifying a data value corresponding to an XPath to an XML attribute value or an XML text node value.
Preferably, there are three types of cursor nodes 410 or 430 for the recursive shredding tree 300. The cursor nodes 410, 430 are totally ordered, in the sense that all cursor nodes are on the same path from the root node 301. The three types of cursor nodes are: (1) a normal cursor node, which is any cursor node that precedes the first recursive cursor node; (2) a recursive cursor node, which is specified by a recursive XPath; and (3) a child cursor node of a recursive cursor node, which is defined with a relative XPath from the recursive cursor node. The mapping 400 of the shredding tree 300 in
A work area is a set of arrays comprising the non-completed records (or tuples) of the shredding data of a shredding tree 300. The work area arrays 610, 620, 630 corresponding to the shredding tree 300 are depicted in
A realized shredding tree is a shredding tree without any recursive cursor node, and is created from the recursive shredding tree 300 by replacing the recursive cursor node XPaths with the absolute path. In this context, an absolute path is a path that starts from the root node 301 and includes only “/” symbols (no “//”). This replacement occurs as follows: the first time a new recursive level is encountered in the XML document 200, a new realized tree 300 corresponding to that recursive level is created by replacing the recursive XPath expression with the current absolute path and any relative XPath expressions with the appropriate absolute XPath (computed by replacing the “.” symbol with the current path). The realized shredding tree 300 has the same identifier as the working area identifier, which enables the matching of a realized shredding tree 300 with its corresponding work area array 610, 620, or 630. There is a one-to-many relationship from the recursive shredding tree to the realized shredding trees. This is in contrast to a non-recursive shredding process, where the original shredding tree is used directly, without the need to create realized shredding trees at system run-time.
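The replacement step may be sketched as follows, strictly as an illustration. Representing a shredding tree as a flat list of cursor XPath strings, the names realize and realized_tree_for, the sample paths, and the use of the absolute path itself as the shared identifier are all assumptions made for this sketch.

```python
def realize(cursor_xpaths, recursive_xpath, current_path):
    """Return a copy of the cursor XPaths with recursive and relative expressions made absolute."""
    realized = []
    for xpath in cursor_xpaths:
        if xpath == recursive_xpath or xpath == ".":
            realized.append(current_path)              # recursive XPath -> current absolute path
        elif xpath.startswith("./"):
            realized.append(current_path + xpath[1:])  # "." replaced with the current path
        else:
            realized.append(xpath)                     # normal cursor XPaths are already absolute
    return realized


# Hypothetical recursive shredding tree: a normal cursor, a recursive cursor, a relative child.
recursive_tree = ["/family", "//children/male", "./@name"]
realized_trees = {}  # identifier (assumed here to be the absolute path) -> realized tree


def realized_tree_for(current_path):
    """Create a realized tree the first time a recursive level is seen, then reuse it."""
    if current_path not in realized_trees:
        realized_trees[current_path] = realize(recursive_tree, "//children/male", current_path)
    return realized_trees[current_path]


print(realized_tree_for("/family/children/male"))
print(realized_tree_for("/family/children/male/children/male"))
```

One recursive tree thus yields one realized tree per recursive level encountered, which is the one-to-many relationship noted above.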
A temporary table is defined based on the number of parameters of the structured query language (SQL) command specified by the action node and the data type of the parameters. The temporary table is a staging area in main memory (not shown) of the system (for example system 700 shown in
The finished records or tuples in the working areas are moved into the temporary table, and wait to be processed by the runtime module (not shown) to update the RDBMS 450 based on the parameterized SQL specified for the temporary table. There is a one-to-one mapping from the temporary table to the recursive shredding tree 300, which facilitates the management of the temporary table because there is a single shredding process that inserts records in a given temporary table.
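The movement of finished records into a staging area and then into the database may be sketched as follows, for illustration only. The sketch uses the sqlite3 module of the Python standard library as a stand-in for the RDBMS 450; the table layout, the INSERT statement, the function flush_finished, and the convention that a record is finished once none of its values is None are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database standing in for the RDBMS
conn.execute("CREATE TABLE father_son (father TEXT, son TEXT)")

# Parameterized SQL: two parameters, so each staged record holds two values.
INSERT_SQL = "INSERT INTO father_son (father, son) VALUES (?, ?)"

work_area = [["Adam", "Ben"], ["Ben", None]]  # the second record still lacks its son
temp_table = []                               # staging area held in main memory


def flush_finished(work_area, temp_table):
    """Move completed records from the work area into the staging list."""
    for record in list(work_area):
        if all(value is not None for value in record):
            temp_table.append(tuple(record))
            work_area.remove(record)


flush_finished(work_area, temp_table)
conn.executemany(INSERT_SQL, temp_table)  # apply the parameterized SQL to every staged record
conn.commit()
temp_table.clear()

print(conn.execute("SELECT father, son FROM father_son").fetchall())
```

The incomplete record remains in the work area until its remaining value arrives.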
In a recursion-detection implementation, given an XML schema or DTD document 100, one can check whether it is recursive by building a directed graph with element names as nodes and arcs from every element node A to every element node B that can appear as a child of A: the schema is recursive if and only if this graph contains cycles. This property enables a DTD parser 703 (of
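The cycle check on the element-name graph may be sketched as follows, as an illustration only. The dictionary representation of the graph, the function is_recursive, and the two sample schemas are hypothetical; a depth-first search with three node states is one common way to detect cycles and is not asserted to be the approach used by the parser.

```python
def is_recursive(children):
    """children maps each element name to the element names that may appear as its children."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current search path, fully explored
    color = {}

    def visit(name):
        color[name] = GRAY
        for child in children.get(name, ()):
            state = color.get(child, WHITE)
            if state == GRAY:
                return True  # arc back to a node on the current path: a cycle exists
            if state == WHITE and visit(child):
                return True
        color[name] = BLACK
        return False

    return any(color.get(name, WHITE) == WHITE and visit(name) for name in children)


# Hypothetical schemas expressed as parent -> possible children.
family_dtd = {"male": ["name", "children"], "children": ["male", "female"], "female": ["name"]}
flat_dtd = {"order": ["item"], "item": ["sku", "qty"]}

print(is_recursive(family_dtd), is_recursive(flat_dtd))
```

In the family example, the arcs male to children and children to male form a cycle, so that schema is reported as recursive, while the flat order schema is not.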
In a preferred data structure implementation, each recursive shredding tree has (1) a hashtable, referred to as the working area hashtable, whose key is the identifier of the working area; and (2) a global lookup table used to map the cursor XPath to the shredding tree nodes.
The embodiments also provide a system 700 for performing a recursive shredding process as is illustrated in
With respect to the runtime methodology module 705 provided by the embodiments herein, the shredding process is defined as a process of extracting portions of an XML document 200 into one or more relational database(s) 450. The process is specified by a set of recursive shredding trees 300. A shredding tree 300 is defined for all the shredding from the XML document 200 to a specific temporary table. A runtime engine (not shown) performs a depth-first tree traversal of the instance tree. During this process, each node of the XML instance 200 is visited. For each node (element, attribute, or text node) of the XML instance 200, the runtime engine computes the current XPath, and compares this XPath to each of the XPaths stored in the global lookup table (not shown). For all of the matched XPaths, the runtime engine finds all of the corresponding working areas for this absolute XPath. If no working area exists for this absolute XPath, a new working area may be created and its identifier stored in the working area hashtable. This enables the efficient lookup of the relevant working area array 610, 620, or 630 in the future (when subsequent elements at the same recursive level are encountered).
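The traversal and lookup described above may be sketched as follows, for illustration only. The lookup table contents, the node labels, the sample document, and the helper names matches and traverse are hypothetical. The matching rule is a deliberate simplification that treats a stored XPath beginning with “//” as a suffix match and any other stored XPath as an exact match; it is not a general XPath evaluator. Only element nodes are visited in this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical global lookup table: cursor XPath -> shredding tree node label.
lookup_table = {"/male": "father cursor", "//children/male": "recursive son cursor"}
work_area_hashtable = {}  # working area identifier (absolute XPath here) -> partial records


def matches(stored, current):
    """Simplified matching: '//' prefix means suffix match, otherwise exact match."""
    if stored.startswith("//"):
        return current.endswith(stored[1:])
    return stored == current


def traverse(node, parent_path=""):
    """Depth-first traversal that computes the current absolute XPath of every element."""
    current = f"{parent_path}/{node.tag}"
    for stored, tree_node in lookup_table.items():
        if matches(stored, current):
            # Create the working area on first encounter, then reuse it at the same level.
            area = work_area_hashtable.setdefault(current, [])
            area.append((tree_node, node.get("name")))
    for child in node:
        traverse(child, current)


doc = ET.fromstring(
    '<male name="Adam"><children>'
    '<male name="Ben"><children><male name="Carl"/></children></male>'
    '<male name="Evan"/>'
    "</children></male>"
)
traverse(doc)
for identifier, records in work_area_hashtable.items():
    print(identifier, records)
```

Ben and Evan sit at the same recursive level, so the second of them reuses the working area that the first one created.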
The embodiments herein can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment including both hardware and software elements. A preferred embodiment is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the embodiments herein can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
A representative hardware environment for practicing the embodiments herein is depicted in
The method may further comprise identifying all recursive cursor nodes 410, 430 and a recursive degree corresponding to the recursive shredding tree 300. Additionally, the method may further comprise mapping recursive elements of the recursive XML document 200 to shredding tree nodes of the recursive shredding tree 300. Preferably, the recursive shredding tree 300 comprises a working area hashtable. Moreover, the method may further comprise storing all XPaths of the recursive shredding tree 300 in a global lookup table; performing a depth-first tree traversal of the recursive shredding tree 300; computing a current XPath for each node in the recursive XML document 200; comparing the XPath to each of the stored XPaths in the global lookup table; and determining, for all matched XPaths, a corresponding set of arrays 610, 620, 630 comprising tuples of shredded data in the recursive shredding tree 300.
The foregoing description of the specific embodiments will so fully reveal the general nature herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the appended claims.
This application is a Continuation of U.S. application Ser. No. 11/303,432 filed Dec. 16, 2005, the complete disclosure of which, in its entirety, is herein incorporated by reference.
Relation | Number | Date | Country
---|---|---|---
Parent | 11303432 | Dec 2005 | US
Child | 12055009 | | US