1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular to an optimized method and system for decomposing markup based documents, such as XML documents, into a relational database wherein multiple items are decomposed into the same table-column pair.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO).
Extensible Markup Language (XML) is a standard data-formatting mechanism used for representing data on the Internet in a hierarchical data format and for information exchange. An XML document consists of nested element structures, starting with a root element.
Decomposition of an XML document is the process of breaking the document into component pieces and storing those pieces in a database. The specification of the pieces and where they are to be stored is accomplished by means of a mapping document. The mapping document may be in the form of a set of XML schema documents that describe the structure and data types used in conforming XML instance documents. The XML schema documents are augmented with annotations that describe the mapping of XML components to tables/columns in a relational database. Annotations are a feature of XML schema that provides for application-specific information to be supplied to programs processing the schema or instance documents.
At least one conventional decomposition product using the XML schemas is limited because it can only map a single element/attribute item into a table-column pair. The problem is best described by exemplary
The XML document of
The XML Schema associated with XML document of
For the XML document of
While there have been various techniques developed for decomposing and storing of markup based documents, such as XML documents, in a database, there is a need for a simple, optimized method which will allow decomposition of multiple information items from an XML document into the same table-column pair.
The foregoing and other objects, features, and advantages of the present invention will be apparent from the following detailed description of the preferred embodiments, which makes reference to several drawing figures.
One preferred embodiment of the present invention is a method for decomposing and storing the content of a markup based document into a relational database. For a schema of a markup based document, a user identifies multiple items mapping into the same database table-column pair and associates a rowset with each such item and with a corresponding database table. Next, the user creates a mapping document based on the schema of the markup based document, with rowset-specific mapping annotations defining the mapping of the items into columns of the rowsets. Decomposition of each item into a corresponding rowset column is accomplished by collecting the item content from the markup based document and storing it in the corresponding rowset column, for later storage in a database table.
Another preferred embodiment of the present invention is a system implementing the above-mentioned method embodiment of the present invention.
Yet another preferred embodiment of the present invention includes a computer usable medium tangibly embodying a program of instructions executable by the computer to perform method steps of the above-mentioned method embodiment of the present invention.
Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
In the following description of the preferred embodiments, reference is made to the accompanying drawings, which form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural and functional changes may be made without departing from the scope of the present invention.
The present invention discloses a system, a method and a computer usable medium, tangibly embodying a program of instructions executable by the computer to perform method steps of the present invention, for decomposing and storing markup based documents, such as Extensible Markup Language (XML) documents, in a relational database, where multiple information items from the XML document are decomposed into the same table-column pair. The method and system of the present invention may be used in a distributed computing environment in which two or more computer systems are connected by a network, such as the World Wide Web, including environments in which the networked computers are of different types.
The preferred method embodiment of the present invention decomposes XML documents into a database. Decomposition of an XML document is the process of breaking the document into component pieces and storing those pieces in a database. The specification of the pieces and where they are to be stored is accomplished by means of a mapping document. The mapping document is a set of XML schema documents that describe the structure of conforming XML instance documents. The XML schemas are augmented with annotations that describe the mapping of XML components to tables/columns in a relational database. Annotations provide application-specific information to programs processing the schema or instance documents. In the present invention, items are decomposed into rowsets rather than tables. A rowset is a group of related information items that form a meaningful row. Each database table can have one or more rowsets associated with it.
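By way of illustration only, the relationship between rowsets and tables described above may be sketched as a small data structure. The sketch is not the implementation of the preferred embodiments. The class name Rowset, the helper rows_for_table, the column names "id" and "city", and the sample values are all hypothetical and are introduced here only for the example; the rowset and table names are borrowed from the description.

```python
from dataclasses import dataclass, field


@dataclass
class Rowset:
    """A named group of related items that together form one row (hypothetical class)."""
    name: str    # rowset name used in the mapping annotations
    table: str   # physical database table that the rowset feeds
    rows: list = field(default_factory=list)

    def add_row(self, **columns):
        # Buffer one row as a mapping of column name to item content.
        self.rows.append(dict(columns))


def rows_for_table(table, rowsets):
    """Return the union of the buffered rows of every rowset tied to a table."""
    return [row for rs in rowsets if rs.table == table for row in rs.rows]


# Two rowsets feed the same table, so two different XML items can both
# land in the column "city" of the table "branches" (hypothetical columns).
main = Rowset("USAMainBranches", "branches")
sub = Rowset("USASubBranches", "branches")
main.add_row(id="1", city="San Jose")
sub.add_row(id="2", city="Austin")

branch_rows = rows_for_table("branches", [main, sub])
```

Because each rowset keeps its own row buffer, items that share a table-column pair never compete for the same row; the rows are combined only when the table is finally populated.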
For the example of
Since the rowset information can be captured through annotations and the processing of rowsets is intrinsic to the decomposition algorithm, the method and system of the present invention do not require creation and maintenance of any new database objects, and therefore require no attention from database administrators. Moreover, the application developer has full control over the names of the rowsets and can change them at any time, which provides flexibility and control.
The preferred aspects of the present invention use an XML Schema annotated with rowset-specific mapping information, provided in an annotation “table”. The annotation captures the name of the physical table in the database, the associated relational schema and one or more rowset names by which the table will be known in the annotated XML schema. The rowset names must be unique across the entire XML schema. For this reason, the annotation is treated as a global annotation, that is, as a child annotation of the element “schema”.
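The uniqueness requirement on rowset names may be illustrated with a minimal sketch, assuming the content of each “table” annotation has already been read into memory. The class TableAnnotation, the function check_rowset_names, and the relational schema name "ADMIN" are hypothetical and do not reflect the syntax of the annotation itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableAnnotation:
    """In-memory stand-in for one "table" annotation (hypothetical class)."""
    table: str      # name of the physical table in the database
    schema: str     # associated relational schema
    rowsets: tuple  # rowset names by which the table is known


def check_rowset_names(annotations):
    """Map each rowset name to its annotation, rejecting duplicate names."""
    seen = {}
    for ann in annotations:
        for name in ann.rowsets:
            if name in seen:
                # Rowset names must be unique across the entire XML schema.
                raise ValueError(f"rowset {name!r} is declared more than once")
            seen[name] = ann
    return seen


# "ADMIN" is an assumed schema name used only for this example.
branches = TableAnnotation(
    "branches", "ADMIN",
    ("USAMainBranches", "USASubBranches", "NonUSABranches"),
)
index = check_rowset_names([branches])
```

With the names validated once at the schema level, an element or attribute declaration can refer to a rowset by name alone and the owning table can be looked up unambiguously.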
To achieve the results described above, a user must annotate the original XML Schema of
As it can be seen from the annotated XML Schema, the user must first define the rowsets for the table “branches” using the annotation “table”. Three rowsets, namely, USAMainBranches, USASubBranches and NonUSABranches, are defined in
The annotated XML Schema of
The method of the present invention performs collection of element content and storage in the appropriate rowset according to the mapping information. In the example of
Flowchart of a computer-based method for decomposing and storing of a markup based document into a relational database, performed according to the preferred embodiments of the present invention, is illustrated in
Step 706 performs decomposition of each markup based document item into a corresponding rowset column, by calling a decomposition utility and inputting to it the annotated mapping document and the instance markup based document. The decomposition utility parses the markup based document and collects each item's content. It finds the item mapping information in the element/attribute declaration in the mapping document, which includes the rowset and column names. Item content is inserted into the corresponding column of the rowset's row buffer, for later storage in the corresponding database table row. Parsing of the markup based document continues until all items that have mappings are found and placed in the columns of the corresponding rowsets' row buffers, in step 708. At the end of decomposition, a union of all the rowsets associated with each database table is created, and a database table corresponding to each rowset is found in step 710. In step 712, all rowsets' row buffers are sent to the DBMS for insertion into or update of the corresponding database tables.
The processor 104 is connected to one or more electronic storage devices 106, such as disk drives, that store one or more relational databases 107. The storage devices may comprise, for example, optical disk drives, magnetic tapes and/or semiconductor memory. Each storage device permits receipt of a program storage device, such as a magnetic media diskette, magnetic tape, optical disk, semiconductor memory or other machine-readable storage device, and allows for method program steps recorded on the program storage device to be read and transferred into the computer memory. The recorded program instructions may include the code for the method embodiment of the present invention. Alternatively, the program steps can be received into the operating memory from a computer over the network.
Operators of the console terminal 108 use a standard operator terminal interface (not shown) to transmit electrical signals to and from the console 102 that represent commands for performing various tasks, such as search and retrieval functions, termed queries, against the databases 107 stored on the electronic storage device 106. In the present invention, these queries conform to the Structured Query Language (SQL) standard, and invoke functions performed by a DataBase Management System (DBMS) 112, such as Relational DataBase Management System (RDBMS) software. In the preferred embodiments of the present invention, the RDBMS software is the DB2 product, offered by IBM for the AS400, OS390 or OS/2 operating systems, the Microsoft Windows operating systems, or any of the UNIX-based operating systems supported by DB2. Those skilled in the art will recognize, however, that the present invention has application to any RDBMS software that uses SQL, and may similarly be applied to non-SQL queries.
Although the description of the preferred embodiments of the present invention was based on XML documents, the present invention is applicable to other types of markup based documents. It is presently being implemented in the DB2 V9 product, which can support rowsets and annotated XML schemas. However, it is usable by end users of any DBMS product providing XML support, for processing and decomposition of XML documents. It will preferably be used for developing applications for DB2 machines. However, the technology may be applied to any other database manager product, such as Oracle, Informix, Sybase, SQL Anywhere, and Microsoft SQL Server, and to other relational products.
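The steps above (parsing, collection into row buffers, union of rowsets per table, and insertion) may be sketched as follows. This is a minimal illustration under stated assumptions, not the decomposition utility of the preferred embodiments. The instance document, the element and attribute names, the MAPPING dictionary standing in for the annotated schema, the column names "id" and "city", and the use of an in-memory SQLite database in place of the DBMS are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical instance document; it does not reproduce the drawings.
DOC = """
<company>
  <usa>
    <main id="1"><city>San Jose</city></main>
    <sub id="2"><city>Austin</city></sub>
  </usa>
  <other>
    <branch id="3"><city>Toronto</city></branch>
  </other>
</company>
"""

# Stand-in for the annotated schema: element path -> (rowset, {item: column}).
# "@id" denotes an attribute item; "city" denotes a child element item.
MAPPING = {
    "usa/main": ("USAMainBranches", {"@id": "id", "city": "city"}),
    "usa/sub": ("USASubBranches", {"@id": "id", "city": "city"}),
    "other/branch": ("NonUSABranches", {"@id": "id", "city": "city"}),
}
# Every rowset in this example is associated with the table "branches".
ROWSET_TABLE = {rowset: "branches" for rowset, _ in MAPPING.values()}


def decompose(xml_text):
    """Parse the document and collect mapped item content into row buffers."""
    root = ET.fromstring(xml_text)
    buffers = {rowset: [] for rowset in ROWSET_TABLE}
    for path, (rowset, items) in MAPPING.items():
        for elem in root.findall(path):
            row = {}
            for item, column in items.items():
                if item.startswith("@"):
                    row[column] = elem.get(item[1:])    # attribute content
                else:
                    row[column] = elem.findtext(item)   # child element content
            buffers[rowset].append(row)
    return buffers


def store(buffers, conn):
    """Union the rowsets of each table, then send the rows to the database."""
    by_table = {}
    for rowset, rows in buffers.items():
        by_table.setdefault(ROWSET_TABLE[rowset], []).extend(rows)
    for table, rows in by_table.items():
        # The table name comes from the trusted mapping, not from user input.
        conn.executemany(
            f"INSERT INTO {table} (id, city) VALUES (:id, :city)", rows
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE branches (id TEXT, city TEXT)")
store(decompose(DOC), conn)
stored = conn.execute("SELECT id, city FROM branches ORDER BY id").fetchall()
```

In this sketch three different elements contribute to the same table-column pair ("branches", "city"), each through its own rowset, and the rows are merged only in the store step.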
The foregoing description of the preferred embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.