The present invention relates generally to computer implemented database systems and, more particularly, to a method and system for querying structured documents stored in their native format in a database system.
Structured documents are documents which have nested structures. Documents written in Extensible Markup Language (XML) are structured documents. XML is quickly becoming the standard format for delivering information on the World Wide Web because it allows the user to design a customized markup language for many classes of structured documents. XML supports user-defined tags for better description of nested document structures and associated semantics, and encourages separation of document contents from browser presentation.
As more and more businesses present and exchange data in XML documents, the challenge is to store, search, and retrieve these documents using the existing relational database systems. A relational database management system (RDBMS) is a database management system which uses relational techniques for storing and retrieving data. Relational databases are organized into tables, which consist of rows and columns of data. A database will typically have many tables and each table will typically have multiple rows and columns. The tables are typically stored “on disk,” i.e., on direct access storage devices (DASD), such as magnetic or optical disk drives for semi-permanent storage.
Some relational database systems store an XML document as a BLOB (Binary Large Object) or map the XML data to rows and columns in one or more relational tables. Both of these approaches, however, have serious disadvantages. First, an XML document that is stored as a BLOB must be read and parsed before it can be queried, thereby making querying costly and time consuming. Second, the mapping process is burdensome and inefficient, especially for large XML documents, because mapping XML data to a relational database can result in a large number of columns with null values (which wastes space) or a large number of tables (which is inefficient). Furthermore, by storing an XML document in a relational database, the nested structure of the document is not preserved. Thus, parent-child(ren) relationships are difficult to reconstruct.
Accordingly, there is a need for an improved method and system for querying structured documents stored in their native formats within a database system. The method and system should be integrated (or capable of being integrated) with an existing database system in order to use the existing resources of the database system. The present invention addresses such a need.
The present invention is directed to a method and system for querying a structured document stored in its native format in a database, wherein the structured document comprises a plurality of nodes that form a hierarchical node tree. The method comprises providing at least one child pointer in each of the plurality of nodes, wherein the at least one child pointer points to a corresponding child node of the plurality of nodes and storing a hint in each of the at least one child pointers. The hint is then utilized to navigate the hierarchical node tree during query evaluation.
Through aspects of the present invention, a structured document is parsed and a plurality of nodes is generated to form a hierarchical node tree representing the structured document. The plurality of nodes is stored in one or more records. Each node that has children includes a plurality of child pointers. Stored in each child pointer is a hint related to the child node to which the child pointer points. In a preferred embodiment, the hint is a portion of the child node's name. By storing the hint in the child pointer, a database management system (DBMS) navigating the node tree during query evaluation follows those pointers that contain a hint that matches a query. Pointers that contain a non-matching hint can be skipped. Accordingly, query processing is more efficient.
The present invention relates generally to computer implemented database systems and, more particularly, to an improved method and system for querying structured documents stored in their native format in a database system. The following description is presented to enable one of ordinary skill in the art to make and use the invention and is provided in the context of a patent application and its requirements. Various modifications to the preferred embodiment and the generic principles and features described herein will be readily apparent to those skilled in the art. For example, the following discussion is presented in the context of a DB2® database environment available from IBM® Corporation. It should be understood that the present invention is not limited to DB2 and may be implemented with other relational database systems or with other native XML database systems. Thus, the present invention is to be accorded the widest scope consistent with the principles and features described herein.
According to a preferred embodiment of the present invention, an XML document is represented by a hierarchical node tree comprising a plurality of nodes. The plurality of nodes are stored in one or more records, which in turn are stored on one or more pages. Each node includes a plurality of child pointers that point to the node's child nodes. Stored in each child pointer is a hint about the child node to which the child pointer points. During query evaluation, the execution of an XPath or XQuery expression on an XML document translates into navigating the nodes of the node tree. The hint in the child pointer is used to pre-screen child nodes so that only those children that may be of interest are accessed.
To describe further the present invention, please refer to
The server computer 104 uses a data store interface (not shown) for connecting to the data sources 106. The data store interface may be connected to a database management system (DBMS) 105, which supports access to the data store 106. The DBMS 105 can be a relational database management system (RDBMS), such as the DB2® system developed by IBM Corporation, or it also can be a native XML database system. The interface and DBMS 105 may be located at the server computer 104 or may be located on one or more separate machines. The data sources 106 may be geographically distributed.
The DBMS 105 and the instructions derived therefrom are all comprised of instructions which, when read and executed by the server computer 104, cause the server computer 104 to perform the steps necessary to implement and/or use the present invention. While the preferred embodiment of the present invention is implemented in the DB2® product offered by IBM Corporation, those skilled in the art will recognize that the present invention has application to any DBMS, whether the DBMS 105 is relational or native. Moreover, those skilled in the art will recognize that the exemplary environment illustrated in
According to the preferred embodiment of the present invention, the DBMS 105 includes an XML Storage mechanism 200 that supports the storage of XML documents in their native format on disk. Storing data “on disk” refers to storing data persistently, for example, in the data store 106.
According to the preferred embodiment of the present invention, the node tree 208 preserves the hierarchical structure of the XML document 202 and also preserves the document order, i.e., the order of the nodes. The plurality of nodes forming the node tree 208 is stored in an XML Record 500 in step 310, and each record 500 is, in turn, stored on a page. The XML Record 500 is similar to a standard database record that stores relational data except that the XML Record 500 stores XML data. Storing the plurality of nodes in a record 500 is advantageous because a record 500, like an XML document, is variable in length. Records also can be re-directed, providing a layer of indirection that insulates pointers into a tree (e.g., pointers from within the tree itself, from indices, or from an anchor table, described below) if the record is moved to a different page. Moreover, the infrastructure for fixed page buffer management, recovery, utilities (backup/restore), logging, locking, and replication can be reused.
To explain further the details of the present invention, please refer to
As is shown, each node 508, 508a, 508b comprises an array of child pointers 510. Each child pointer 510 generally points to a node slot 507, which in turn, points to a node, e.g., 508b, corresponding to the child. Thus, for example, in
A node tree 208 representing an XML document 202 is identified by a root node 508, which is the topmost node 508 in the node tree 208. All other nodes 508 in the node tree 208 are children or descendants of the root node 508. The XID of the root node 508 is referred to as a rootID, and comprises the record slot number 505 pointing to the XML Record 500 and the node slot number 507 pointing to the root node 508.
In another preferred embodiment where the DBMS 105 is a relational database management system, the rootID is stored in an anchor table.
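The identifier scheme described above can be illustrated with a minimal sketch. The class name `XID`, the field names, the dictionary standing in for the anchor table, and the document name `orders.xml` are all hypothetical and chosen only for illustration; the description states only that an XID comprises a record slot number and a node slot number, and that the rootID may be stored in an anchor table.

```python
from typing import Dict, NamedTuple


class XID(NamedTuple):
    """Node identifier: which XML Record, and which node slot inside it."""
    record_slot: int  # slot number that locates the XML Record
    node_slot: int    # slot number that locates the node within that record


# Hypothetical anchor table: one entry per stored document, keyed by an
# invented document name, holding the rootID (the XID of the root node).
anchor_table: Dict[str, XID] = {"orders.xml": XID(record_slot=3, node_slot=0)}


def root_id(doc_name: str) -> XID:
    """Look up the rootID of a stored document in the anchor table."""
    return anchor_table[doc_name]
```

Because the rootID is a pair of slot numbers rather than a physical address, the anchor table entry would not need to change if the record holding the root node were moved to a different page.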
Referring again to
Referring now to
As is described above, a node slot entry (e.g., 507b) can point to a node (e.g., 508b) that resides within the XML Record 500a, or to a node (e.g., 508c) that resides in a different XML Record 500b. Accordingly, node slot entries (507) are large because they need to be able to point to nodes (508c) in other XML Records 500b in addition to pointing to nodes (508a, 508b) in the local XML Record 500a. In a preferred embodiment, the entry (e.g., 507a) pointing to a local node (e.g., 508a) is an offset, while the entry (e.g., 507c) pointing to a node (e.g., 508c) in another XML Record 500b is the node's XID. Thus, for example, the entry in node slot 4 (507c) is the XID of Node C, that is, Node C's record slot number and the node slot number.
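The two kinds of node slot entry can be sketched as follows. This is a simplified in-memory model, not the on-disk layout: the `XMLRecord` class, its dictionaries, the offsets, and the `resolve` function are assumptions introduced for illustration. A local entry is modeled as an integer offset, and a remote entry as an XID that is resolved in the record it names.

```python
from dataclasses import dataclass, field
from typing import Dict, NamedTuple, Union


class XID(NamedTuple):
    record_slot: int
    node_slot: int


@dataclass
class XMLRecord:
    # offset within the record -> node payload (a string stands in for a node)
    nodes_by_offset: Dict[int, str] = field(default_factory=dict)
    # node slot number -> local offset (int) or XID of a node in another record
    node_slots: Dict[int, Union[int, XID]] = field(default_factory=dict)


def resolve(records: Dict[int, XMLRecord], record_slot: int, node_slot: int) -> str:
    """Follow a node slot entry to the node it designates."""
    record = records[record_slot]
    entry = record.node_slots[node_slot]
    if isinstance(entry, XID):
        # Remote node: the entry names another record and a slot inside it.
        return resolve(records, entry.record_slot, entry.node_slot)
    # Local node: the entry is an offset into this record.
    return record.nodes_by_offset[entry]


# Record 1 holds Nodes A and B locally; its node slot 4 holds the XID of Node C.
records = {
    1: XMLRecord({0: "Node A", 64: "Node B"}, {1: 0, 2: 64, 4: XID(2, 1)}),
    2: XMLRecord({0: "Node C"}, {1: 0}),
}
```

In this model, resolving a local slot touches one record, whereas resolving node slot 4 of record 1 requires a second record lookup, which corresponds to the more expensive cross-record navigation.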
In the above described example, Nodes A, B and C (508a, 508b, 508c) were distributed over a plurality of XML Records (500a, 500b). In another embodiment, child pointers, e.g., 510a, 510c, of one node, 508a, can be distributed over a plurality of nodes in the same or different XML Records (500a, 500b). This is necessary if the child pointers (510a, 510c) for a node (508a) do not all fit in the node (508a). Referring now to
For example, referring again to
The structure and contents of the node 508 will now be described with reference to
The child pointer section 803 comprises one of at least three formats. According to a preferred embodiment of the present invention, there are at least three classes of children:
In the array of child pointers 510 of the node 508, the order in which children are stored in the node is: “internal” children first, followed by “attribute” children second, and then “ordered” children. This ordering is based on the presumption that the number of internal children is far fewer than the number of attribute children, which in turn is far fewer than the number of ordered children. Thus, child pointers 510 pointing to internal and attribute children will typically be in the main parent node 508, as opposed to a continuation node 514.
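The ordering rule above can be sketched as a stable sort by child class followed by a split at the parent node's capacity. The function name, the tuple representation of a child pointer, and the `capacity` parameter are hypothetical; the description does not specify how many child pointers fit in a node.

```python
from typing import List, Tuple

# Rank of each child class in the child pointer array, per the ordering above.
CLASS_RANK = {"internal": 0, "attribute": 1, "ordered": 2}

ChildRef = Tuple[str, str]  # (child class, child name); an illustrative stand-in


def layout_child_pointers(
    children: List[ChildRef], capacity: int
) -> Tuple[List[ChildRef], List[ChildRef]]:
    """Order child pointers by class and split them between the main parent
    node and a continuation node. `capacity` is an assumed limit."""
    # sorted() is stable, so document order is kept within each class.
    ordered = sorted(children, key=lambda child: CLASS_RANK[child[0]])
    return ordered[:capacity], ordered[capacity:]


children = [
    ("ordered", "title"),
    ("attribute", "id"),
    ("internal", "ns"),
    ("ordered", "body"),
]
main_node, continuation = layout_child_pointers(children, capacity=2)
```

With a capacity of two, the internal and attribute pointers stay in the main parent node and the ordered children overflow to the continuation, which is the outcome the ordering is intended to favor.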
Referring again to
In a preferred embodiment, a hint about the child node (804b) is stored in the child pointer 510a itself to facilitate navigation during query evaluation. As those skilled in the art are aware, when an XML document 202 is stored in its native format in a database, query evaluation, e.g., execution of an XPath or XQuery expression, typically involves navigating the nodes 508 of the XML document 202 to find values that satisfy a query. While navigating between nodes 508 in the same XML Record 500 is relatively inexpensive, the cost (resources and time) of navigating between nodes 508 in different XML Records 500 on different pages 502 is substantial.
The method and system of the present invention addresses this issue. Please refer to
In the preferred embodiment, the hint (804b) is a portion of the child node's name, e.g., the element or attribute name, because the query most likely comprises tag names and namespaces. Because the hint (804b) is a portion of the node name, the DBMS 105 will seek a partial match of the query. If the hint (804b) at least partially matches the query (in step 1010), there is some likelihood that the child node (508b) is of interest, i.e., the child node (508b) may satisfy the query. Accordingly, the DBMS 105 will follow the child pointer 510a to the child node 508b, and perform a full check, i.e., check the node name and namespace to determine if the child node 508b satisfies the query, via step 1012. If the hint (804b) does not match the query, the child is of no interest, and therefore, the DBMS 105 need not follow the pointer 510a.
Thereafter, if the node 508a has more child pointers (step 1014), the DBMS 105 proceeds to the next child pointer in step 1016, and steps 1008 through 1014 are repeated. When all child pointers in a node have been processed, the DBMS 105 navigates to the next node in step 1018, and steps 1006 through 1018 are repeated until the node tree has been traversed.
By utilizing the hint (804b) stored in the child pointer 510a before following the pointer 510a to the child node 508b, the DBMS 105 avoids visiting nodes that cannot satisfy the query. Accordingly, instead of navigating over the entire node tree 208, the DBMS 105 is able to prune branches of no interest and to navigate to those children that match or partially match the query. Thus, navigation is more efficient and significantly faster.
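The hint-guided navigation described above can be sketched as follows, under several stated assumptions: the hint is taken to be the first `HINT_LEN` characters of the child's name (the description says only "a portion" of the name), the query is a simple path of child steps given as a list of names, namespaces are ignored, and the `visited` list exists only to show which child nodes were actually accessed. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

HINT_LEN = 2  # assumed hint size; the description does not fix one


@dataclass
class Node:
    name: str
    children: List["ChildPointer"] = field(default_factory=list)


@dataclass
class ChildPointer:
    hint: str     # portion of the child's name, stored in the pointer itself
    target: Node  # following this may mean reading another record or page


def link(parent: Node, child: Node) -> Node:
    """Attach `child` to `parent`, storing the hint in the child pointer."""
    parent.children.append(ChildPointer(child.name[:HINT_LEN], child))
    return child


def evaluate(node: Node, steps: List[str], visited: List[str]) -> List[Node]:
    """Evaluate a path of child steps (e.g. ["book", "title"]) below `node`."""
    if not steps:
        return [node]
    step, rest = steps[0], steps[1:]
    results: List[Node] = []
    for ptr in node.children:
        if ptr.hint != step[:HINT_LEN]:
            continue                 # non-matching hint: pointer is not followed
        child = ptr.target           # partial match: follow the pointer
        visited.append(child.name)
        if child.name == step:       # full check against the complete name
            results.extend(evaluate(child, rest, visited))
    return results


catalog = Node("catalog")
book = link(catalog, Node("book"))
link(book, Node("title"))
link(book, Node("price"))
magazine = link(catalog, Node("magazine"))
link(magazine, Node("title"))
link(catalog, Node("bookmark"))

visited: List[str] = []
results = evaluate(catalog, ["book", "title"], visited)
```

In this example the `magazine` branch and the `price` node are never accessed because their hints do not match. The `bookmark` node is accessed because its hint partially matches `book`, and it is then rejected by the full check, which illustrates why a matching hint indicates only that a child may be of interest.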
Referring again to
The third format (808) is applied when the child pointer 510 itself fully describes the child and its value. In this case, the data in the “pointer” 510 comprises:
An improved method and system for querying a structured document stored in its native format in a database is disclosed. Through aspects of the present invention, a structured document is parsed and a plurality of nodes is generated to form a hierarchical node tree representing the structured document. The plurality of nodes is stored in one or more records. Each node that has children includes a plurality of child pointers. Stored in each child pointer is a hint related to the child node to which the child pointer points. In a preferred embodiment, the hint is a portion of the child node's name. By storing the hint in the child pointer, a database management system (DBMS) navigating the node tree during query evaluation follows the pointers that contain a hint that matches a query and can skip over those that contain a non-matching hint. Accordingly, query processing is more efficient.
Although the present invention has been described in accordance with the embodiments shown, one of ordinary skill in the art will readily recognize that there could be variations to the embodiments and those variations would be within the spirit and scope of the present invention. Accordingly, many modifications may be made by one of ordinary skill in the art without departing from the spirit and scope of the appended claims.
Number | Date | Country |
---|---|---|
20050050011 A1 | Mar 2005 | US |