The present invention relates generally to Extensible Markup Language (XML) queries. More specifically, the present invention is related to a method for extracting tuple data from streaming, hierarchical XML data.
Querying streaming XML data has become an important task executed by modern information processing systems. XML queries specify patterns of selection predicates on multiple elements having structural relationships such as parent-child and ancestor-descendant. Streaming XML data arrives in an orderly format, typically as a sequence of Simple API for XML (SAX) events or elements, where a SAX event or element may include a start element (SE), attributes, an end element (EE), and text. For example, if an XML data tree 11, in
In contrast to XML data that is parsed and stored in databases, streaming XML data can be most efficiently processed by consuming such SAX events without reliance on extensive buffering for storage of parsed data. Streaming XML data can be modeled as a tree, where nodes represent elements, attributes and text data, and parent-child pairs represent nestings between XML element nodes. XML data tree nodes are often encoded with positional information for efficient evaluation of their positional relationships. A core operation in XML query processing is locating all occurrences of a twig pattern, that is, a small tree pattern with elements and string values as nodes.
In mapping-based XML transformations, it is a common requirement that mapped values be extracted from streaming XML data sources. For example, tuple extraction is shown to be a core operation for data transformation in schema-mapping systems. XML tuple-extraction queries may comprise XML pattern queries with multiple extraction nodes. A tuple-extraction query can be represented as a labeled query tree with one or multiple extraction nodes. As used herein, a query tree node may be referred to as a ‘query node’ or a ‘QNode.’ The extracted values may be in the form of ‘flat tuples’ (i.e., data formatted into rows), which are then transformed to the target based on a mapping specification. However, tuple extraction may be a computationally-expensive operation in the integrated processing of XML data and relational data. For example, subsequent to the extraction of a tuple data stream from an XML data source, the tuple data stream may be sent to a relational operator for further processing, such as joining with other relational tables.
Recent efforts to improve streaming XML processing have produced XML filtering methods, such as XFilter, or have taken the approach of intentionally limiting XML processing operations to single extraction nodes by not including multiple extraction nodes. One method has utilized an algorithm known as ‘TurboXPath’ for tuple extraction from streaming XML data, but the application of TurboXPath has resulted in exponentially-increasing complexity when dealing with recursions. Moreover, although most Extensible Stylesheet Language Transformations (XSLT) and XQuery engines can support tuple-extraction queries, most XSLT/XQuery engines do not provide satisfactory performance as a consequence of efficiency and scalability problems. These efforts have, accordingly, produced limited results in attempting to provide efficient algorithms for tuple extraction.
As used herein, a virgule, or single forward slash, ‘/’ represents a parent-child relationship between a QNode and its parent, a double virgule ‘//’ represents an ancestor-descendant relationship, and a pound symbol ‘#’ represents an extraction node. Generally, a full match of a tuple-extraction pattern Q in an XML database D, modeled as a tree, may be identified by a mapping from nodes in Q to nodes in D, such that: (i) QNode predicates, if any, are satisfied by the corresponding database D nodes; and (ii) the ancestor-descendant structural relationships or the parent-child structural relationships between QNodes are satisfied by the corresponding database D nodes.
The full match of the tuple-extraction pattern Q can be represented as an n-ary relation, where each tuple (e1; e2; . . . ; en) comprises database D nodes. For the extraction nodes in the tuple-extraction pattern Q, corresponding text values are associated with the matched element nodes. The answer to a tuple-extraction query thus comprises the set of full-match tuples projected onto the extraction nodes.
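The projection of full matches onto extraction nodes may be illustrated with a minimal, non-streaming Python sketch. It assumes a pattern of the form /dblp/inproceedings [title# and author# and year#]; the sample data and the function name `extract_tuples` are hypothetical and are not taken from the figures. The sketch shows only the expected answer of such a query, not the streaming method described herein.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical sample data, used for illustration only.
DOC = """
<dblp>
  <inproceedings>
    <title>T1</title>
    <author>A1</author>
    <author>A2</author>
    <year>2007</year>
  </inproceedings>
  <inproceedings>
    <title>T2</title>
    <author>A3</author>
  </inproceedings>
</dblp>
""".strip()


def extract_tuples(root, record_tag, extraction_tags):
    """Return full matches projected onto the extraction nodes.

    Each child of the root named record_tag is a candidate match; it
    contributes one tuple per combination of its extraction-node children.
    A record missing any extraction node yields no tuple at all.
    """
    tuples = []
    for record in root.findall(record_tag):  # parent-child step
        value_lists = [
            [child.text for child in record.findall(tag)]
            for tag in extraction_tags
        ]
        tuples.extend(itertools.product(*value_lists))
    return tuples


root = ET.fromstring(DOC)
result = extract_tuples(root, "inproceedings", ["title", "author", "year"])
for row in result:
    print(row)
```

In this sketch the second record produces no tuple because it has no year element, so the year branch of the pattern cannot be satisfied.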
A second tuple-extraction pattern 21, in
/dblp/inproceedings [title# and author# and year#]
For example, given the XML data tree 13 in
U.S. Pat. No. 7,219,091 “Method and system for pattern matching having holistic twig joins” discloses holistic twig joins as a method for improving the matching of XML patterns over XML data stored in databases. The holistic twig join method reads the entire XML data input and uses a chain of linked stacks to compactly represent partial results for root-to-leaf query paths. The query paths are composed to obtain matches for a twig pattern that may use ancestor-descendant relationships between elements. However, the method practiced in the reference assumes that the XML data has been parsed and has been encoded with region codes prior to pattern matching. A holistic twig-join algorithm is described, the algorithm designed to avoid irrelevant intermediate results and to achieve optimal worst-case I/O and CPU cost (i.e., a cost that is a linear function of the total size of input and output data).
Operation of the holistic twig-joining algorithm may be explained by reference to the XML data tree 13, to a query 23, shown in
It can thus be appreciated by one skilled in the art that use of a holistic twig-joining algorithm is not directly applicable to the extraction of tuple data from streaming, hierarchical XML data, because the algorithm requires valid cursor elements to begin execution. Additionally, such holistic cursors are “uncoordinated,” wherein each cursor aggressively searches for its next element without considering other cursors.
Another problem arises in that holistic twig-joining procedures typically require encoded XML element lists for operation, and thus may not operate on streaming XML data lists. However, it is not practical to adapt the holistic twig-joining algorithm to handle streaming XML by parsing the incoming XML data, storing the parsed XML data in temporary files, and then running the algorithm. This parsing method may cause unnecessary inputs/outputs (I/Os) because all the incoming data needs to be stored and then read back to run the holistic twig-joining algorithm. Additionally, the parsing method would require an impractically-large temporary storage device to handle the continuous streaming XML data.
From the above, it is clear that there is a need for an efficient and scalable method of extracting tuple data from streaming, hierarchical XML data without the need for parsing and storing large amounts of data.
In one aspect of the present invention, a method for querying streaming extensible markup language data comprises: routing elements to query nodes, the elements derived from the streaming extensible markup language data; filtering out elements not conforming to one or more predetermined path query patterns; adding remaining elements to one or more dynamic element lists; accessing a decision table to select and return a query node related to a cursor element from the dynamic element list; and processing the cursor element related to the returned query node to produce an extracted tuple output.
In another aspect of the present invention, a method for conducting a query to extract tuple data from a data warehouse database comprises: parsing data from the data warehouse database into a plurality of simple application program interface for extensible markup language (SAX) elements; discarding selected SAX elements, the selected SAX elements not conforming to path query patterns based on the query, the path query patterns ending at one or more query nodes corresponding to the SAX elements; appending at least one SAX element to a tail of a dynamic element list; returning a query node related to a cursor in the dynamic element list; and processing the cursor element via a process of holistic twig join matching.
In another aspect of the present invention, an apparatus for executing a query plan comprises: a data storage device; a computer program product in a computer useable medium including a computer readable program, wherein the computer readable program when executed on the apparatus causes the apparatus to: access an extensible markup language data parser to parse data from the data storage device into a plurality of elements; route the elements to query nodes; add the elements conforming to a query plan pattern to a dynamic element list; access a decision table to obtain a query node related to a cursor element from the dynamic element list; and process the cursor element to produce an extracted tuple output.
These and other features, aspects and advantages of the present invention will become better understood with reference to the following drawings, description and claims.
The following detailed description is of the best currently contemplated modes of carrying out the invention. The description is not to be taken in a limiting sense, but is made merely for the purpose of illustrating the general principles of the invention, since the scope of the invention is best defined by the appended claims.
As can be appreciated by one skilled in the art, many organizations and other repositories store data in XML format. Such data may include, for example, media articles, technical papers, Internet web documents, commodity purchase orders, product catalogs, client support documentation, and archived commercial transactions. The process of searching large data files, such as catalogs and lengthy articles, may require parsing of a document and performing a search for particular keywords or key phrases. Accordingly, the present invention generally provides a method for extracting tuple data from streaming, hierarchical XML data as may be adapted to information processing systems, where the parsing process and the algorithms may be implemented using C++.
The disclosed method and apparatus may include a block-and-trigger mechanism applied during holistic matching of XML patterns over XML data such that incoming XML data is consumed in a best-effort fashion without compromising the optimality of holistic matching, and such that cursors are coordinated. The blocking mechanism causes some incoming data to be buffered, but the disclosed method produces a ‘peak’ demand for buffer space that is smaller than buffer space required when parsing and storing the XML data in order to be able to execute a holistic twig-join algorithm, as may be found in conventional systems.
In an optional embodiment of the present invention, a pruning technique may be deployed to further reduce the buffer sizes in comparison to a process not using a pruning technique. In particular, a query-path pruning technique may function to ensure that each buffered XML element satisfies its query path. Additionally, an existential-match pruning technique may function to ensure that only those XML elements that participate in final results are buffered, so as to reduce memory or storage requirements, in comparison to the prior art.
As shown in
At any point during the matching of XML patterns over XML data, one or more cursors may be associated with an element list that has become empty, causing the respective cursor to be blocked. In response, the method of the present invention may function to continue processing the XML query and emitting results by matching XML patterns over XML data with other, non-blocked cursors. This serves to continue the process of consuming incoming elements, and thus reduces the need for additional buffering in comparison to conventional methods, thereby improving the response of the tuple-extraction query.
The StreamTX computer process 61 may further utilize special data structures to support the processing of streaming XML data. For example, dynamic element queues may be maintained in place of static input lists for QNodes. The use of dynamic element queues may enable an XML element queue to grow at the “tail” as new XML elements arrive in the form of SE events, and may provide for the XML element queue to shrink after a “head” element has been processed. In addition, the cursor on an element queue may be configured to either: (i) point to a valid XML element in the queue, or (ii) assume a blocked state when the XML element queue is empty.
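A dynamic element queue of this kind may be sketched as follows. This is a minimal illustration under assumed names (`ElementQueue`, `cursor`, `advance`, `blocked`), none of which come from the disclosed code; it shows only growth at the tail, shrinkage at the head, and the blocked cursor state.

```python
from collections import deque


class ElementQueue:
    """Hypothetical dynamic element queue for one QNode."""

    def __init__(self):
        self._items = deque()

    def append(self, element):
        """Grow the queue at the tail when a new SE event arrives."""
        self._items.append(element)

    @property
    def blocked(self):
        """The cursor is blocked exactly when the queue is empty."""
        return not self._items

    def cursor(self):
        """Return the head element, or None when the cursor is blocked."""
        return self._items[0] if self._items else None

    def advance(self):
        """Discard the processed head element, shrinking the queue."""
        if self._items:
            self._items.popleft()


queue = ElementQueue()
print(queue.blocked)      # the queue starts empty, so the cursor is blocked
queue.append("a1")
print(queue.cursor())     # the cursor now points to the head element
queue.advance()
print(queue.blocked)      # blocked again after the only element is consumed
```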
If the XML data is not in the form of SAX events, an SAX parser may be used on the incoming XML data. XML elements whose ‘EE’ events have not arrived have open-end values. As can be appreciated by one skilled in the art, ancestor-descendant and parent-child relationships may be evaluated with open-ended region codes. Given two XML elements ‘u’ and ‘v’, if element ‘u’ is open-ended, then ‘u’ is an ancestor element of ‘v’ if u.start<v.start. If ‘u’ is not open-ended, then ‘u’ is an ancestor element of the element ‘v’ if u.start<v.start<u.end. The open-ended region code of an XML element may be completed when the ‘EE’ event for the open-ended element has arrived.
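The ancestor test with open-ended region codes may be sketched as below. The class name `Region`, the sentinel `OPEN_END`, and the level-based parent test are assumptions made for illustration; only the two comparison rules are taken from the description above.

```python
import math
from dataclasses import dataclass

# Assumed sentinel for an end value that has not yet arrived.
OPEN_END = math.inf


@dataclass
class Region:
    start: int
    end: float = OPEN_END  # open-ended until the EE event arrives
    level: int = 0


def is_ancestor(u, v):
    """Return True if element u is an ancestor of element v."""
    if u.end == OPEN_END:
        # u is still open, so every later start falls inside it.
        return u.start < v.start
    return u.start < v.start < u.end


def is_parent(u, v):
    """Parent-child test, assuming levels increase by one per nesting."""
    return is_ancestor(u, v) and v.level == u.level + 1


u = Region(start=1, level=1)          # SE(u) seen, EE(u) not yet
v = Region(start=2, end=3, level=2)
print(is_ancestor(u, v), is_parent(u, v))

u.end = 4                             # EE(u) arrives and completes the code
w = Region(start=5, end=6, level=1)
print(is_ancestor(u, v), is_ancestor(u, w))
```

Because the sentinel compares greater than any completed end value, the closed-form test `u.start < v.start < u.end` would also give the same answer for an open-ended `u` in this sketch.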
The code 69 for the core subroutine 65, ‘GetNextStream’, shown in
As provided for by code line five, the core subroutine 65 addresses the case where a returned QNode is a blocked QNode. If a subtree ‘qi’ is blocked, this does not necessarily mean that ‘Cqi’ is blocked—the blocking could be caused by a blocked cursor in the subtree ‘qi’. The initial part of the core subroutine 65, up to code line five, associates each of the child subtrees ‘qi’ with its ‘GetNextStream(qi)’ value ‘q′i’, which can be either a blocked QNode or the same as ‘qi’ which has a ‘solution extension.’ As understood in the relevant art, the node ‘qi’ has a solution extension if there is a solution for a sub query rooted at ‘qi’ composed entirely of the cursor elements of the query nodes in the sub query. The latter part of the core subroutine 65, beginning with code line eight, functions to coordinate QNodes. The start and end values of a blocked cursor, and the end value of an open-ended region code may be specified to be a predetermined constant having a value larger than the start and end values of any completed region code. This specified requirement serves to assure that an open-ended region covers all subsequent incoming elements.
The function arg minq′
Subsequent action may be taken, in code line thirteen, in accordance with criteria summarized in a decision table 71, shown in
An XML data tree 75, in
Elements in the XML data tree 75 have been assigned region codes and have been sorted according to their ‘start’ attributes in each list. Note that the elements for extraction QNodes (such as ‘qb’ and ‘qc’) are also associated with text values. There may be a cursor, denoted as ‘Cq’, for each QNode ‘q’. Each QNode cursor ‘Cq’ may point to an element in the corresponding input list of ‘q’. Accordingly, both the term ‘Cq’ and the term ‘element Cq’ are used herein to mean the element to which the cursor ‘Cq’ points. The region code of the cursor element may be accessed by invoking ‘Cq→start’, ‘Cq→end’, and ‘Cq→level’. The operation ‘Cq→advance( )’ can be invoked to forward the cursor to the next element in the list for the QNode ‘q’.
Running statistics for the XML data tree 75 and the data and query 77 are shown in a table 81 in
After each SAX event, the core subroutine 65 ‘GetNextStream(qa)’ may be called by the main process 63. Post-SAX event running statistics may be found in a table 83 in
When the event ‘SE(c1)’ occurs, all three cursors ‘Cqa’, ‘Cqb’, and ‘Cqc’ may be holding valid elements â2, b3, and ĉ1 respectively. The main process 63 may call the core subroutine 65 three times to consume the elements â2, b3, and ĉ1. It should be understood that the QNodes corresponding to the elements â2, b3, and ĉ1 are returned by cases ‘c8’, ‘c4’, and ‘c3’, respectively, in the table 71. This example shows that the main process 63 functions to consume incoming SAX events “greedily” based on the decision table 71, so that any buffer required to hold parsed elements may be kept as small as possible. In particular, the maximum length for the element queue of QNode ‘qa’ is ‘one’, although there are two a-elements in total. In contrast, conventional methods require that both a-elements be cached.
The core subroutine 65 may also function to ensure that elements are consumed with best efforts, without compromising the optimality of holistic twig joins. However, because holistic matching is a conservative approach in the action of blocking matching until a solution extension is found, undesirable element queues may result even with the process of waiting for blocked cursors, as described above. Accordingly, the disclosed method may include either or both of two pruning techniques, described below, to minimize the sizes of buffered element queues. It should be understood that, when a start-element event arrives, all ancestor elements of the start-element have also arrived, and that, when an end-element event arrives, all the descendant elements of the end-element have arrived.
Accordingly, when a start-element event occurs, the incoming element in the dynamic element list may be checked to determine whether there are corresponding ancestor elements to satisfy the query path. A query path is defined as a path from the root QNode to the QNode corresponding to the element in question. For example, for the QNode ‘qb’ in the query and input lists 77, the QNode query path is ‘//a/b #’. If the element being checked, such as an SAX element, does not satisfy any of one or more query path patterns ending at one or more query nodes corresponding to the element in question, the element can be discarded. This first pruning technique is denoted herein as ‘query-path pruning.’
Query-path pruning may be explained with reference to the table 83, in which both b-elements are buffered. By inspection it can be seen that, when the event ‘SE(b2)’ arrives, the element ‘b2’ does not have a parent a-element. This can be determined immediately because all the start-element events of the ancestors of ‘b2’ have already arrived when the event ‘SE(b2)’ arrives, so that judgment may be made from these arrived ancestor elements, if any. In this particular example, the only ancestor element is ‘a1’, which is not a parent element of ‘b2’. As a result, the element ‘b2’ can be discarded and not added to the element queue of the QNode ‘qb’.
Although the query-path pruning technique may check only the ancestor-descendant or parent-child relationship between an incoming element and the parent element queue of the incoming element, the incoming element may be checked to determine if there is a match for the query path from the root QNode to the QNode where the incoming element belongs. The query-path pruning technique can be implemented such that the cost of a match-test for each incoming element has a substantially constant value.
As can be appreciated by one skilled in the art, given a new incoming open-ended element ‘e’ to QNode ‘q’, ancestors of the open-ended element in the element queue of ‘parent(q)’ may likewise be concurrently open-ended elements and, moreover, the ancestor elements may be nested within each other. As a result, a stack of open-ended elements may be maintained for each element queue. An open-ended element may be removed from the stack upon the arrival of a corresponding ‘EE’ event. The top element of a stack maintained for an element queue of ‘parent(q)’ may be checked to determine whether the corresponding element has a parent or ancestor element in the element queue of ‘parent(q)’. It can further be appreciated that the process of query-path pruning ensures that each open or closed element ‘e’ buffered in element queues satisfies a corresponding query path. That is, there exist ancestor elements a1, a2, . . . an such that the element path a1→a2→ . . . →an→e satisfies the corresponding query path.
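Query-path pruning with a stack of open-ended elements may be sketched for the query path ‘//a/b’ as follows. The event encoding, the hypothetical element names, and the function name `query_path_prune` are assumptions made for illustration. The sketch checks only the top of the stack, so the cost per incoming element is constant.

```python
def query_path_prune(events):
    """Keep only b-elements whose parent is an a-element (path //a/b).

    events is a list of ("SE", name) and ("EE", name) pairs, where the
    first character of each name is the element tag (an assumed encoding).
    """
    depth = 0
    open_a_levels = []  # levels of a-elements whose EE has not yet arrived
    kept = []
    for kind, name in events:
        tag = name[0]
        if kind == "SE":
            depth += 1
            if tag == "a":
                open_a_levels.append(depth)
            elif tag == "b":
                # The innermost open a-element is on top of the stack; it is
                # the parent only if it sits exactly one level above.
                if open_a_levels and open_a_levels[-1] == depth - 1:
                    kept.append(name)
                # Otherwise the b-element is discarded and never buffered.
        else:  # "EE"
            if tag == "a":
                open_a_levels.pop()
            depth -= 1
    return kept


# Hypothetical stream for <a><x><b/></x><b/></a>
events = [
    ("SE", "a1"), ("SE", "x1"), ("SE", "b2"), ("EE", "b2"), ("EE", "x1"),
    ("SE", "b3"), ("EE", "b3"), ("EE", "a1"),
]
kept = query_path_prune(events)
print(kept)
```

In this hypothetical stream, ‘b2’ has the ancestor ‘a1’ but not as a parent, so only ‘b3’ is buffered.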
Additionally, when an end-element event occurs, and if the corresponding element does not have descendant elements to make up a match for the subtree, the element itself can be pruned, as well as the corresponding descendant elements in the element queues. A second pruning technique, denoted herein as ‘existential-match pruning,’ is based on the criterion that there exists at least one subtree match for the closing element. It can be appreciated by one skilled in the art that there may be no need to instantiate all matching instances for the closing element to implement existential-match pruning.
A matching flag may be used for each non-leaf open-ended element in element queues to enable the existential-match pruning. The matching flag may be a Boolean value indicating whether the element has matching descendant elements according to the query pattern. To maintain the matching flag, the flags of all the open-ended elements along the query path may be updated whenever the ‘SE’ of a leaf QNode arrives.
To show that existential-match pruning can help reduce element buffer size, consider an incoming XML as a path with three elements: ‘a1→a2→b1’, where ‘a1’ comprises a root element and ‘b1’ comprises the leaf element, and consider the query ‘//a[b#]//c#’, denoted as query 77 in
It should be understood that cascaded pruning of descendant elements may be applied when the descendant elements do not match other valid ancestor/parent elements. Additionally, if cascaded pruning is applied, existential-match pruning may also be executed as pruned descendant elements may be clustered at the tails of corresponding element queues. The existential-match pruning technique functions to ensure that all the closed elements buffered in the queues participate in final results of tuple extraction.
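The use of a matching flag may be sketched as follows for a simplified two-node pattern ‘//a[b]’, in which ‘b’ is a child of ‘a’. This simplified pattern, the event encoding, and the function name `existential_match_prune` are assumptions made for illustration; the sketch omits cascaded pruning of descendant elements.

```python
def existential_match_prune(events):
    """Buffer a-elements and drop those that close without a b-child.

    events is a list of ("SE", name) and ("EE", name) pairs, where the
    first character of each name is the element tag (an assumed encoding).
    """
    depth = 0
    open_a = []   # stack of open-ended a-elements, innermost on top
    queue_a = []  # buffered a-elements, in arrival order
    for kind, name in events:
        tag = name[0]
        if kind == "SE":
            depth += 1
            if tag == "a":
                entry = {"name": name, "level": depth, "matched": False}
                open_a.append(entry)
                queue_a.append(entry)
            elif tag == "b" and open_a and open_a[-1]["level"] == depth - 1:
                # SE of the leaf QNode: set the flag of the open parent.
                open_a[-1]["matched"] = True
        else:  # "EE"
            if tag == "a":
                entry = open_a.pop()
                if not entry["matched"]:
                    # No subtree match exists, so the element is pruned.
                    queue_a.remove(entry)
            depth -= 1
    return [entry["name"] for entry in queue_a]


# The path a1 -> a2 -> b1, written as a stream of events.
events = [
    ("SE", "a1"), ("SE", "a2"), ("SE", "b1"),
    ("EE", "b1"), ("EE", "a2"), ("EE", "a1"),
]
remaining = existential_match_prune(events)
print(remaining)
```

Under this simplified pattern only ‘a2’ remains buffered, because ‘a1’ closes without a b-child. A single Boolean per open element suffices; the matching instances themselves are never enumerated.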
The disclosed process for querying streaming XML data may best be described with reference to a flow diagram 90, shown in
The SAX elements may be filtered by means of a query plan filter, at step 99. The filter is based on the pattern of a query plan, and serves to eliminate data not conforming to one or more predetermined query plan patterns. Non-conforming elements may be discarded, at step 101, and additional data inputted, at step 91. Conforming elements may be added or appended to the tail of each of one or more dynamic element lists having the same tag as the new element, at step 103. A determination may be made, at decision box 105, as to whether the corresponding cursor Cq has changed. Since a cursor points to the head of an element list, a cursor change may occur when a new element has been added or appended to an empty element list. If the cursor Cq is unchanged, the process may proceed to input additional XML data, at step 91.
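The parsing, filtering, and appending steps may be sketched with the SAX parser of the Python standard library. The handler name `RegionCodeHandler`, the numbering scheme for region codes, and the tag-only filter are assumptions made for illustration; the filter of the disclosed process is based on query plan patterns, and the decision table and holistic twig join steps are omitted here.

```python
import xml.sax
from collections import defaultdict, deque


class RegionCodeHandler(xml.sax.ContentHandler):
    """Assign region codes and append conforming elements to tag queues."""

    def __init__(self, query_tags):
        super().__init__()
        self.query_tags = set(query_tags)
        self.queues = defaultdict(deque)  # one dynamic list per tag
        self.counter = 0
        self.open_elements = []           # elements awaiting their EE event

    def startElement(self, name, attrs):
        self.counter += 1
        element = {
            "tag": name,
            "start": self.counter,
            "end": None,                  # open-ended until EE arrives
            "level": len(self.open_elements) + 1,
        }
        self.open_elements.append(element)
        if name in self.query_tags:       # simplified filter: tag only
            self.queues[name].append(element)
        # Elements with other tags are tracked for nesting but not buffered.

    def endElement(self, name):
        self.counter += 1
        self.open_elements.pop()["end"] = self.counter


handler = RegionCodeHandler({"a", "b"})
xml.sax.parseString(b"<a><b/><x/><b/></a>", handler)
for tag in sorted(handler.queues):
    print(tag, [(e["start"], e["end"], e["level"]) for e in handler.queues[tag]])
```

A cursor change in the sense described above would correspond to an append onto a queue that was empty immediately beforehand.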
If an incoming event or element has been encountered, at decision box 105, the cursor Cq may have changed and a decision table may be used to return a query node whose cursor element is being processed. That is, a non-blocked query node may be returned, even if some query nodes remain in a blocked state. The resultant query node is returned, per the decision table, and a determination is made, at decision box 109, as to whether the corresponding query node cursor is in a blocked state. If the corresponding query node cursor is blocked, the process may resume by inputting additional XML data, at step 91. If the corresponding query node cursor is not blocked, the cursor element may be processed using a holistic twig join process, at step 111, and additional XML data may be obtained, at step 91. After the cursor element has been processed, the cursor element may be discarded, and the cursor may point to the next element in the element list. If the element list has only a single element, the cursor may become blocked at this step.
Embodiments of the invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment containing both hardware and software elements. In a preferred embodiment the invention is implemented in software that includes, but is not limited to, firmware, resident software, and microcode. Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a propagation medium. Examples of computer-readable media include: a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, and an optical disk. Current examples of optical disks include: compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W), and digital versatile disk (DVD).
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output devices (including, but not limited to, keyboards, displays, and pointing devices) may be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable coupling of the data processing system to other data processing systems or to remote printers or to storage devices through intervening private or public networks via transmission paths such as digital and analog communication links. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
It should be understood that, while the invention has been described in the context of fully functioning computers and computer systems, those skilled in the art will appreciate that the various embodiments of the invention are capable of being distributed as a software and firmware product in a variety of forms, and that the invention applies equally regardless of the particular type of signal bearing medium used to convey the distribution. Moreover, it should be understood that the foregoing relates to exemplary embodiments of the invention and that modifications may be made without departing from the spirit and scope of the invention as set forth in the following claims.
| | Number | Date | Country |
---|---|---|---|
Parent | 11835901 | Aug 2007 | US |
Child | 12134080 | | US |