The present invention generally relates to analyzing XML documents and, more particularly, to supporting aggregation operations that exploit structural properties of the XML documents.
Throughout the instant disclosure, numerals in brackets ([ ]) are keyed to the list of numbered references towards the end of the disclosure.
Over the years, structured aggregation operations for Online Analytical Processing (OLAP) have been studied extensively. Traditional OLAP systems view data using a logical multi-dimensional representation. Vassiliadis and Sellis present a survey of logical models for OLAP computations [6]. Gray et al. first introduced the OLAP CUBE operator [4]. Most database vendors support OLAP in their database systems, and most OLAP operators, such as GROUP BY, ROLLUP, DRILLDOWN, and CUBE, are supported in the SQL standard as well [2, 6].
Recently, value-based grouping has been investigated for XQuery [1, 3]. This approach, however, cannot express structural grouping operations.
A need has been recognized in connection with improving upon the shortcomings and disadvantages presented by conventional efforts.
In accordance with at least one presently preferred embodiment of the present invention, there is broadly contemplated the introduction of extensions to query processing systems for XML documents that allow the analysis of such documents via grouping and aggregation operations. There is assumed the existence of an analysis module for extracting information on how parts of an XML document interrelate with other parts (e.g., document node hierarchies). This information is then used together with a user query (that is extended to include aggregation operators) in order (1) to partition the nodes of the document in various ways and (2) to compute and output the aggregation value of each such partition. To these ends, there are provided new query operators and extensions to query processing systems comprising a hierarchical node list generator and a hierarchical node list processor. The former takes the grouping information from the query as input and generates document node partitionings. The latter takes the node partitionings as input and computes aggregation values for each partition and generates a query result that is returned to the user.
For the partitioning of the nodes, two types of operators are broadly contemplated herein: grouping by multiple independent parts of the document (in order to analyze various scenarios and viewpoints) and grouping by dependent parts of the document (in order to analyze a document at different “zoom levels”). It is to be appreciated that all extensions are compatible with (but orthogonal to) existing query processing systems and algorithms.
In summary, one aspect of the invention provides a system for performing structured aggregation of XML documents, the system comprising: an arrangement for reading scoped dimension information for an input XML document; an arrangement for accessing tree information of an input XML document; an arrangement for parsing a user query; an arrangement for executing the user query; and an arrangement for returning a query result as a hierarchical structure.
Another aspect of the invention provides a method of performing structured aggregation of XML documents, the method comprising the steps of: reading scoped dimension information for an input XML document; accessing tree information of an input XML document; parsing a user query; executing the user query; and returning a query result as a hierarchical structure.
Furthermore, an additional aspect of the invention provides a program storage device readable by machine, tangibly embodying a program of instructions executed by the machine to perform method steps for performing structured aggregation of XML documents, the method comprising the steps of: reading scoped dimension information for an input XML document; accessing tree information of an input XML document; parsing a user query; executing the user query; and returning a query result as a hierarchical structure.
For a better understanding of the present invention, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings, and the scope of the invention will be pointed out in the appended claims.
Some background information of interest may be found in the copending and commonly assigned U.S. Patent Application entitled “Methods and Systems for Analyzing XML Documents”, which is filed concurrently with the instant application and which is hereby fully incorporated by reference as if set forth in its entirety herein.
A general architecture in which the embodiments of the present invention may preferably be embedded, can be seen in
At least one presently preferred embodiment of the present invention relates to inner workings of the query processor (114) with respect to aggregation queries. It assumes the document parser (102) is provided by some other means, for example by an XML document parser. It also assumes that the scoped dimension analyzer (110) and the analytical model builder (120) are provided by some other means. It also assumes that the part of the query processor (114) that handles non-aggregation queries is provided by some other means, for example an XQuery processor for XML documents. Further discussion of these various components is not included herein as these components are peripheral to the present discussion.
In the following detailed description, it is assumed that the hierarchical document is an XML document and that the user queries are XQuery statements. This, however, is only for the purpose of illustration and does not limit the embodiments of the present invention solely to such types of documents or queries.
Each input XML document is assumed to be represented internally by a graph structure which includes nodes that represent the tags and edges that represent the nesting of tags. As an example for illustrative purposes herein, it can be assumed the input document is describing employee information, including a reporting structure, as shown in
The scoped dimension descriptor structure (112 in
It is assumed, for the present discussion, that the scoped dimension descriptor (112) can be logically represented by a graph structure as shown in
“/Division/Department/Department” and
“/Division/Department/Employee” are independent dimensions while
“/Division” and “/Division/Department” are dependent dimensions.
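The distinction between dependent and independent dimensions can be illustrated with a small Python sketch. It assumes, as a reading of the examples above rather than a definition given in the disclosure, that two dimension paths are dependent when one is a prefix of the other (i.e., one lies within the scope of the other). The helper names `steps` and `are_dependent` are hypothetical.

```python
def steps(path):
    """Split a dimension path such as "/Division/Department" into its steps."""
    return [s for s in path.split("/") if s]


def are_dependent(a, b):
    """Assumed rule: two paths are dependent if one is a prefix of the other."""
    shorter, longer = sorted((steps(a), steps(b)), key=len)
    return longer[:len(shorter)] == shorter


# The two pairs of dimensions named in the text above.
independent_pair = ("/Division/Department/Department", "/Division/Department/Employee")
dependent_pair = ("/Division", "/Division/Department")
```

Under this assumed rule, `are_dependent(*dependent_pair)` is true and `are_dependent(*independent_pair)` is false, which agrees with the classification stated above.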
The analytical model in the present example could be represented by pointers pointing from each node of the dimension descriptor structure to one or more nodes in the original document tree. For example, node “Division” in
It is assumed, for the present discussion, that a query over XML documents includes a part that describes which nodes of the document to select and a part that describes what to do with those selected nodes in order to generate a query result. In XQuery, for example, the
FOR $e IN //EMPLOYEE
WHERE $e/SALARY > 40000
ORDER BY $e/SALARY
RETURN <result> <name> $e/name </name> <salary> $e/salary </salary> </result>
selects EMPLOYEE nodes whose salary is greater than 40,000 and sorts them by salary. In the
In the context of the embodiments of the present invention, it may be assumed that XQuery is extended by aggregation operators that affect both the selection part and the processing part. For selection, it may be assumed that additional operators are available that allow the grouping of nodes. In case of XQuery, the extended query expression may appear as follows (keeping in mind, of course, that any precise syntax employed need not be limited to the syntax used in this example):
FOR $e IN //EMPLOYEE
WHERE $e/SALARY > 40000
GROUP BY $e/GENDER AS $g
RETURN
This query would first group all employees by gender (line 3) and then assemble the result group-wise in the loop in line 6. The result could be, for example,
It is to be appreciated that the grouping (in this example, by gender) is an independent step from the computation of the aggregation function (in this example, AVG). This means that different types of groupings can be combined with different types of aggregation functions. In accordance with at least one preferred embodiment of the present invention, four different types of grouping operators are defined. All grouping operators require the analytical model (122) for proper grouping. The aggregation functions are not limited by the embodiments of the present invention. Any aggregation function that was used in the original query language can be used in the extended version as discussed herein.
When no specific result format is required, the above query can be simplified to:
FOR $e IN //EMPLOYEE
WHERE $e/SALARY > 40000
GROUP BY $e/GENDER AS $g
RETURN
<result> AVG($g/SALARY) AS avgSalary </result>
Here, $g/SALARY will be expanded implicitly into a loop as shown above (without the gender component in the result set).
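The separation between the grouping step and the aggregation step can be sketched in Python as follows. This is only an illustration of the semantics of the simplified query; the employee records, their field names, and the function `avg_salary_by_gender` are invented for the example and do not come from the disclosure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records standing in for EMPLOYEE nodes.
employees = [
    {"name": "Ann", "gender": "F", "salary": 52000},
    {"name": "Bob", "gender": "M", "salary": 45000},
    {"name": "Cid", "gender": "M", "salary": 38000},
    {"name": "Dee", "gender": "F", "salary": 61000},
]


def avg_salary_by_gender(records, min_salary=40000):
    # Selection (WHERE) and grouping (GROUP BY $e/GENDER AS $g).
    groups = defaultdict(list)
    for e in records:
        if e["salary"] > min_salary:
            groups[e["gender"]].append(e["salary"])
    # Aggregation (AVG($g/SALARY)), applied to each group independently.
    return {gender: mean(salaries) for gender, salaries in groups.items()}


result = avg_salary_by_gender(employees)
```

Because the aggregation is applied only after the partitions exist, `mean` could be swapped for `max`, `min`, or `sum` without touching the grouping code.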
As another example, the GROUP BY clause in the above example may be replaced by
GROUP BY ($e/GENDER, $e USING height($e) AS level)
In this case, the employees will be grouped first by gender and then within a gender group, by the height in the document tree (i.e., in this case the level in the organizational hierarchy). It is to be appreciated that the use of $e in both grouping expressions ensures that the same node is referred to. It is also to be appreciated that value-based (e.g., gender) and structure-based (e.g., height) attributes can be mixed arbitrarily in the grouping expression.
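A minimal Python sketch of such a mixed grouping is given below. It assumes that height means the number of edges on the longest downward path from a node to a leaf of the reporting structure; the `Employee` class, the sample hierarchy, and all function names are hypothetical.

```python
from collections import defaultdict


class Employee:
    def __init__(self, name, gender, reports=()):
        self.name = name
        self.gender = gender
        self.reports = list(reports)  # nested (subordinate) employees


def height(e):
    """Assumed definition: edges on the longest path down to a leaf."""
    return 0 if not e.reports else 1 + max(height(r) for r in e.reports)


def walk(e):
    """Yield e and all nested employees in document order."""
    yield e
    for r in e.reports:
        yield from walk(r)


def group_by_gender_and_height(root):
    # Outer key: value-based attribute; inner key: structure-based attribute.
    groups = defaultdict(lambda: defaultdict(list))
    for e in walk(root):
        groups[e.gender][height(e)].append(e.name)
    return {g: dict(by_height) for g, by_height in groups.items()}


root = Employee("Ann", "F", [
    Employee("Bob", "M", [Employee("Cid", "M"), Employee("Dee", "F")]),
    Employee("Eve", "F"),
])
grouped = group_by_gender_and_height(root)
```

Both grouping keys are computed from the same node `e`, which mirrors the use of $e in both grouping expressions.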
The examples just above used independent dimensions to perform the grouping and aggregation. As another example, aggregation can be performed using the dependent dimensions of XML documents. The expression
GROUP BY COLLAPSE ($r//SALARY, $r//DIVISION)
will group employee salaries first by lowest level SALARY nodes' values, then by the next higher level nodes' values, and so on, until the DIVISION nodes. In other words, when computing the average value again, first the individual salaries would be returned, then the average salaries on the lowest department levels, then the next higher department levels, and so on, until the average salaries per division.
Overall, four operators for grouping are broadly contemplated in accordance with at least one embodiment of the present invention. Each operator is listed in more detail in the following together with an example using
1. GROUP BY(p1 USING f1 AS $t1, . . . , pn USING fn AS $tn) AS $g
The regular GROUP BY operator takes n independent dimensions as arguments and generates a list of document node lists by successively replacing each grouping expression “pi USING fi AS $ti” with a list of nodes (in document order) that are addressed by pi and share the same value under function fi. If no function is given, the value of pi itself is used. The “AS” expressions are used for naming the expressions preceding them.
GROUP BY($r//Type, $r//Salary)→(((13, 14),(17, 18)),((7, 8)),((10, 11)))
The numbers in the result list indicate the node ids from
2. GROUP BY EXPAND((p1, p2) USING f1 AS $t1) AS $g
The expanding GROUP BY operator takes two dependent dimensions p1 and p2 which indicate the start and end of the expansion (p2 must be in the scope of p1). It generates a list of node lists by successively grouping the nodes selected by p2 based on the values of the current expansion of p1 towards p2 as computed by f1 (e.g., if p1=“$r//A” and p2=“$r//A/B/C”, then the expansions of p1 towards p2 are $r//A, $r//A/B, and $r//A/B/C). The other syntactic components are as described before.
GROUP BY EXPAND(($r//Division, $r//Salary))→(((8, 11, 14, 18)), ((8, 11, 14), (18)), ((8, 11, 14), (18)), ((8), (11), (14), (18)))
3. GROUP BY COLLAPSE((p1, p2) USING f1 AS $t1) AS $g
The collapsing GROUP BY operator takes two dependent dimensions p1 and p2 which indicate the start and end of the reduction (p1 must be in the scope of p2). It generates a list of node lists similar to the expanding GROUP BY but starting at the lowest level and “expanding backwards” towards the higher levels.
GROUP BY COLLAPSE(($r//Salary, $r//Division))→(((8), (11), (14), (18)), ((8, 11, 14), (18)), ((8, 11, 14), (18)), ((8, 11, 14, 18)))
4. GROUP BY TREE(p1 USING f1 AS $t1, . . . , pn USING fn AS $tn) AS $g
The tree-based GROUP BY operator is similar to the basic GROUP BY operator but in addition to replacing each grouping expression with nodes based on some function value, it also replaces each grouping expression with NULL which essentially means “any value”. The grouping expressions (a,b,c) will therefore be replaced by (NULL, NULL, NULL), (a1, NULL, NULL), (a2, NULL, NULL), . . . , (NULL, b1, NULL), . . . , (a1, b1, NULL), . . . , (a1, bj, ck).
GROUP BY TREE($r//Type, $r//Salary)→(((7,8), (10,11), (13,14), (17,18)), ((7,8), (13,14), (17,18)), ((10,11)), ((7,8)), ((10,11)), ((13,14),(17,18)), ((13,14),(17,18)), ((7,8)), ((10,11)))
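The regular and tree-based operators can be contrasted with a short Python sketch. It works on flat records rather than on a document tree, uses `None` in place of NULL, and enumerates the NULL patterns by increasing number of bound dimensions; the sample nodes, their ids and values, and the function names are all hypothetical and are not the document from the drawings.

```python
from itertools import combinations


def group_by(nodes, dims):
    """Partition node ids by their values on dims, keeping document order."""
    parts = {}
    for node in nodes:
        key = tuple(node[d] for d in dims)
        parts.setdefault(key, []).append(node["id"])
    return parts


def group_by_tree(nodes, dims):
    """Like group_by, but each dimension may also be None ("any value")."""
    out = []
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):  # dimensions that stay bound
            for key, ids in group_by(nodes, kept).items():
                pattern = tuple(
                    key[kept.index(d)] if d in kept else None for d in dims
                )
                out.append((pattern, ids))
    return out


# Hypothetical employee nodes with two independent dimensions.
nodes = [
    {"id": 1, "type": "full", "salary": 50},
    {"id": 2, "type": "part", "salary": 30},
    {"id": 3, "type": "full", "salary": 70},
    {"id": 4, "type": "full", "salary": 70},
]
flat = list(group_by(nodes, ("type", "salary")).values())
tree = group_by_tree(nodes, ("type", "salary"))
```

For this sample, the regular grouping yields three partitions, while the tree-based grouping yields nine: one for (None, None), two for the types alone, three for the salaries alone, and three for the fully bound combinations.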
Any user query with or without aggregation extensions may be transformed into a query tree that is an abstract representation of the query. The extensions proposed in accordance with at least one embodiment of the present invention can be realized as one more operand branch (e.g., for GROUP BY) in the node selection component of the query tree. This way, any existing query optimization techniques can be reused without change for the new extended aggregation operators.
In accordance with a preferred embodiment of the present invention, a query processor (114) and extensions (116) are laid out in greater detail in
Both the normal node list generated by the node generator (200) and the hierarchical node list generated by the hierarchical node list generator (202) are then merged into one list by a node merger (204). The merging can be accomplished by simple concatenation of the lists or through a more complex operation. It is important, however, that the result is a hierarchical node list.
Finally, a node processor (206) preferably takes the merged node list and aggregation information from the query and generates the query result (108). The aggregation information from the query is extracted by the aggregation function analyzer (210) and includes information on the aggregation operation to use (e.g., AVG, MAX, MIN, SUM) and naming information to generate a correctly formatted result.
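The role of the node processor can be sketched in Python as a recursive walk over a hierarchical node list that replaces each innermost node list with its aggregation value. The nested-list representation, the `AGGREGATES` table, the function `process`, and the salary values attached to the node ids are assumptions made for illustration only.

```python
from statistics import mean

# Aggregation operations named in the text, mapped to Python callables.
AGGREGATES = {"AVG": mean, "MAX": max, "MIN": min, "SUM": sum}


def process(node_list, values, op):
    """Replace every innermost list of node ids by its aggregate value."""
    if node_list and all(not isinstance(x, list) for x in node_list):
        return AGGREGATES[op]([values[n] for n in node_list])
    return [process(sub, values, op) for sub in node_list]


# Hypothetical salary values keyed by node id.
values = {8: 40, 11: 60, 14: 50, 18: 90}
hierarchical = [
    [[8, 11, 14, 18]],
    [[8, 11, 14], [18]],
    [[8], [11], [14], [18]],
]
averages = process(hierarchical, values, "AVG")
totals = process(hierarchical, values, "SUM")
```

The result keeps the nesting of the input, so the output remains a hierarchical structure regardless of which aggregation function is chosen.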
The disclosure now turns to a detailed discussion of preferred embodiments of the hierarchical node list generator (202) and the node processor (206).
The hierarchical node list generator (202) receives aggregation information from the aggregation information analyzer (208) and an analytical model (122) derived from the XML document. It then determines which of the four supported grouping operators is present (if any) in the aggregation information. If none is found, the resulting hierarchical node list is empty. Otherwise, one of the following methods (based on the grouping operator) is executed to generate a hierarchical node list.
GROUP BY Operation
Referring to
GROUP BY EXPAND Operation
Referring to
This means that f is applied to the nodes of all current node pointers and the resulting value determines which node pointers will be grouped together in one partition set. Once this partitioning is done, a new list of node lists M is initialized (616), populated (618), and added to L as a new element (620). Populating M entails, for each partition set, adding a list of document nodes N1, . . . , Nk to M such that each Ni is a node with path p2 that is located in the subtree of one of the partition set nodes. Once M has been added to L, it is again checked whether some node pointers can be further updated or whether all pointers have reached their final node (608). For this operation, L is a list of lists of node lists.
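A simplified Python sketch of the expansion is given below. It does not reproduce the pointer-based flow described above; instead it assumes that paths are given as explicit tag sequences from the document root, re-selects the nodes for each path prefix, and by default places every node in its own partition (f returns the node id). The `Node` class, the sample tree, and the function names are hypothetical. The collapsing variant is obtained here simply by reversing the order of the levels.

```python
class Node:
    def __init__(self, nid, tag, children=()):
        self.nid, self.tag, self.children = nid, tag, list(children)


def select(root, path):
    """Nodes reached from root by following the tag sequence in path."""
    if root.tag != path[0]:
        return []
    if len(path) == 1:
        return [root]
    return [n for c in root.children for n in select(c, path[1:])]


def group_by_expand(root, p1, p2, f=lambda n: n.nid):
    assert p2[:len(p1)] == p1, "p2 must be in the scope of p1"
    L = []
    for depth in range(len(p1), len(p2) + 1):
        prefix, rest = p2[:depth], p2[depth - 1:]
        partitions = {}  # f-value -> nodes of the current expansion
        for n in select(root, prefix):
            partitions.setdefault(f(n), []).append(n)
        # M: one list of p2 nodes per partition set.
        M = [[leaf.nid for m in members for leaf in select(m, rest)]
             for members in partitions.values()]
        L.append(M)
    return L


def group_by_collapse(root, p1, p2, f=lambda n: n.nid):
    # Same levels as the expansion from p2 down to p1, lowest level first.
    return group_by_expand(root, p2, p1, f)[::-1]


doc = Node(1, "Division", [
    Node(2, "Department", [
        Node(3, "Employee", [Node(4, "Salary")]),
        Node(5, "Employee", [Node(6, "Salary")]),
    ]),
    Node(7, "Department", [Node(8, "Employee", [Node(9, "Salary")])]),
])
high = ["Division"]
low = ["Division", "Department", "Employee", "Salary"]
expanded = group_by_expand(doc, high, low)
collapsed = group_by_collapse(doc, low, high)
```

In both cases the result is a list of lists of node lists, one inner list of partitions per level between the two paths.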
GROUP BY COLLAPSE Operation
The GROUP BY COLLAPSE operation is similar to the GROUP BY EXPAND operation but instead of considering path expansions, it considers path contractions. Referring to
GROUP BY TREE Operation
The GROUP BY TREE operation is similar to the GROUP BY operation but each component of the grouping can also assume the special value NULL. Referring to
Once the grouping of the nodes is completed, and the hierarchical and regular node lists are merged (204), the nodes are processed in order to compute the aggregation values for each node list. Referring to
It is to be understood that the present invention, in accordance with at least one presently preferred embodiment, includes an arrangement for reading scoped dimension information for an input XML document, an arrangement for accessing tree information of an input XML document, an arrangement for parsing a user query, an arrangement for executing the user query, and an arrangement for returning a query result as a hierarchical structure. Together, these elements may be implemented on at least one general-purpose computer running suitable software programs. These may also be implemented on at least one Integrated Circuit or part of at least one Integrated Circuit. Thus, it is to be understood that the invention may be implemented in hardware, software, or a combination of both.
If not otherwise stated herein, it is to be assumed that all patents, patent applications, patent publications and other publications (including web-based publications) mentioned and cited herein are hereby fully incorporated by reference herein as if set forth in their entirety herein.
Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.
1. K. Beyer, R. Cochrane, L. Colby, F. Ozcan, H. Pirahesh, XQuery for Analytics: Challenges and Requirements, XIME-P:2004, pages 3-8, 2004.
2. S. Chaudhuri and U. Dayal, An Overview of Data Warehousing and OLAP Technology, ACM SIGMOD Record, 26(1):65-74, 1997.
3. World Wide Web Consortium. W3C Architecture Domain: XML. www.w3c.org/xml Online Documents.
4. J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh, Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals, Data Mining and Knowledge Discovery, 1(1):29-53, March 1997.
5. S. Paparizos, S. Al-Khalifa, H. V. Jagadish, L. V. S. Lakshmanan, A. Nierman, D. Srivastava, and Y. Wu, Grouping in XML, In EDBT Workshop 2002, pages 128-147, 2002.
6. P. Vassiliadis and T. Sellis, A Survey of Logical Models for OLAP Databases, ACM SIGMOD Record, 28(4):49-64, 1999.