XML (Extensible Markup Language) is a system for defining, validating, and sharing document formats. XML uses tags to distinguish document structures, and attributes to encode extra document information. The XML semi-structured data model has become the model of choice in both data and document management systems because it can represent irregular data while preserving whatever structure the data has. Thus, XML has become the data model of many state-of-the-art technologies such as XML web services. Web service response times have a large impact on the response time of the front-end application, since the front-end application may invoke multiple web service operations to serve a single end-user request.
Caching data by maintaining materialized views (or query results) has many well-known benefits; one of the major benefits is improving query performance by answering queries from the cache instead of querying the source data. To be useful, however, a materialized view needs to be continuously maintained so that the cache reflects dynamic source updates. The problem of efficient incremental view maintenance has been addressed extensively in the context of relational data models, but only a few works have addressed it in the context of semi-structured data models.
Current web services caching approaches, e.g. the approach of Microsoft's .NET framework, follow a time-based invalidation scheme in which the cached results are invalidated after a pre-specified time period (lifetime). The drawbacks of such a scheme are: (1) the cached results are likely to be over-invalidated since the invalidation process does not take into account the relevance of the source updates to the cached results; (2) the invalidation operation implies recomputing the views whenever they are required again, and this recomputation is generally expensive; and (3) the “freshness” of the cached results is not guaranteed because source updates may take place just after a result has been cached, and the effect of these updates will not be reflected in the cache until the lifetime of the cached result expires. This might be inappropriate for critical applications which require a high level of consistency between the source and the cache.
The XML views maintained at the cache are assumed to be the results of certain queries (view specifications) issued against a source XML document. The W3C consortium is currently working towards standardizing XPath and XQuery as XML query and view specification languages. Path expressions form the core of the XPath and XQuery languages: they are the language constructs which are used to select and retrieve data from XML data sources. The retrieved data can be manipulated by other language constructs to form the final XML query result. Therefore, caching the results of path expressions could be potentially beneficial to answer general XML queries efficiently.
Generally, in order to maintain cached views, a maintenance algorithm needs to issue queries to the data source; querying the source is generally an expensive operation in terms of time and processing since the data source is usually large. Conventional techniques for providing incremental view maintenance for structured data such as XML data are inapplicable to Web service caching and many other practical use cases due to the following limitations: (1) the view specification models and source update models are very limited, and (2) the amount of additional data stored for maintenance (intermediate results) can be arbitrarily large regardless of the size of the cached view results.
Systems and methods are disclosed for providing view maintenance by buffering one or more search results in a cache; and incrementally maintaining the search results by analyzing a source data update and updating the cache based on a relevance of the update to the search results.
Advantages of the system may include one or more of the following. The system provides incremental maintenance of views defined over XML documents using path expressions. The system minimizes the number and the size of the source queries which are used to maintain the cached results. The incremental view maintenance updates cached views to reflect source updates without a full recomputation of views. As a result, the system provides fast, scalable update management of distributed content with interdependencies. The system also enables efficient Web service cache management that addresses performance issues of Web services. The solutions can be applied to other XML content dependency management applications such as: (1) XML content delivery, including RSS dissemination, and (2) scalable configuration management of distributed systems (such as grid applications) through change dependency monitoring.
Other advantages can be as follows. The view specification language is powerful and standardized enough to be used in realistic applications. The size of the auxiliary data maintained with the views is upper bounded; it depends on the expression size and the answer size regardless of the source data size. The system does not require a source schema; the source data can be any general well-formed XML document. Moreover, the system off-loads processing from the back-end application to provide web services scalability. Thus, maintaining XML views is an integral problem that needs to be handled efficiently. Further, the view definitions are not restricted to monotonic ones. That is, the system handles cases where an addition in the source could result in an addition or a deletion in the view. Similarly, the system handles cases where a deletion in the source could result in an addition or a deletion in the view.
The system also preserves the privacy of the data source; it is not required that the definitions of the expression predicates be disclosed for the maintenance algorithm to do its job. Only the expression axis and label tests are required. The predicate definitions might include any proprietary user defined functions. This privacy-preserving property is essential for web service caching projects where the web service provider might not be willing to disclose all the details of the view definitions (web service operations) to a third-party that is caching the web service responses.
The source data system 20 includes data 22, which is structured data such as XML data, as well as an update engine 24 that updates the maintainer 16. A search query would access the cached views 14 if the cached data provides a current response. Otherwise, the query would access the source data 22 to formulate an answer to the query.
In one embodiment, the data 22 contains documents that conform to the Extensible Markup Language. The data uses tags (for example <em>emphasis</em> for emphasis), to distinguish document structures, and attributes (for example, in <A HREF=“http://www.xml.com/”>, HREF is the attribute name, and http://www.xml.com/ is the attribute value) to encode extra document information.
The pictorial illustration of
The label has the following properties:
Based on the definition of node labels, a selection condition in a query involving the node name, kind, or type is represented as a label test. For example, a condition that retrieves ‘book’ elements is a label test and a condition that retrieves nodes storing values greater than 5 is also a label test. A label test could also be the wildcard character “*” which matches all labels.
The XML tree of
Path expressions are the basic building blocks of XML queries. A path expression E of size N is a sequence of N steps: (s1, s2, . . . sN). A step si is a triple <si.axis, si.label, si.pred> where:
Processing of the first step s1 starts at a pre-specified sequence of nodes in the source tree called the expression context C. Given an expression E, a document tree D, and a sequence of context nodes C (a sequence of some of the nodes of D), a query Q, denoted as Q=q(E, C, D), returns a sequence of nodes R as a result. Conceptually, the execution of si (i>1) starts at the sequence output by executing si−1. The intermediate result of step si (1≤i≤N) is defined as Ri=q(si, Ri−1, D), with R0=C.
Every Ri (1≤i≤N) is a sequence of nodes ordered by the document order. The final result R is defined as the result of the last step; i.e. R=RN.
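The step-by-step evaluation Ri=q(si, Ri−1, D) can be illustrated with a minimal Python sketch. The tree below is hypothetical (it is not the tree of the figures), the tuple encoding of a step and the helper names `run_step`, `q`, and `count_desc` are assumptions made for illustration, and only the child and descendant axes are modeled.

```python
# Hypothetical document tree: node id -> child ids in document order.
TREE = {
    "R": ["A1"],
    "A1": ["B1"],
    "B1": ["B2", "E1"],
    "B2": ["C1"],
    "C1": ["D1"],
    "D1": [],
    "E1": [],
}
LABEL = {n: n.rstrip("0123456789") for n in TREE}  # "B2" -> "B"


def descendants(n):
    """Proper descendants of n in document (pre-order) order."""
    out = []
    for c in TREE[n]:
        out.append(c)
        out.extend(descendants(c))
    return out


DOC_ORDER = {n: k for k, n in enumerate(["R"] + descendants("R"))}


def count_desc(n, label):
    """Count(//label), evaluated only inside the subtree rooted at n."""
    return sum(1 for d in descendants(n) if LABEL[d] == label)


def run_step(step, context):
    """One step: apply the axis, the label test, then the predicate."""
    axis, label, pred = step
    out = []
    for c in context:
        candidates = TREE[c] if axis == "child" else descendants(c)
        out.extend(n for n in candidates
                   if label in ("*", LABEL[n]) and pred(n))
    # Stable sort: duplicates (multi-derivations) survive, in document order.
    return sorted(out, key=DOC_ORDER.__getitem__)


def q(steps, context):
    """Return [R0, R1, ..., RN] with R0 = C."""
    results = [list(context)]
    for s in steps:
        results.append(run_step(s, results[-1]))
    return results


always = lambda n: True
E = [
    ("child", "A", always),
    ("descendant", "B", always),
    ("descendant", "C", lambda n: count_desc(n, "E") == 0),
    ("descendant", "D", always),
]
R0, R1, R2, R3, R4 = q(E, ["R"])
print(R2, R3, R4)
```

In this sketch C1 is reached once through B1 and once through B2, so it appears twice in R3, and D1 consequently appears twice in the final result.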
For example, consider the query Q=q(E, C, D) where: D is the document tree of
s1=/A
s2=//B [Count (//E)≧1 OR Count(/D)≧1]
s3=//C [Count (//E)=0]
s4=//D
In this query, the first step s1 starts at every node in C and selects all children with label A; this results in R1=(A1, A2, A3). Then s2 starts at every node in R1 and selects all the descendants with label B that have at least one descendant labeled E or at least one child labeled D; this results in R2=(B2, B3, B4, B5). Starting at R2, step s3 selects all the descendants labeled C that have no descendants labeled E; this results in R3=(C3, C4, C5, C5). Finally, s4 starts at R3 and selects all the descendants labeled D. Hence, the final result of Q is R=R4=(D3, D3, D4, D4).
A node can be duplicated in the answer of any step. This shows the possibility of multiple derivations in path expression views. Multiple occurrences of the same node in a sequence are differentiated by using a numeric superscript. For example, the result R is denoted as R=(D31, D32, D41, D42), where D31 and D32 are the first and second occurrences of D3.
The incremental maintenance process uses the following definitions regarding path expressions:
In one embodiment, certain simplification/restrictions are maintained to achieve an efficient view maintenance. First, only child and descendant axes are handled in the axis test as the child and descendant axes are the most commonly used axes in practice. The other axis types, such as parent and ancestor, are not handled. Second, a Predicate can examine only the subtree of the node being tested. In other words: Predi(n), for all i, is exclusively evaluated by examining the subtree rooted at n. This simplification is based on the fact that a node in an XML document is semantically described by its descendants, and thus selecting a node should depend on its label and its descendants. With this approach, predicate evaluation can only be done at the source XML data. The benefit is that the predicates can be arbitrarily complex and the predicates can preserve the privacy/security of the XML data source.
To illustrate an update, the result R of an example expression E is cached at the client site and subsequently the following update takes place at the source tree of
In this example, U changed Predi(n) for only one node (n=B1) and one value of i (i=2). This change effectively added B1 to R2. Consequently, other nodes were added to other intermediate results without U changing any more predicates; these are nodes C1, C2, D1, and D2 in the example. Thus, an update U causes a node n to be added to an intermediate result Ri under one of two possible scenarios:
1. U changes Predi(n) from false to true,
2. U does not affect Predi(n).
The first case is a direct addition, and the second case is an indirect addition because it is caused indirectly through a direct addition. Similarly, direct deletion occurs when U changes Predi(n) from true to false, causing n to be deleted from Ri. Indirect deletion occurs when n is deleted from Ri without U affecting Predi(n). For example, if U=<Add, (R, X1, A1, B2, C3, E6)> then U directly deletes C3 from R3 because it changes Pred3(C3) from true to false. This direct deletion induces the indirect deletion of the first occurrence of D3 from R.
In the following discussion, δi+ denotes the sequence of all nodes that U directly adds to Ri, δi− denotes the sequence of all nodes that U directly deletes from Ri, and δi=δi+ ⊔ δi−. Each of δi+ and δi− can contain repetitions due to multi-derivation possibilities. δi+ and δi− are mutually disjoint because a node n cannot be directly added to and deleted from Ri at the same time; U cannot change Predi(n) from false to true and from true to false at the same time.
Since any indirect addition or deletion is originated by a direct one, an embodiment of the maintenance process determines all direct additions and deletions at Ri and then determines the indirect effects that are induced by the direct effects. Ultimately the process determines indirect effects on the cached result R. The indirect effects on all the intermediate results Ri, i<N are not required per se, but they can be used to discover the final effects on R.
To discover indirect effects from the direct ones, the process handles two cases:
1. When a node n is directly added to Ri, the maintenance algorithm has to issue a query to the source to determine the indirect additions that might happen due to this direct addition. For example, when B1 is added to R2, the indirectly added nodes C1, C2, D1, and D2 cannot be retrieved without querying the source because they did not exist at the cache before U occurred. In general, when a node n is directly added to Ri then, in order to retrieve the indirect additions at all Rj, j>i, the maintenance process needs to issue a source query with the singleton sequence (n) as context and with the step sequence (si+1, si+2, . . . sN). The query is denoted as: q((si+1, si+2, . . . sN), (n), D).
2. When a node n is directly deleted from Ri, the nodes of R that came to R because n used to belong to Ri are deleted from R. In other words, all the nodes r of R that have ResultPathi(r)=n are deleted from R. In the example, the direct deletion of C3 from R3 results in deleting D31 from R because ResultPath3(D31)=C3.
Once the result path of each node of R is known, the process discovers the necessary indirect deletions from R without issuing any source queries. The system thus keeps, with every node n∈R, the result path ResultPath(n).
The collection of all the result paths is kept as auxiliary data which is not itself a target, but it is just used to achieve efficient incremental maintenance of the cached result R. In one embodiment, this is the only auxiliary data used. No two result paths are the same; even if a single node from the source tree occurs multiple times in R, each occurrence will be associated with a different result path.
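As an illustration of how result paths allow indirect deletions to be found without a source query, the following Python sketch stores each result path as a tuple of node IDs. The tuple layout (context node at index 0, the node contributed to Ri at index i, the result node last), the sample paths, and the function name `indirect_deletions` are assumptions made for this sketch.

```python
# Auxiliary data: one result path per occurrence of a node in the cached
# result R. Layout assumed here: (context node, node in R1, ..., node in RN).
# The paths are hypothetical sample data.
aux = [
    ("X1", "A1", "B2", "C3", "D3"),
    ("X1", "A1", "B2", "C4", "D3"),
    ("X2", "A2", "B3", "C5", "D4"),
    ("X2", "A2", "B4", "C5", "D4"),
]


def cached_result(paths):
    """The cached result R is the last element of every result path."""
    return [p[-1] for p in paths]


def indirect_deletions(paths, i, n):
    """Split the paths into (kept, removed) after n is directly deleted from Ri.

    A result occurrence is removed exactly when the i-th element of its
    result path is n; no source query is needed.
    """
    kept = [p for p in paths if p[i] != n]
    removed = [p for p in paths if p[i] == n]
    return kept, removed


# Direct deletion of C3 from R3 removes only the occurrence of D3 that was
# derived through C3; the occurrence derived through C4 stays in R.
kept, removed = indirect_deletions(aux, 3, "C3")
print(cached_result(removed), cached_result(kept))
```

Because every occurrence carries its own path, two occurrences of the same source node are handled independently, which matches the statement that no two result paths are the same.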
The keeping of the result paths is not equivalent to keeping all the intermediate results Ris. In particular, if a node n in Ri does not lead to a node in R then the process does not keep n in the auxiliary data. For example, in the example
/A//B[Count(//E)≧1 OR Count(/D)≧1]//C[Count(//E)=0]//D
The size of the auxiliary data is bounded regardless of the size of the source tree. Each result path is of length N+1 and, with M denoting the size of the cached result R, there are M result paths, so the size of the auxiliary data is O(M * N). The process stores only the node IDs in the result paths; the node labels are not needed. This limits the size of the auxiliary data because the node IDs are machine generated as compact codes.
The determination of the direct effects is discussed next. This determination is done in two phases for every Ri: 1) the Axis&Label test and 2) the Predicates test.
(1) The Axis & Label Test. For every Ri, determining the sequence of direct effects δi requires querying the source, because it might involve predicate evaluations to determine the nodes n for which Predi(n) has changed due to U. Since the number of source queries is to be minimized, the Axis & Label phase identifies, without any source queries, a sequence Δi such that δi⊆Δi. In the Predicates Test phase, Δi is further filtered by predicate evaluations to identify the exact sequence δi. The Axis & Label Test thus works as a first-level filter for identifying δi. It relies on the fact that if, due to U, a node n belongs to δi for any i, then n must also belong to U.path. This limits the search space to the nodes in U.path.
Although U.path has all the information needed to conduct the axes and labels tests needed to identify δi, it does not have enough information to evaluate the predicates at any of its nodes n because a predicate can refer to any node in the subtree of n. The process applies the Axes and Label tests to U.path, ignoring the predicates tests. The result is the sequence Δi which is a super-sequence of δi.
Computing the different Δi's proceeds similarly to computing the intermediate results Ri's of the original view specification query, except that the latter selects from the source tree D while the former selects from the single branch U.path. Any node n in any δi must have a node of the expression context C as an ancestor. Thus, the process initializes Δ0 to be all the context nodes that exist in U.path, i.e. Δ0=C∩U.path. After this initialization, the process determines Δi (for i≥1) as all the nodes in U.path that satisfy si.axis and si.label starting at nodes in Δi−1. This query is denoted as Δi=q(si.axis&label, Δi−1, U.path).
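The computation Δi=q(si.axis&label, Δi−1, U.path) can be sketched in Python as a walk along a single branch. The branch, the context, and the function name `axis_label_test` below are hypothetical; labels are taken to be the alphabetic prefix of a node ID purely for illustration, and predicates are deliberately ignored, as in the Axis & Label phase.

```python
def label_of(node_id):
    """Assumed convention for this sketch: 'B2' has label 'B'."""
    return node_id.rstrip("0123456789")


def axis_label_test(steps, context, u_path):
    """Return [Δ0, Δ1, ..., ΔN] computed on the branch u_path only.

    steps   : list of (axis, label) pairs, axis in {"child", "descendant"}
    context : set of context node IDs (the expression context C)
    u_path  : node IDs from the root down to the updated node
    """
    pos = {n: k for k, n in enumerate(u_path)}
    deltas = [[n for n in u_path if n in context]]  # Δ0 = C ∩ U.path
    for axis, label in steps:
        current = []
        for n in deltas[-1]:
            p = pos[n]
            # On a single branch a node has at most one child; its
            # descendants are all nodes that follow it on the branch.
            candidates = u_path[p + 1:p + 2] if axis == "child" else u_path[p + 1:]
            current.extend(m for m in candidates
                           if label in ("*", label_of(m)))
        deltas.append(sorted(current, key=pos.__getitem__))
    return deltas


steps = [("child", "A"), ("descendant", "B"),
         ("descendant", "C"), ("descendant", "D")]
u_path = ["R", "X1", "A1", "B1", "B2", "C1", "D1"]  # hypothetical branch
deltas = axis_label_test(steps, {"X1"}, u_path)
for i, d in enumerate(deltas):
    print(i, d)
```

No source access occurs anywhere in the sketch; everything is derived from U.path and the axis and label parts of the steps, so each Δi is only a candidate sequence that still has to pass the predicate test.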
The following example shows the computation of the Δis. In an update U of adding a node D6 as a child of D4, U.path is the tree branch that starts with the root R and ends with D6. Computing the different Δi's as described above results in: Δ0=(X2, X3), Δ1=(A2, A3), Δ2=(B3, B4, B5), Δ3=(C5, C5), Δ4=(D4, D4, D6, D6).
Δi is a supersequence of δi: there are nodes in Δi that are not directly added to or deleted from Ri. For the example shown above, using the predicates as defined in the example path expression, the only nodes that will be directly added are the two occurrences of D6 that appear in Δ4. The other nodes n in all the computed Δi's will not be added or deleted because U did not affect Predi(n). Note that because D6 did not exist before U occurred, the value of Predi(D6), for all i is false before U occurred. The same holds with deletion updates: if an update U deletes a node n from the source tree, the value of Predi(n) is false after U occurred.
(2) The Predicate Test. The Predicate Test identifies the sequence δi from the sequence Δi. To accomplish this task, the process determines which nodes n in Δi had their Predi(n) changed due to U. To detect such changes, the process compares, for every node, the values of Predi(n) before and after U occurred. The value before U occurred is referred to as Predibefore(n) and the value after U occurred as Prediafter(n). Nodes for which Predibefore(n)=Prediafter(n) are excluded because they are not affected by U. Nodes whose Predi(n) changes due to U are directly added to or deleted from Ri.
The values of Prediafter(n) and Predibefore(n) for every node n in Δi are determined as follows. The value of Prediafter(n) is computed simply by querying the source. This query will, in general, be processed very quickly as it just evaluates the predicate si.pred at node n in the source tree D. The returned value is true or false. This query is denoted as: predq(si.pred, (n), D).
The query is performed by a source query processor with the following benefits:
The value of Predibefore(n) cannot be computed by a source query because the update U has already been incorporated at the source. Instead, the value of Predibefore(n) is deduced as follows: if node n appears as the i-th element in the result path of any node in R then this implies that n was qualified for Ri before U occurred; hence, Predibefore(n)=true. Let RPi(n) be true if and only if n is the i-th element of the result path of any node in R, then RPi(n)=>Predibefore(n). This shows how the auxiliary data—which was originally intended to be used for discovering indirect deletions—could help in the predicate test as well. However, if RPi(n) is false then the value of Predibefore(n) cannot be determined because it may be false or true. Thus, if RPi(n) is false, there is an ambiguity about the value of Predibefore(n).
One implementation to resolve this situation includes in the auxiliary data all the nodes that qualify to be in any intermediate result Ri instead of only including those nodes that actually lead to nodes in the final result R. However, the size of the auxiliary data can become unbounded. In another implementation, the ambiguity is resolved by simply assuming that Predibefore(n) is false. This assumption does not affect the result of discovering the indirect effects in R.
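The predicate test under the second implementation (Predibefore(n) is taken to be false whenever RPi(n) is false) might be sketched as follows in Python. The sample result paths, the dictionary standing in for the source's answers to predq, and the names `rp` and `predicate_test` are assumptions made for illustration.

```python
# Hypothetical auxiliary data: result paths (context, R1, R2, R3, R4 node).
aux = [
    ("X1", "A1", "B2", "C3", "D3"),
    ("X1", "A1", "B2", "C4", "D3"),
]

# Stand-in for the source query predq(si.pred, (n), D): the value of
# Pred_i(n) after the update. A real system would ask the source.
pred_after_at_source = {"C3": False, "C4": True, "C7": True}


def rp(i, n):
    """RP_i(n): n is the i-th element of the result path of some node in R."""
    return any(path[i] == n for path in aux)


def predicate_test(i, delta_i, pred_after):
    """Split Δi into direct additions and direct deletions at Ri."""
    added, deleted = [], []
    for n in delta_i:
        before = rp(i, n)      # assumed false when n is on no result path
        after = pred_after(n)  # one small source query per candidate node
        if after and not before:
            added.append(n)
        elif before and not after:
            deleted.append(n)
        # before == after: U did not affect Pred_i(n), so n is skipped
    return added, deleted


added, deleted = predicate_test(3, ["C3", "C4", "C7"],
                                pred_after_at_source.__getitem__)
print(added, deleted)
```

Here C4 is unchanged, C3 is a direct deletion, and C7 is treated as a direct addition because it appears on no stored result path and its predicate holds after the update.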
In the first step of the loop, every Δi is computed from Δi−1. One implementation improves performance by excluding some nodes from Δi−1 before moving on to the computation of Δi in the next loop iteration. This results in a smaller Δi and hence in improved performance. The sequence obtained by reducing Δi is referred to as Λi. In order to discover all the ultimate effects on R, the process needs to start each iteration i only at the nodes n of the previous iteration for which the value of Predi−1(n) is true both before and after U occurred. In other words, the process takes only the nodes n that have RPi−1(n)=Predi−1after(n)=true.
Step 2-2 issues small source queries to evaluate Prediafter(n) for every node n in Λi. According to the results of these queries, Λi is partitioned into the two disjoint sequences T and F. Then, step 2-3 identifies the nodes of T that will be considered as direct additions at Ri.
The sequences of nodes to be added to and deleted from R due to the direct effects at every iteration are denoted as R+ and R−, respectively. These sequences are computed by steps 2-4 and 2-5 respectively. Conforming to the process of discovering indirect effects, step 2-4 issues a source query while step 2-5 uses only the auxiliary data. Instead of issuing a separate source query for every direct addition, step 2-4 uses a single query with a combined context sequence which incorporates all the direct additions at once; this should perform better than issuing many queries.
Finally, step 2-6 updates R by incorporating the nodes of R+ and R−. The maintenance process needs to maintain the auxiliary data as well as the cached result R. For every node n removed from R, ResultPath(n) is removed from the auxiliary data; and for every node n added to R, ResultPath(n) is added to the auxiliary data. Computing the result paths requires some cooperation from the source query processor: the query processor should return with every node n in the answer of the query in step 2-4 its result path ResultPath′(n). This result path is a partial path of length N−i<N because the query in step 2-4 uses only steps si+1, si+2, . . . , sN of the original expression. Thus, to get the full result path ResultPath(n), the process concatenates ResultPath′(n) to the right end of a second result path of length i. This second path is the one which led from a node in the original expression context C to the first node in ResultPath′(n); it can be found by tracing the sequences Λ0, Λ1, . . . Λi through the iterations 1, 2, . . . , i. For clarity of the presentation, this secondary process of maintaining the auxiliary data is not shown in the process of
The process of
In the tests, the system maintains one cached object (such as an XPath query result) and processes node updates one by one. For each update, the time required for incremental maintenance is compared with the time required for the full view recomputation.
The XMARK benchmark was used to generate source documents with two data sets of different sizes: Data set 1 (325236 nodes), and Data set 2 (1281843 nodes).
The XML data source was implemented using a relational database. The node ids were generated based on the OrdPATH scheme. Each node was represented as a row of a table with the following columns {id, type, label, value, parent_id} where id is a node identifier and type is a node type (element, attribute, or value). When type is “element”, label represents the element name. When type is “attribute”, label represents the attribute name, and value represents the attribute value. When type is “value”, value represents the data value. Although an OrdPATH node id contains information about the id of the parent node, a separate parent_id column is used to represent the ID of the parent for performance optimization. The tests were done using an Oracle 9i database on a PC with Linux 8.0, Pentium 4 1800 MHz CPU, and 1 GB memory.
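The node table described above can be illustrated with a small Python sketch. It uses an in-memory SQLite database in place of Oracle 9i, and the sample rows, the dotted node IDs, and the join query are hypothetical; they show only how a child-axis path can be expressed as a self-join on parent_id and are not the queries used in the tests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE x (
        id        TEXT PRIMARY KEY,  -- dotted, ORDPATH-like id (illustrative)
        type      TEXT,              -- 'element', 'attribute', or 'value'
        label     TEXT,              -- element or attribute name
        value     TEXT,              -- attribute value or data value
        parent_id TEXT               -- redundant with id, kept for fast joins
    )
""")
rows = [
    ("1",       "element",   "site", None,   None),
    ("1.1",     "element",   "item", None,   "1"),
    ("1.1.1",   "attribute", "id",   "i1",   "1.1"),
    ("1.1.3",   "element",   "name", None,   "1.1"),
    ("1.1.3.1", "value",     None,   "lamp", "1.1.3"),
    ("1.3",     "element",   "item", None,   "1"),
    ("1.3.1",   "element",   "name", None,   "1.3"),
    ("1.3.1.1", "value",     None,   "desk", "1.3.1"),
]
conn.executemany("INSERT INTO x VALUES (?, ?, ?, ?, ?)", rows)

# Child-axis path /site/item/name as a self-join on parent_id.
# Ordering by the text id is adequate only for this tiny sample.
name_ids = [r[0] for r in conn.execute("""
    SELECT n.id
    FROM x AS s
    JOIN x AS i ON i.parent_id = s.id
    JOIN x AS n ON n.parent_id = i.id
    WHERE s.label = 'site' AND s.parent_id IS NULL
      AND i.type = 'element' AND i.label = 'item'
      AND n.type = 'element' AND n.label = 'name'
    ORDER BY n.id
""")]
print(name_ids)
```

Each additional child step adds one more join on parent_id; a descendant step would instead need a prefix comparison on the node ID or a recursive query, which this sketch does not attempt.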
The following two XPath queries were used:
where “like” is a boolean predicate that corresponds to SQL's “like” operator.
The XPath Query 1 is implemented as the following SQL join query:
where “x” is the name of the table that contains the source nodes. Similarly, the XPath Query 2 is also implemented as a join query. The Predicate test query for the XPath query 1 is implemented as the following SQL query:
where ‘?’ represents a context node.
For each data set and query pair, 100 source updates were randomly generated. The average results for full query recomputation versus incremental maintenance are as follows:
The results of the time comparison for all the updates are shown in
The supported view specification language of path expressions is powerful enough for many applications. The size of the auxiliary data used is bounded as O(M * N), where M is the size of the cached result and N is the size of the view specification expression. The auxiliary data is compact and does not exceed this bound regardless of the complexity of the source XML tree and regardless of the complexity of the predicates used in the view specification path expression. The process delegates any predicate evaluation to the source query processor; the benefits of this delegation are two-fold: (1) No auxiliary data is kept for the evaluation of predicates; without this delegation, the size of the auxiliary data cannot be bounded. (2) The privacy of the predicate definitions is preserved since the cache manager need not know such definitions in order to maintain the views. This property is useful when the predicate definitions include proprietary functions that the data provider is not willing to reveal. For example, an XML web service provider would be able to use the XML caching system without disclosing its complex predicate definitions. The process does not depend on any schema for the source XML document; it can handle any general XML document. Regarding the efficiency of the maintenance process, the experimental results show that incrementally maintaining path expression views using the approach presented here is much faster than maintaining the views by recomputing the view specification query.
One embodiment of the view maintenance process is written as the following code:
The following example, together with the nodes of
In example 1-1, an update changes the price for node 04812 from $10 to $12 and result set does not change as follows:
In example 1-2, another update changes the price from $5.99 to $6.99 and the result set becomes “The Catcher in the Rye”01111, “Franny and Zooey”01131
In Example 1-3, another update changes the price from $6.99 to $5.99 and the result set in this case does not change.
Similarly, Examples 2, 2-1 and 2—are as follows:
Although the foregoing has focused on processing the two primitive update operations of adding and deleting leaf nodes, it can be more efficient to handle a complex update, such as adding or deleting subtrees, holistically rather than by decomposing it into the primitive operations. The process for the primitive updates can be extended to handle the complex updates of adding or deleting subtrees. In this case, U.path becomes a branch that ends with a subtree hanging from its last node; this subtree is the added or deleted subtree. The direct effects can be determined by applying the Axis&Label test and the Predicates test on this branch. Once the direct effects are discovered, the indirect ones can be discovered in the same way as described above.
Generally, source updates may occur simultaneously with the view maintenance process. Consider the following scenario: an update U1 occurs and is reported to the cache manager, so the cache manager initiates a view maintenance process to update the cached views according to U1. A new update U2 then occurs at the source before the source query processor processes the queries which the maintenance process of U1 is using to maintain the views. In this case, processing these queries at the source will include the effects of U2 as well as those of U1. When U2 is later reported to the cache manager, a new maintenance process will be initiated to maintain the views according to U2. This second maintenance process will typically need to issue queries to the source to maintain the views. However, it could take advantage of the fact that the effect of U2 has already been incorporated in the answers of the queries that were issued in response to U1. If such cases are detected, the view maintenance process could be made more efficient by reducing the number of source queries used to maintain the views. One embodiment to detect such cases is to use time-stamps for all the updates and the query answers received from the source; with that, the cache manager can determine which update effects have been incorporated in which answers.
Caching systems normally cache the results of multiple expressions. Upon receiving an update U, the presented maintenance algorithm can be run to maintain every expression separately. However, if many of these expressions have significant overlap in their structure, the process can maintain such collections collectively to improve efficiency. For example, efficiency can be gained by evaluating the predicates without source queries.
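The time-stamp embodiment for concurrent updates could be sketched along the following lines in Python. The sketch assumes that the source stamps both updates and query answers from a single monotonically increasing clock; the `Update` and `Answer` records, their fields, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    name: str
    ts: int        # source-assigned time-stamp of the update (assumed)


@dataclass(frozen=True)
class Answer:
    query: str
    ts: int        # source-assigned time-stamp of the query evaluation
    nodes: tuple   # node IDs returned by the source


def reflects(answer, update):
    """An answer evaluated at or after an update already includes its effect."""
    return update.ts <= answer.ts


def reusable_answers(answers, update):
    """Answers the cache manager need not re-request when handling `update`."""
    return [a for a in answers if reflects(a, update)]


u1 = Update("U1", ts=10)
u2 = Update("U2", ts=12)
answers = [
    Answer("q_a", ts=11, nodes=("C1",)),       # evaluated before U2 occurred
    Answer("q_b", ts=13, nodes=("C1", "C2")),  # evaluated after U2 occurred
]
print([a.query for a in reusable_answers(answers, u2)])
```

In this sketch only the answer to q_b already reflects U2, so the maintenance process for U2 could skip re-issuing that query while still re-issuing q_a.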
The invention has been described in terms of specific examples which are illustrative only and are not to be construed as limiting. The invention may be implemented in digital electronic circuitry or in computer hardware, firmware, software, or in combinations of them. Apparatus of the invention may be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor; and method steps of the invention may be performed by a computer processor executing a program to perform functions of the invention by operating on input data and generating output. Suitable processors include, by way of example, both general and special purpose microprocessors. Storage devices suitable for tangibly embodying computer program instructions include all forms of non-volatile memory including, but not limited to: semiconductor memory devices such as EPROM, EEPROM, and flash devices; magnetic disks (fixed, floppy, and removable); other magnetic media such as tape; optical media such as CD-ROM disks; and magneto-optic devices. Any of the foregoing may be supplemented by, or incorporated in, specially-designed application-specific integrated circuits (ASICs) or suitably programmed field programmable gate arrays (FPGAs).
From the foregoing disclosure and certain variations and modifications already disclosed therein for purposes of illustration, it will be evident to one skilled in the relevant art that the present inventive concept can be embodied in forms different from those described and it will be understood that the invention is intended to extend to such further variations. While the preferred forms of the invention have been shown in the drawings and described herein, the invention should not be construed as limited to the specific forms shown and described since variations of the preferred forms will be apparent to those skilled in the art. Thus the scope of the invention is defined by the following claims and their equivalents.