The present invention relates to a method and apparatus for processing XML data and, more particularly, for developing security views of information contained within a larger assembly or organization of such information.
XML (Extensible Markup Language) is rapidly emerging as the new standard for data representation and exchange on the Internet. As corporations and organizations increasingly employ the Internet as a means of improving business-transaction efficiency and productivity, it is increasingly common to find operational data and other business information in XML format. In light of the sensitive nature of such business information, securing XML content and ensuring the selective exposure of information to different classes of users based on their access privileges is important. Specifically, for an XML document T there may be multiple user groups who want to query the same document. For these user groups, different access policies may be imposed, specifying which elements of T the users are granted access to.
Access control models for XML data have been proposed; however, these models suffer from various limitations. For example, such models may reject proper queries and access, incur costly runtime security checks for queries, require expensive view materialization and maintenance, or complicate integrity maintenance by annotating the underlying data. More specifically, for a number of different users having correspondingly different access policies, each node in the XML document (i.e., the actual XML data) would have to be annotated to define, for such users, the various levels of access allowed based on their individual user profiles. While such annotating may be easily performed if there are only a few user groups, annotating becomes increasingly complex as the number of user groups and corresponding access policies increases. There is also an undesirable possibility of generating errors in the XML document or in the XML data during the annotation process. Maintenance costs of the XML data also increase if it is desired to modify a document at some point in the future. For example, adding a subtree of new elements to the XML data will require further annotating for each of the existing user groups, again with the possibility of errors being generated in the data during this process.
Additionally, and with regard to user views, it is conceivable that many hundreds or possibly thousands of different views must be generated to satisfy all of the combinations of queries and users that the XML document serves. Such views are costly to prepare and maintain, and they expose the specific XML data (which may be subject to tampering or error generation) as a result of view usage. Additionally, users are not provided with the exact structure of the data. As such, they do not know how to properly formulate a query, which results in an overall inefficient system for storing, maintaining and subsequently accessing data. A more subtle problem is that none of these earlier models provides users with a Document Type Definition (DTD) characterizing the information that users are allowed to access. Some models expose the full document DTD to all users, and make it possible to employ (seemingly secure) queries to infer information that the access control policy was meant to protect. Accordingly, there is a need to provide access to XML data of an XML document without corrupting or otherwise changing the XML data and to provide suitable query interaction with such data.
Various deficiencies of the prior art are addressed by the present invention of a method for providing controlled access to an XML document by defining at least one access control policy for a user of the XML document and deriving a security view of the XML document for the user based upon said access control policy and schema level processing of the XML document. The invention also includes a step of translating a user query based on the security view of the XML document to an equivalent query based on the XML document.
Deriving a security view includes invoking a first sub process that determines if a first accessible element type of an XML document DTD representing said XML document has been previously processed. If the first accessible element type has not been previously processed, then the first sub process performs the steps of computing a query annotation for each child element in a production rule of the first accessible element type, computing a view production rule for the first accessible element type in a view DTD representing an accessible portion of the XML document, and computing a security view for each child element in the production rule of the first accessible element type. Computing a security view for each child element in the production rule of the first accessible element type includes invoking a second sub process if a child element in the production rule of the first accessible element type is inaccessible; otherwise, the first sub process is invoked for said child element. Translating the user query based on the security view of the XML document includes iteratively computing at least one local translation corresponding to at least one subquery of the first accessible element type that is part of the user query. The method can be practiced by a computer readable medium containing a program which, when executed, performs these operations.
Additionally, the invention includes an apparatus for performing an operation of securely providing access to XML data of an XML document that includes means for defining an access control policy for a user of the XML document and means for deriving a security view of the XML document for the user based on said access control policy and schema level processing of the XML document. The apparatus also includes means for translating a user query based on the security view of the XML document to an equivalent query based on the XML document.
The means for defining the access control policy includes an access specification that annotates a document DTD representing the XML document. Such an access specification can be derived by a database manager of the XML document. The means for deriving a security view of the XML document for the user includes a security view definition that defines query annotations in a document DTD representing the XML document. The means for translating a user query based on the security view of the XML document to an equivalent query based on the XML document includes a query evaluator that maps one or more nodes in the security view to corresponding one or more nodes in the document DTD representing the XML document. In this way, access of specific information in the XML document is provided only to those having the proper access specification and corresponding view without having to annotate or otherwise process the actual data in the XML document.
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
This invention will be described within the context of Extensible Markup Language (XML). Consider an XML document T having any number of data elements arranged therein. A Document Type Definition (DTD) D is associated with T which governs the organization or exact structure of the data (also referred to as schema information). Multiple access control policies are declared over T at the same time, each specifying, for a class of users, what elements in T the users are granted, denied, or conditionally granted access to. A language is defined for specifying fine-grained access control policies. An access specification S expressed in the language is an extension of the document DTD D associating element types with security annotations (i.e., XPath qualifiers), which specify structure- and content-based accessibility of the corresponding elements of these types in T. Since the primary concern is with querying XML data, the specification language adopts a simple syntax instead of the conventional (subject, object, operation) syntax.
An access specification S is enforced through an automatically-derived security view V=(Dv,σ), where Dv is a view DTD and σ is a function defined via XPath queries. The view DTD Dv exposes only accessible data with respect to S, and is provided to users authorized by S so that they can formulate their queries over the view. The function σ is transparent to authorized users, and is used to extract accessible data from T. The only structural information about T that the users are aware of is Dv, and no information beyond the view can be inferred from user queries. Thus, the security views support both access/inference control and schema availability. An efficient algorithm is provided that, given an access specification S, derives a security view definition V, i.e., V characterizing all and only those accessible elements of T with respect to S, based on schema level processing of the DTD D rather than merely annotating data within the document T.
Accordingly, an access control model 100 based on security views for an XML document 104 is presented and conceptually depicted in
The concepts of the subject invention are best realized when considering the following specification concurrently with the figures as follows. For example,
ann(A,B)::=Y|[q]|N,
where [q] is a qualifier in a fragment C of XPath. Intuitively, a value of Y, [q], or N for ann (A,B) indicates that the B children of A elements in an instantiation of D are accessible, conditionally accessible, and inaccessible, respectively. If ann (A,B) is not explicitly defined, then B inherits the accessibility of A. On the other hand, if ann (A,B) is explicitly defined it may override the accessibility of A. The root of D is annotated Y by default. This specification is depicted in
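The inheritance and override rules for ann(A,B) can be illustrated with a minimal Python sketch. The function name `effective_annotation`, the dictionary encoding of ann, and the small Adex-like fragment are assumptions made for illustration only. The sketch works purely at the schema level: it returns the nearest explicit annotation on a root-to-node path of element types and does not evaluate qualifiers.

```python
def effective_annotation(ann, path):
    """Return 'Y', 'N', or a qualifier string for the last type on `path`.

    `ann` maps (parent_type, child_type) to 'Y', 'N', or '[q]'.
    `path` is a root-to-node list of element types in the DTD.
    """
    current = "Y"  # the root of D is annotated Y by default
    for parent, child in zip(path, path[1:]):
        # An explicit ann(A,B) overrides; otherwise B inherits from A.
        current = ann.get((parent, child), current)
    return current


# Hypothetical specification: hide head, but re-expose buyer-info below it.
ann = {("adex", "head"): "N", ("head", "buyer-info"): "Y"}
hidden = effective_annotation(ann, ["adex", "head"])
shown = effective_annotation(ann, ["adex", "head", "buyer-info", "contact-info"])
```

Here `hidden` is the annotation of head and `shown` is the annotation that contact-info inherits from buyer-info. A qualifier returned by this sketch would still have to be evaluated against the document at the element where it was declared.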
For an XML instance T of a DTD D, an access specification S=(D, ann) can be easily defined, e.g., using a simple GUI tool over D's DTD graph. Furthermore, S unambiguously defines the accessibility of document nodes in T. To see this, note that DTD D must be unambiguous by the XML standard. Since T is an instance of D, this implies that each B element υ of T has a unique parent A element and a unique production that “parses” the A subtree; thus, υ's accessibility ann (υ) can be defined to be exactly the ann (A,B) associated with the production for A. We say that υ is accessible with respect to S if and only if either
A security view 300 defines a mapping from instances of a document DTD D to instances of a view DTD Dυ that is automatically derived from a given access specification 400. Let S=(D,ann) be an access specification. A security view definition (or simply a security view) V from S to a view DTD Dυ, denoted by V:S→Dυ, is defined as a pair V=(Dυ,σ), where σ defines XPath query annotations used to extract accessible data from an instance T of D. Specifically, for each production A→α in Dυ and each element type B in α, σ(A,B) is an XPath query (in our class C) defined over document instances of D such that, given an A element, σ(A,B) generates its B sub elements in the view by extracting data from the document. A special case is the unary parameter usage with σ(rυ)=r, where rυ is the root type of Dυ and r is the root of D, i.e., σ maps the root of T to the root of its view.
The semantics of a security view definition V:S→Dυ are given by presenting a materialization strategy for V. Given an instance T of the document DTD, a view of T (denoted by Tυ) is built that conforms to the view DTD Dυ and consists of all and only accessible nodes of T with respect to S. A top-down computation is performed by first extracting the root of T and treating it as the root of Tυ, and then iteratively expanding the partial tree by generating the children of current leaf nodes. Specifically, in each iteration each leaf υ is inspected. Assume that the element type of υ is A and that the A production in Dυ is P(A)=A→α. The children of υ are generated by extracting nodes from T via the XPath annotation σ(A,B) for each child type B in α. The computation is based on the structure of production P(A) as follows:
(1) P(A)=A→ε. Then, nothing needs to be done.
(2) P(A)=A→str. Then, the query p=σ(A,str) is evaluated at context node υ in T. If υ[[p]] returns a single text node in T that is accessible with respect to S, then the text node is treated as the only child of υ; otherwise, the computation aborts.
(3) P(A)=A→B1, . . . , Bn. Then, for each i ∈ [1,n], the query pi=σ(A,Bi) is evaluated at context node υ in T. If for all i ∈ [1,n], υ[[pi]] returns a single node υi accessible with respect to S, then υi is treated as the Bi child of υ; otherwise, the computation aborts.
(4) P(A)=A→B1+ . . . +Bn. Then, for each i ∈ [1,n], the XPath query pi=σ(A, Bi) is evaluated at context node υ in T. If there exists one and only one i ∈ [1,n] such that υ[[pi]] returns a single node accessible with respect to S, then the node is treated as the single child of υ; otherwise, the computation aborts.
(5) P(A)=A→B*. Then, the query p=σ(A,B) is evaluated at context node υ in T. All the nodes in υ[[p]] accessible with respect to S are treated as the B children of υ, ordered by the document order of T. Note that, if υ[[p]] is empty, no children of υ are created.
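The five cases above can be sketched in Python using the standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. This is an illustrative sketch under stated assumptions, not the claimed implementation: the names `materialize` and `Abort`, the tuple encoding of productions, and the sample document are hypothetical; σ(A,str) is modelled as a path to the element whose text is copied; and the σ paths are assumed to select only nodes that are accessible with respect to S.

```python
import xml.etree.ElementTree as ET


class Abort(Exception):
    """Raised when the document does not fit the view production."""


def materialize(node, a, view_dtd, sigma):
    """Build the view subtree of type `a` for document node `node`."""
    kind, kids = view_dtd[a]
    out = ET.Element(a)
    if kind == "empty":                          # case (1): A -> epsilon
        return out
    if kind == "str":                            # case (2): A -> str
        hits = node.findall(sigma[(a, "str")])
        if len(hits) != 1:
            raise Abort(a)
        out.text = hits[0].text
        return out
    if kind == "seq":                            # case (3): A -> B1, ..., Bn
        for b in kids:
            hits = node.findall(sigma[(a, b)])
            if len(hits) != 1:
                raise Abort(a + "/" + b)
            out.append(materialize(hits[0], b, view_dtd, sigma))
        return out
    if kind == "choice":                         # case (4): A -> B1 + ... + Bn
        found = []
        for b in kids:
            hits = node.findall(sigma[(a, b)])
            if len(hits) == 1:
                found.append((b, hits[0]))
        if len(found) != 1:
            raise Abort(a)
        b, child = found[0]
        out.append(materialize(child, b, view_dtd, sigma))
        return out
    if kind == "star":                           # case (5): A -> B*
        (b,) = kids
        for child in node.findall(sigma[(a, b)]):  # document order
            out.append(materialize(child, b, view_dtd, sigma))
        return out
    raise ValueError(kind)


doc = ET.fromstring(
    "<dept><patient><name>Ann</name><ssn>123</ssn></patient>"
    "<patient><name>Bob</name><ssn>456</ssn></patient></dept>"
)
# Hypothetical view DTD that hides ssn.
view_dtd = {"dept": ("star", ["patient"]),
            "patient": ("seq", ["name"]),
            "name": ("str", [])}
sigma = {("dept", "patient"): "patient",
         ("patient", "name"): "name",
         ("name", "str"): "."}
view = materialize(doc, "dept", view_dtd, sigma)
```

In this example the resulting `view` keeps both patient elements with their name children and contains no ssn element.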
A novel algorithm (termed “derive”) is presented that, given an access specification S=(D,ann), automatically computes a security view definition V=(Dυ,σ) with respect to S such that, for any instance T of the document DTD, if the computation of Tυ terminates (i.e., does not abort), it comprises all and only accessible elements of T with respect to S. One embodiment of algorithm “derive” is shown in
(a) if B is accessible, then pB is simply ‘B’ (steps 6,7);
(b) if B is conditionally accessible (i.e., ann(A,B)=[q]), then pB is ‘B’[q], i.e., qualifiers in S are preserved (steps 8, 9); and,
(c) if B is inaccessible, then the algorithm either prunes the entire inaccessible subgraph below B if B does not have any accessible descendants (step 11), or ‘shortcuts’ B by treating the accessible descendants of B as children of A if this does not violate the DTD-schema form of Section 2 (steps 12-15), or renames B to a “dummy” label to hide the label B while retaining the DTD structure and semantics (steps 16-20). Children of the B node are then processed in the same manner. In this way, the resultant view DTD Dυ preserves the structure and semantics of the relevant and accessible parts of the original document DTD.
The procedure Proc_InAcc(S,A) processes an inaccessible node A in a similar manner. One difference is that it computes (1) reg(A) instead of α in the A-production A→α in the view DTD Dυ, and (2) path [A,B] for each element type B in reg(A) rather than σ(A,B). Intuitively, reg(A) is a regular expression identifying all the closest accessible descendants of A in D, and path [A,B] stores the XPath query that captures the paths from A to B in the document DTD. Another difference concerns the treatment of recursive nodes. If an inaccessible A is encountered again in the computation of Proc_InAcc(S,A), then A is renamed to a dummy label and retained in the regular expression returned.
To efficiently compute V, Algorithm “derive” associates two Boolean variables visited[A, acc] and visited[A, inacc] (initially false) with each element type A in the document DTD D. These variables indicate whether A has already been processed as an accessible or inaccessible node, respectively, to ensure that each element type of D is processed only once in each case. In light of this, the algorithm takes at most O(|D|²) time, where |D| is the size of the document DTD.
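The interplay of the accessible and inaccessible cases can be sketched in Python under strong simplifying assumptions. The sketch handles only non-recursive DTDs, reduces each production to a plain list of child types (so the regular-expression structure of reg(A) is not modelled), always uses the ‘shortcut’ strategy for inaccessible nodes, and never introduces dummy labels. The names `derive`, `proc_acc` and `closest_accessible`, and the Adex-like DTD fragment, are hypothetical.

```python
def derive(dtd, ann, root):
    """Return (view_dtd, sigma) for a non-recursive DTD {type: [child types]}."""
    view, sigma, done = {}, {}, set()

    def add(a, b, path):
        # Record b as a view child of a; take a union if b is reachable twice.
        if (a, b) in sigma:
            sigma[(a, b)] += " | " + path
        else:
            sigma[(a, b)] = path
            view[a].append(b)

    def closest_accessible(a):
        # For an inaccessible a: (type, path) of its closest accessible descendants.
        for b in dtd.get(a, []):
            label = ann.get((a, b), "N")      # b inherits inaccessibility
            if label == "N":
                for c, path in closest_accessible(b):
                    yield c, b + "/" + path
            else:
                yield b, (b if label == "Y" else b + label)

    def proc_acc(a):
        if a in done:                          # each type is processed once
            return
        done.add(a)
        view[a] = []
        for b in dtd.get(a, []):
            label = ann.get((a, b), "Y")      # b inherits accessibility
            if label == "N":                   # shortcut the hidden node b
                targets = [(c, b + "/" + p) for c, p in closest_accessible(b)]
            else:                              # keep b, preserving any qualifier
                targets = [(b, b if label == "Y" else b + label)]
            for c, path in targets:
                add(a, c, path)
                proc_acc(c)

    proc_acc(root)
    return view, sigma


# Hypothetical fragment of a document DTD and its access specification.
dtd = {"adex": ["head", "body"],
       "head": ["buyer-info"],
       "buyer-info": ["company-id", "contact-info"],
       "body": ["adinstance"],
       "adinstance": ["real-estate"],
       "real-estate": ["house", "apartment"]}
ann = {("adex", "head"): "N", ("adex", "body"): "N",
       ("head", "buyer-info"): "Y", ("adinstance", "real-estate"): "Y"}
view, sigma = derive(dtd, ann, "adex")
```

For this fragment the view root adex has buyer-info and real-estate as its children, and σ records the hidden document paths leading to them, so that the labels head, body and adinstance never appear in the view DTD.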
A more general depiction of the inventive concept is shown in
If the element type has not been previously processed, the method moves to step 1006 where a first computation is performed. Specifically, a query annotation (for example, denoted by the function σ) is computed for each child element Bi in the production rule for the element type A currently being processed. In one particular example, the query annotation is an XPath query annotation. Once the query annotation is computed, the method proceeds to step 1008 to compute a view production rule Pv(A) for the element type A in the view DTD Dv. Once the computation of the view production rule is completed, the method moves to step 1010 where a security view for each child element Bi in the production rule for A is computed. In one embodiment of the invention, this computation is performed by invoking a process for inaccessible nodes if the child element Bi is inaccessible (with respect to A); otherwise, the accessible element procedure for such Bi is called. After the security view is computed for each element Bi, the method ends at step 1012.
If the answer to the query is no, the method moves to step 1106 where a path for each child element Bi in the production of A is computed. Particularly, and in one embodiment of the invention, the path is computed as path [A, Bi], which is a value that stores the XPath query that captures the paths from A to Bi in the document DTD as discussed previously. Once the path has been computed, the method moves to step 1108 where a regular expression for A is computed. More specifically, and as previously discussed, the value reg [A] is computed instead of α (reg[A] is defined as a regular expression identifying all the closest accessible descendants of A in D). Once the regular expression for A has been computed, the method moves to step 1110 where the security view for each child element Bi in the production rule for A is computed. Specifically, in one embodiment of the invention, the security view is computed by calling Proc_InAcc if such child element Bi is inaccessible with respect to A; otherwise, Proc_Acc is called for Bi. Once the security view for each child element Bi is computed, the method ends at step 1112.
Once an access policy is determined, and a corresponding security view is derived for a particular user or user group, such user or user group can pose a query on the security view. The query allows the user to access information in the DTD according to such access policy without reviewing information that the user is not allowed to have access to. Further, in accordance with the subject invention, the actual data in the DTD or XML document is not accessed or otherwise made available to the user, which avoids unauthorized tampering with, or otherwise error-creating access to, the information. This is accomplished by the novel method of query rewriting. That is, given a query p over the security view, p is automatically transformed to another XPath query pt over the document DTD D such that, for any instance T of D, p over Tυ and pt over T yield the same answer. In other words, p over the view is equivalent to pt over the original document (i.e., pt(T)=p(Tυ)). This eliminates the need for materializing views and the associated problems.
Specifically, given a query p over the view DTD Dυ, a rewriting algorithm “evaluates” p over the DTD graph Dυ. For each node A reached via p from the root r of Dυ, every label path leading to A from r is rewritten by incorporating the security-view annotations σ along the path. As σ maps view nodes to document nodes, this yields a query pt over the document DTD D.
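The equivalence pt(T)=p(Tυ) can be checked on a toy instance with Python's standard `xml.etree.ElementTree`. The document, the view with a single annotation σ(dept,name), and the variable names are all hypothetical. The view is materialized here only to compare the two answers; the point of the rewriting is that this step is not needed in practice.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<dept>"
    "<patient><name>Ann</name><ssn>123</ssn></patient>"
    "<patient><name>Bob</name><ssn>456</ssn></patient>"
    "</dept>"
)

# Hypothetical view DTD: dept -> name*, with sigma(dept, name) = "patient/name".
sigma = {("dept", "name"): "patient/name"}

# Materialized view T_v, built only so that the two answers can be compared.
view = ET.Element("dept")
for n in doc.findall(sigma[("dept", "name")]):
    ET.SubElement(view, "name").text = n.text

p = "name"                      # user query posed over the view
p_t = sigma[("dept", "name")]   # its translation over the document

answer_on_view = [n.text for n in view.findall(p)]
answer_on_doc = [n.text for n in doc.findall(p_t)]
```

Both lists hold the same names in the same order, while the ssn elements are never touched by either query.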
To implement this idea, the algorithm works over the hierarchical, parse-tree representation of the view query p and uses the following set of variables. For any sub-query p′ of p and each node A in Dυ, rw(p′,A) is used to denote the local translation of p′ at A, i.e., a query over D that is equivalent to p′ when p′ is evaluated at a context node A. Thus, rw(p,r)=pt is what the algorithm needs to compute. Reach (p′,A) is also used to denote the nodes in Dυ that are reachable from A via p′. Finally, N is used to denote the list of all the nodes in Dυ, and Q to denote the list of all sub-queries of p in “ascending” order, such that all sub-queries of p′ (i.e., its descendants in p's parse tree) precede p′ in Q.
Given the above, one embodiment of this Algorithm is identified as “Rewrite” and is presented in
In one embodiment of the method for query rewriting, the algorithm is generally shown as a series of method steps 1200 in
Once the initializations are performed, the method proceeds to step 1206 where a first sub process is called to compute a variable reach(//,A) for each node A in the view DTD. reach(//,A) is the set of descendant nodes of A in the view DTD Dv. The method then proceeds to step 1208 where the value of A is initialized to be the first node in N. The method then proceeds to step 1210 where the values of rw(p′,A) and reach(p′,A) are computed based on the type of sub-query p′.
Once those values are computed, the method moves to step 1212 where an inquiry is made as to whether a next node A from the sequence of nodes N is available. If the answer to the inquiry is yes, the method loops back to step 1210 where values for rw and reach are computed for the next node A. If the answer to the inquiry is no, the method moves to step 1214 where another inquiry is posed. Specifically, if there is a next sub-query p′ in the sequence of sub-queries Q, then the method loops back to step 1208 to reinitialize A as the first node in N. If the answer to the inquiry is no, the method proceeds to step 1216 where the equivalent query pt is assigned the value of rw(p,r), where r is the root of the view DTD Dv. The method ends at step 1218.
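The computation of local translations can be sketched in Python for a small query fragment. This is a simplified variant rather than the algorithm of the figure: it supports only child steps, concatenation and union, it uses memoized recursion instead of the explicit loops over Q and N, and it folds rw(p′,A) and reach(p′,A) into one table that maps each reached view type to a document path. The names `rw`, `translate` and `_paren`, the tuple encoding of queries, and the sample view are assumptions.

```python
def _paren(path):
    # Parenthesize unions so that they can be composed with '/'.
    return "(" + path + ")" if " | " in path else path


def rw(p, a, view_dtd, sigma, memo=None):
    """Translate view query `p` at view type `a`.

    Returns {reached view type: document path}. Queries are nested tuples:
    ("step", B), ("seq", p1, p2) or ("union", p1, p2).
    """
    memo = {} if memo is None else memo
    if (p, a) in memo:                 # each (sub-query, node) pair is done once
        return memo[(p, a)]
    out = {}

    def put(b, path):
        out[b] = path if b not in out else out[b] + " | " + path

    if p[0] == "step":
        if p[1] in view_dtd.get(a, []):
            put(p[1], sigma[(a, p[1])])        # replace the view edge by sigma
    elif p[0] == "seq":
        for b, q1 in rw(p[1], a, view_dtd, sigma, memo).items():
            for c, q2 in rw(p[2], b, view_dtd, sigma, memo).items():
                put(c, _paren(q1) + "/" + _paren(q2))
    elif p[0] == "union":
        for sub in p[1:]:
            for b, q in rw(sub, a, view_dtd, sigma, memo).items():
                put(b, q)
    memo[(p, a)] = out
    return out


def translate(p, root, view_dtd, sigma):
    """Return p_t = rw(p, r) as an absolute path, or None if p selects nothing."""
    paths = list(rw(p, root, view_dtd, sigma).values())
    if not paths:
        return None
    return "/" + root + "/" + _paren(" | ".join(paths))


# Hypothetical security view over an Adex-like document DTD.
view_dtd = {"adex": ["buyer-info", "real-estate"],
            "buyer-info": ["company-id", "contact-info"]}
sigma = {("adex", "buyer-info"): "head/buyer-info",
         ("adex", "real-estate"): "body/adinstance/real-estate",
         ("buyer-info", "company-id"): "company-id",
         ("buyer-info", "contact-info"): "contact-info"}
query = ("seq", ("step", "buyer-info"), ("step", "contact-info"))
p_t = translate(query, "adex", view_dtd, sigma)
```

For this view the query buyer-info/contact-info is translated to /adex/head/buyer-info/contact-info, whereas a step naming a hidden type such as head yields no translation at all.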
Earlier per step 1206 a first subroutine was introduced that computes the value reach (//,A). This particular subroutine in one embodiment of the invention is identified as algorithm “recProc” and is shown as a series of method steps 1300 in
As discussed earlier with respect to step 1306 of algorithm recProc above, the second sub process, which computes reach and recrw, is in one embodiment of the invention a series of method steps 1400 as shown in
If the answer to the inquiry at step 1408 is no, that is that node Y has not previously been processed, then the method proceeds to step 1410 where the parameter reach (//,X) is updated and then the subject Algorithm Traverse is called again with respect to child node Y of the presently processed node X. The parameter reach (//,X) represents all the descendant nodes in the view DTD that are reachable from X with an additional node Y.
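The reachability part of this traversal can be sketched in Python. The sketch computes only reach(//,X), not recrw, and uses an explicit stack instead of the recursive call described above; the function name `reach_descendants` and the small recursive DTD are hypothetical.

```python
def reach_descendants(view_dtd, x):
    """Return the set of view types reachable from `x` by one or more edges."""
    reach = set()
    stack = [x]
    while stack:
        current = stack.pop()
        for y in view_dtd.get(current, []):
            if y not in reach:       # Y has not been processed before
                reach.add(y)         # update reach(//, X) with Y
                stack.append(y)      # then continue the traversal from Y
    return reach


# Hypothetical recursive view DTD: a -> b, b -> a, c.
recursive_dtd = {"a": ["b"], "b": ["a", "c"], "c": []}
from_a = reach_descendants(recursive_dtd, "a")
from_c = reach_descendants(recursive_dtd, "c")
```

Because each type is added to the set at most once, the traversal terminates on recursive DTDs; in the example, a is its own descendant through b.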
Query rewriting becomes more intriguing when the view DTD is recursive. For example, consider the view DTD 704 shown in
A solution to this problem is by unfolding recursive nodes. Unfolding a recursive DTD node A is defined as creating distinct children for A following the A production. Referring to
As presented earlier, the rewriting algorithm transforms a query over a security view to an equivalent query over the original document. However, the rewritten query may not be efficient. Accordingly, query optimization in the presence of a DTD D is considered. In other words, given an XPath query p, find another query po such that over any instance T of D,
(1) p and po are equivalent, i.e., p(T)=po(T); and
(2) po is more efficient than p, i.e., po(T) takes less time/space to compute than p(T). This is not only important in our access control model where queries generated by Algorithm “Rewrite” are optimized using the document DTD, but is also useful for query evaluation beyond the security context.
Algorithm “Optimize” is shown in one embodiment in
(1) For each sub-query p′ of p and each type A in the DTD D, opt (p′,A) denotes optimized p′ at A, i.e., a query equivalent to but more efficient than p′ when being evaluated at an A element. The variable is initially ‘⊥’ indicating that opt(p′,A) is not yet defined, which ensures that each sub-query is processed at each DTD node at most once.
(2) reach (p′,A) is the set of nodes in D reachable from A via p′, with an initial value ∅.
(3) image (p′,A) is the image graph of p′ at A.
The algorithm also invokes the following procedures:
(1) recProc(A,B) is a mild variation of the version given in
(2) simulate(image (p1,A), image (p2,A)) checks whether image (p1,A) is simulated by image (p2,A), as described earlier.
(3) evaluate([q],A) evaluates a qualifier q at A by exploiting the DTD constraints, as given earlier.
A general description of Algorithm Optimize is seen as a series of method steps 1500 in
Experimental results demonstrate both the efficiency of the subject query rewriting approach over a straightforward query rewriting approach (that is based on element-level security annotations) and the benefits of the subject optimization techniques, particularly for large documents. Specifically, the subject query rewriting approach can achieve an improvement of up to a factor of 40 over naive query rewriting, which can be further improved by up to a factor of 2 using the subject optimization algorithm. Experimental data sets were generated with the real-life Adex DTD, which is a standard proposed by the Newspaper Association of America for electronic exchange of classified advertisements. XML documents were generated using IBM's XML Generator tool by varying the maximum branching factor parameter to obtain four documents: D1 (3.2 MB), D2 (16.7 MB), D3 (51.55 MB), and D4 (77.0 MB). For the Adex DTD, a security view was created for a user who is permitted to access only data related to real estate advertisements and data related to buyers. This security view is created by simply annotating the children of the root element adex as “N” and both the real-estate and buyer-info descendants as “Y” in the Adex DTD. The following four XPath queries on the Adex security view were considered:
Three different approaches (naive, rewrite, optimize) were compared in these experiments, all of which are based on the use of security views for querying. The first (“naïve”) approach, which does not use DTD for query rewriting, requires the data documents to be annotated with additional element accessibility information and works as follows. A new attribute called “accessibility” is defined for each element in the XML document which is used to store the accessibility value of that element. The naive approach uses two simple rules to rewrite an input query to ensure that (a) it accesses only authorized elements and (b) it is converted to a query over the document. The first rule adds the qualifier [@accessibility=“1”] to the last step of the query to ensure (a). The second rule replaces each child axis in the query with the descendant axis to ensure (b). The second rule is necessary since an edge in a security view DTD can represent some path in the document DTD. Thus, the naive approach represents a simple rewriting approach that relies on element-level annotations instead of DTD for query rewriting. The second (“rewrite”) approach is the subject method of rewriting queries using DTD. The third (“optimize”) approach is an enhancement of the second approach that further optimizes the rewritten queries using the subject optimizations. To compare the performance of the three approaches, a state-of-the-art XPath evaluation implementation was used that has been shown to be more efficient and scalable than several existing XPath evaluators. The experiments were conducted on a 2.4 GHz Intel Pentium IV machine with 512 MB of main memory running Microsoft Windows XP.
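The two rules of the naive approach can be sketched as a short Python function. The function name `naive_rewrite` and the input query are hypothetical, and the sketch assumes a single path with no qualifiers and no unions, so it does not cover every query form used in the experiments.

```python
import re


def naive_rewrite(query):
    """Apply the two naive rules to a simple path query."""
    # Second rule: replace each child axis '/' with the descendant axis '//'.
    widened = re.sub(r"(?<!/)/(?!/)", "//", query)
    # First rule: restrict the last step to elements marked as accessible.
    return widened + '[@accessibility="1"]'


example = naive_rewrite("//buyer-info/contact-info")
```

The regular expression matches only a slash that is neither preceded nor followed by another slash, so existing descendant steps are left unchanged.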
The experimental results are shown in Table 1, where each row compares the query evaluation time (in seconds) of the naive, rewrite, and optimize approaches for a given document and query. Queries that cannot be further improved by the optimize approach are indicated with a “−” value under the optimize column.
The naive approach evaluates Q1 as //buyer-info//contact-info[@accessibility=“1”], while the rewrite approach utilizes the DTD to expand Q1 into a more precise query /adex/head/buyer-info/contact-info.
The naive approach rewrites Q2 to //house//r-e.warranty [@accessibility=“1”]| //apartment//r-e.warranty [@accessibility=“1”] while the rewrite approach expands the query to /adex/body/adinstance/real-estate/house/r-e.warranty. Note that the rewrite approach has simplified the second sub-expression to empty since the r-e.warranty element is not a sub-element of apartment.
The naive approach evaluates Q3 as //buyer-info[//company-id and //contact-info][@accessibility=“1”], while the rewrite approach expands the query to /adex/head/buyer-info[company-id and contact-info]. The optimize approach further exploits the co-existence constraint that each buyer-info element has both company-id and contact-info sub-elements to simplify the rewritten query to /adex/head/buyer-info.
Query Q4 shows the benefit of exploiting the exclusive constraint. The rewrite approach expands the query to /adex/body/adinstance/real-estate [house/r-e.asking-price and apartment/r-e.unittype], which is further refined by the optimize approach to an empty query since the real-estate element cannot have both house and apartment sub-elements; thus the evaluation of Q4 can be avoided.
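The use of co-existence and exclusive constraints to decide a qualifier at the schema level can be sketched in Python. The function name `evaluate_conjunction`, the tuple encoding of productions, and the two productions themselves are assumptions inferred from the constraints described for Q3 and Q4; the sketch only handles a conjunction of child labels and returns None when the DTD alone cannot decide.

```python
def evaluate_conjunction(dtd, a, labels):
    """Decide the qualifier [l1 and l2 and ...] at type `a` from the DTD alone.

    Returns True (always holds), False (never holds) or None (data-dependent).
    `dtd` maps a type to (kind, child types), kind in {"seq", "choice", "star"}.
    """
    kind, kids = dtd[a]
    wanted = set(labels)
    if not wanted <= set(kids):
        return False          # some label can never occur as a child of a
    if kind == "seq":
        return True           # co-existence: every child is always present
    if kind == "choice" and len(wanted) > 1:
        return False          # exclusive: only one alternative can be present
    return None               # e.g. a starred child may or may not occur


# Hypothetical productions consistent with the constraints described above.
dtd = {"buyer-info": ("seq", ["company-id", "contact-info"]),
       "real-estate": ("choice", ["house", "apartment"])}
q3 = evaluate_conjunction(dtd, "buyer-info", ["company-id", "contact-info"])
q4 = evaluate_conjunction(dtd, "real-estate", ["house", "apartment"])
```

A True result lets the qualifier be dropped, as for Q3, and a False result lets the whole query be replaced by the empty query, as for Q4. For Q4 only the first steps house and apartment of the two qualifier paths are passed in, since a contradiction at the first step already makes the conjunction unsatisfiable.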
Overall, the experimental results demonstrate the effectiveness of the proposed query rewriting technique for processing secured XML queries. The results also emphasize the importance of using DTD constraints to optimize the evaluation of XPath queries on large XML documents. Given these, Algorithm Optimize (D,A,p) rewrites query p at A elements based on the structures of p and A. It recursively prunes redundant sub-queries of p by exploiting the structural constraints of the DTD D.
Several embodiments of the present invention are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the present invention are covered by the above teachings and within the purview of the following claims without departing from the spirit and intended scope of the invention.
Number | Date | Country | |
---|---|---|---|
20060143557 A1 | Jun 2006 | US |