1. Field of the Invention
The present invention is directed generally to a method of structuring and compressing labeled trees of arbitrary degree and shape for optimal succinctness. The method includes a transform for compressing and indexing tree-shaped data. More specifically, the present invention includes a transform that uses path sorting and grouping to linearize labeled tree-shaped data into two coordinated arrays, one capturing the structure of the tree and the other capturing the labels of the tree. The present invention also may include performing navigational operations on the labeled tree-shaped data after the data has been transformed and possibly compressed.
2. Background of the Technology
Labeled trees are used for representing data or computation in computer applications including applications of tries, dictionaries, parse trees, suffix trees, and pixel trees. Trees are also used in compiler intermediate representations, execution traces, and mathematical proofs. XML also uses a tree representation of data where each node has string labels.
Consider a rooted, ordered, static tree data structure T on t nodes, where each node u has a label drawn from the alphabet Σ. The children of node u are ranked, that is, they have a left-to-right order. Tree T may be of arbitrary degree and of arbitrary shape. Basic navigational operations are important on tree data structures, such as finding the parent of u (denoted parent(u)), the i-th child of u (denoted child(u, i)), and any child of u with label α (denoted child(u, α)).
Initially, a solution for navigational operations was to represent the tree using a mixture of pointers and arrays using a total of O(t) RAM words each of size O(log t), which trivially supports such navigational operations in O(1) time taking a total of O(t log t) bits. However, these pointer based tree representations are wasteful in space.
Jacobson introduced the notion of succinct data structures, that is data structures that use space close to their information-theoretic lower bound and yet support various operations efficiently. Succinct data structures are distinct from simply compressing the input to be uncompressed later. See G. Jacobson, Space-efficient Static Trees and Graphs, FOCS 1989, 549-554, the contents of which are incorporated herein by reference.
Jacobson initiated this area of research with the special case of unlabeled trees, considering the structure of the trees but not the labels. The number of binary (unlabeled) trees on t nodes is Ct=binom(2t+1, t)/(2t+1), where binom(2t+1, t) denotes the binomial coefficient "2t+1 choose t". Therefore, log Ct=2t−Θ(log t) is a lower bound to the storage complexity of binary trees. Jacobson presented a storage scheme in 2t+o(t) bits while supporting the navigation operations in O(1) time. This method is also asymptotically optimal (up to lower order terms) in storage space.
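As a quick numerical check of this bound, the following Python sketch evaluates the count of binary trees and compares log2 Ct with 2t. The function name num_binary_trees is illustrative only and is not taken from the cited work.

```python
from math import comb, log2

def num_binary_trees(t: int) -> int:
    """Number of binary (unlabeled) trees on t nodes: binom(2t+1, t) / (2t+1)."""
    return comb(2 * t + 1, t) // (2 * t + 1)

for t in (10, 100, 1000):
    bits = log2(num_binary_trees(t))
    # The gap 2t - log2(Ct) grows only logarithmically in t.
    print(t, round(bits, 2), round(2 * t - bits, 2))
```

The printed gap stays small relative to 2t, which is the sense in which a 2t+o(t)-bit representation is optimal up to lower order terms.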
Munro and Raman extended the method of Jacobson with more efficient as well as a richer set of operations, including sub-tree size queries. See I. Munro and V. Raman, Succinct Representation of Balanced Parentheses, Static Trees and Planar Graphs, IEEE FOCS 1997, 118-126, herein incorporated by reference.
Other known practices have further generalized these teachings to trees with higher degrees and ever richer sets of operations, such as level-ancestor queries. Succinct representations have been invented for other data structures including arrays, dictionaries, strings, graphs, and multisets.
However, each of these practices deals with unlabeled trees. The fundamental problem of structuring labeled trees succinctly has remained un-solved, even though labeled trees arise frequently in practice. Classical applications of trees in Computer Science, whether for representing data or computation, typically generate navigational problems on labeled trees.
The information-theoretic lower bound for storing labeled trees is 2t+t log |Σ|, where the first term follows from the structure of the tree and the second from the labels. One method for storing labeled trees may include replicating the known structures for the unlabeled case |Σ| times. This method is somewhat improved by deriving a succinct representation for labeled trees that uses
2t+t log |Σ|+O(t|Σ|(log log log t)/(log log t))
bits of storage and supports navigational operations in O(1) time. However, this is far from optimal for even moderately large Σ since the O( ) term dominates the others for |Σ|=Ω(log log t), and many common applications routinely generate labeled trees over large alphabets. For example, XML processing and execution traces often generate labeled trees over large alphabets. The known techniques that work on the unlabeled tree structure cannot embed the label information in a way that is suitable for efficient navigation or compression for labeled trees. For example, see R. F. Geary, R. Raman, and V. Raman, Succinct Ordinal Trees with Level-Ancestor Queries, in Proc. 15th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1-10, 2004.
Therefore, as labeled trees arise frequently in practice, there is a need in the art for a method allowing the succinct representation and efficient navigation of labeled trees.
The present invention solves the above-identified needs, as well as others, by providing a new approach to structuring labeled trees. The present invention provides a method of compressing and indexing a labeled tree of data into two coordinated arrays, one capturing the structure of the tree and the other capturing the labels of the tree. The present invention may also include performing navigational operations on the labeled tree of data after the data has been transformed. The present invention provides the first known (near) optimal results for succinct representation of labeled trees with O(1) time for navigation operations, independent of the size of the alphabet Σ and the structure of the tree T. The tree T may be of arbitrary degree and arbitrary shape. (See Ferragina et al., Indexing Compressed Text, Journal of the ACM Vol. 52, No. 4, July 2005, the contents of which are incorporated herein by reference.)
The present invention provides a succinct data structure for labeled trees based on the xbw transform using optimal (up to lower order terms) tH0(Sα)+2t+o(t) bits and supporting subpath queries in time O(|p| log |Σ|) where |p| is the length of the path. This data structure also supports navigational queries in O(log |Σ|) time. If |Σ|=O(polylog(t)), the subpath query takes optimal O(|p|) time, and navigational queries take optimal O(1) time.
Algorithmic results that build on Jacobson's method, which reduces the succinct structuring of unlabeled trees to that of arrays and parentheses, have become very technical, involving stratifications, tree partitioning, etc. Extending these techniques to labeled trees and the powerful subpath query would have entailed making the algorithms even more complicated.
In contrast, the present invention is simple to implement and leads beyond mere succinctness to provide both entropy-bounded as well as succinctness results in a unified framework for structuring labeled trees. Other objects, features, and advantages will be apparent to persons of ordinary skill in the art from the following detailed description of the invention and accompanying drawings.
One embodiment includes a method of compressing and indexing data within a labeled tree structure, the method comprising inputting data in a labeled tree structure; transforming the data into two coordinated arrays; and outputting the transformed data for one of compressed storage, compressed display, compressed transmission, and compressed indexing.
In another embodiment, the method is capable of being applied to the transmission of compressed labeled trees of arbitrary degree and of arbitrary shape.
In another embodiment, the data is a data format capable of being modeled as a labeled tree. The data may be XML data.
In another embodiment, the method is capable of being applied to the transmission of compressed XML data.
In another embodiment, transforming the data includes using path sorting and grouping the data to linearize the data.
In another embodiment, one array corresponds to information stored in the structure of the labeled tree, and the second array corresponds to information stored in labels of the labeled tree.
In another embodiment the method is capable of being applied to labeled trees of arbitrary degree and arbitrary shape.
In another embodiment, the data is XML data. This embodiment may further include computing XBW(d)=<Ŝlast, Ŝα, Ŝpcdata> for the XML data; merging Ŝα and Ŝlast into Ŝ′α; and compressing separately Ŝ′α and Ŝpcdata. Alternatively, this embodiment may further include computing XBW(d)=<Ŝlast, Ŝα, Ŝpcdata> for the XML data; storing Ŝlast using a compressed representation supporting rank/select queries; storing Ŝα using a compressed representation supporting rank/select queries; splitting Ŝpcdata into buckets such that if two elements have the same upward path, the two elements will be in the same bucket; and compressing each bucket using any compressed full-text index. For example, FM-index may be used.
Another embodiment may include computing XBW(d)=<Ŝlast, Ŝα, Ŝpcdata> for the XML data; storing Ŝlast using a compressed representation supporting rank/select queries; storing Ŝα using a compressed representation supporting rank/select queries; and storing Ŝpcdata using a compressed data structure supporting substring searches in a subarray of Ŝpcdata specified at a query time.
In one embodiment, the transformed data can be reconstructed from the two arrays.
Another embodiment may further include creating an array [1,n]; visiting the internal nodes of the labeled tree structure in preorder; writing in the array the level of each visited node and the position in the tree structure of a parent of each visited node; recursively sorting each upwards path of the labeled tree starting at nodes at levels≢j (mod 3), wherein j is an element of {0, 1, 2}; sorting each upwards path starting at nodes at levels≡j (mod 3) using the result from the recursive sorting; and merging the two sets of sorted paths.
This embodiment may further include reconstructing the compressed and indexed data, wherein reconstructing includes building an array F according to BuildF(xbw(T)). This embodiment may further include building an array J according to BuildJ(xbw(T)) and recovering the transformed data using the array J.
Another embodiment may further include performing at least one navigational query on the compressed indexed data. In this embodiment, the at least one navigational query includes at least one from the group consisting of querying the parent of u, querying the ith child of u, and querying the ith child of u with label c, wherein u is a node of the labeled tree, i is an integer, and c is a character. Performing the at least one navigational query may include performing GetChildren or GetParent.
Another embodiment may include counting queries over a plurality of children of a node of the labeled tree, the node having a label c. In this embodiment, counting queries may further include a sort count of (u,c) where u is a node and c is a character, and further comprise returning a result of the number of children of u being labeled c. This method may include using a variation of GetChildren.
Another embodiment includes a method of searching compressed and indexed data from a labeled tree structure, wherein the method of searching includes inputting data in a labeled tree structure; transforming the data to a compressed and indexed form including two arrays, wherein one array corresponds to the structure of the tree and the other array corresponds to the labels of the tree; receiving terms for a search of the compressed indexed data; searching the compressed indexed data, wherein searching the compressed index data includes decompressing only a small fraction of the data; and at least one from the group consisting of outputting the results of the search and counting the number of search results.
This embodiment may further include performing a SubPathSearch.
For a more complete understanding of the present invention, the needs satisfied thereby, and the objects, features, and advantages thereof, reference is now made to the following description taken in connection with the accompanying drawings.
For a labeled tree T of arbitrary fan-out, depth, and shape, having n internal nodes labeled with symbols drawn from a set ΣN={A, B, . . . }, and l leaves labeled with symbols drawn from ΣL={a, b, c, . . . }, the present invention assumes that ΣN∩ΣL=∅, so that internal nodes can be distinguished from leaves directly from their labels. Σ, the set of labels effectively used in T's nodes, is set as Σ=ΣN∪ΣL. The symbols in Σ are encoded with integers in the range [1, |Σ|], with the symbols of ΣN first. A preliminary bucket sorting step may be needed in order to compute this labeling. The overall size of T is t=n+l nodes. For each node u, α[u], which is an element of Σ, denotes the label of u, and π[u] denotes the string obtained by concatenating the symbols on the upward path from u's parent to the root of T.
π[u] is formed by labels of internal nodes only. As sibling nodes may be labeled with the same symbol, many nodes in T may have the same π-string. For example, in
The present invention comprises a method including the step of creating a sorted multiset S consisting of t triplets, one for each node. S is created by first visiting T in pre-order and, for each visited node u, inserting the triplet s[u]=<last[u], α[u], π[u]> in S, where last[u] is a binary flag set to 1 if u is the last child of its parent in T. Then, S is stably sorted lexicographically according to the π-component of its triplets. As noted above, the same triplet may appear more than once because sibling nodes may have the same label. Thus, the stability of the sorting procedure assists in preserving the identity of triplets after the sorting step.
Hereinafter, Slast[i] (resp. Sα[i], Sπ[i]) refers to the last (resp. α, π) component of the i-th triplet of S. An example for a tree structure is shown in
The sorted set S[1, t] includes the following properties: Slast has n bits set to 1 (one for each internal node); the other t−n bits are set to 0; Sα contains all the labeled nodes of T; and Sπ contains all the upward labeled paths of T. Each path is repeated a number of times equal to the number of its offspring. Thus, Sα is a lossless serialization of the labels of T whereas Slast provides information on the groupings of the children of T's nodes.
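The construction of S can be sketched directly from its definition. The following Python sketch is a naive illustration only: it materializes every π-string explicitly, which is reasonable only for small inputs. The function name xbw_naive, the nested (label, [children]) tree encoding, the single-character labels, and the example tree are all assumptions made for this illustration.

```python
def xbw_naive(tree):
    """Return (S_last, S_alpha, S_pi) for tree = (label, [children])."""
    triplets = []

    def visit(node, pi, last):
        label, children = node
        triplets.append((last, label, pi))
        for k, child in enumerate(children):
            # pi of a child: the parent's label followed by the parent's own pi
            visit(child, label + pi, int(k == len(children) - 1))

    visit(tree, "", 0)                    # the root has no parent, so last = 0
    triplets.sort(key=lambda s: s[2])     # Python's sort is stable
    S_last = [s[0] for s in triplets]
    S_alpha = [s[1] for s in triplets]
    S_pi = [s[2] for s in triplets]
    return S_last, S_alpha, S_pi

def leaf(x):
    return (x, [])

# Hypothetical example: uppercase labels are internal nodes, lowercase are leaves.
T = ("A", [("B", [leaf("a"), ("C", [leaf("b")])]),
           ("C", [leaf("a"), leaf("b")]),
           ("B", [leaf("c")])])

S_last, S_alpha, S_pi = xbw_naive(T)
for row in zip(S_last, S_alpha, S_pi):
    print(*row)
```

In this example the tree has five internal nodes, and Slast accordingly contains five bits set to 1.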
The following structural properties of T can be inferred from S's sorting:
1. The first triplet of S refers to the root of T.
2. If u′ and u″ are two nodes of T such that π[u′]=π[u″], nodes u′ and u″ have the same depth in T, and u′ is to the left of u″ if the triplet s[u′] precedes the triplet s[u″] in S.
3. If u1, . . . , uc are the children of a node u in T, the triplets s[u1], . . . , s[uc] lie contiguously in S following this order. Moreover, the last triplet s[uc] has its last-component set to 1, whereas all the other triplets have their last-component set to 0.
4. If v1, v2 denote two nodes of T such that α[v1]=α[v2], and s[v1] precedes s[v2] in S, then the children of v1 precede the children of v2 in S.
This transform of the labeled tree T will be hereinafter denoted by xbw(T). xbw(T) denotes the pair <Slast, Sα>. Slast is a binary string of length t and Sα is a permutation of the t labels associated to the nodes of T. T can be recovered from Slast and Sα; therefore, xbw is an invertible transform.
The xbw transform includes taking t log |Σ| bits for Sα, plus t bits for Slast.
In one embodiment, the internal nodes and leaves may be labeled using one unique alphabet (i.e., ΣN=ΣL). Then, an additional bit array of length t is needed to distinguish between leaves and internal nodes in Sα. The overall space is then 2t+t log |Σ|. This is optimal up to lower order terms because the information theoretic lower bound on the space for representing a t-node ordinal tree is 2t−O(log t) bits, and the term t log |Σ| is the optimal cost of representing the labels in the worst case.
However, labeling leaves and internal nodes with two distinct alphabets simplifies the algorithms applied in the transform. The space cost of the additional bit array does not asymptotically influence the final bounds.
Converting T to xbw(T)
For the computation of xbw(T), explicitly building S would require a large amount of space and time. For example, in a degenerate tree with a single path of t nodes, the overall size of Sπ would be Θ(t^2). The construction of S in a preferred embodiment of the present invention is therefore not explicit. In this preferred embodiment, the method includes sorting the π-components using an algorithm somewhat similar to the skew algorithm for suffix array construction. See J. Kärkkäinen and P. Sanders, Simple Linear Work Suffix Array Construction, Proc. ICALP, 943-955, 2003, the contents of which are incorporated herein by reference. However, in the original skew algorithm, the recursion consists of sorting all suffixes starting at positions≢1 (mod 3), whereas the present invention works equally well if, instead of 1, the residue 0 or 2 is used.
Referring to
In step S3, j is defined as an element of {0, 1, 2} such that the number of nodes in IntNodes whose level is ≡j (mod 3) is at least n/3. In step S4, the upward paths starting at nodes at levels≢j (mod 3) are sorted recursively.
In step S5, the upward paths starting at nodes at levels≡j (mod 3) are sorted using the result of step S4. Finally, in step S6, the two sets of sorted paths are merged. In steps S3 and S4, the parameter j is chosen in such a way that the number of nodes at level≡j (mod 3) is at least t/3.
A lexicographic name is assigned to each path according to its first three symbols, wherein the symbols are obtained using the parent pointers. For example, radix sort may be used to assign the lexicographic names.
Then, a new “contracted” tree is built, whose labels are the names previously assigned. Based on the choice of j, the new tree will have at most 2t/3 nodes. The new tree will have a fan-out larger than the original one. However, this does not affect the algorithm, which only uses parent pointers.
This method runs in optimal O(t) time and uses O(t log t) bits of space. The running time satisfies the recurrence T(t)=T(2t/3)+Θ(t) and is therefore O(t). Once the π-components have been sorted, Slast and Sα may be constructed simply.
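The linear bound follows from unrolling the recurrence into a geometric series. In the derivation below, c is an assumed constant bounding the Θ(t) term at every level of the recursion.

```latex
T(t) \;\le\; T\!\left(\tfrac{2t}{3}\right) + c\,t
     \;\le\; c\,t \sum_{i \ge 0} \left(\tfrac{2}{3}\right)^{i}
     \;=\; 3\,c\,t \;=\; O(t).
```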
Reconstructing T from xbw(T)
As noted above, xbw is an invertible transform and T can be recovered. The method of reconstruction of T given xbw(T)=<Slast, Sα> in O(t) time involves three phases. In the first phase, an array F[1, |ΣN|] is built which approximates Sπ at its first symbol. For every internal-node label x which is an element of ΣN, F[x] stores the position in S of the first triplet whose π-component is prefixed by x. For example, in
In the second phase, an array J[1, t] is built which allows a “jump” in S from any node to its first child. J[i] is set as J[i]=j if S[i] is an internal node and S[j] is the first child of S[i]. If S[i] is a leaf, J[i] is set as J[i]=−1. For example, in
In the third phase, the original tree T is recovered using array J.
A. First Phase
The array F[1, |ΣN|] is computed in O(t) time using the following method including BuildF(xbw(T)), as shown in
As shown in step 2 of BuildF(xbw(T)), F[1] is set as F[1]=2 because Sπ[1] is the empty string, which occurs only once in Sπ, and Sπ[2] is therefore prefixed by symbol 1. In connection with step 4, it is assumed that F[i] is the position of the first entry in Sπ prefixed by symbol i. There are C[i] internal nodes labeled by i and their children occur contiguously in S starting at position F[i]. Thus, the last entry in Sπ prefixed by i is the one corresponding to the C[i]-th 1 in Slast counting from position F[i]. The loop in steps 6-8 serves the purpose of counting C[i] 1's starting from Slast[F[i]]. Therefore, the value F[i+1] is correctly set at step 9.
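The first phase as described above can be sketched in Python as follows. This is an illustration written from the prose description, not a transcription of the BuildF listing. Positions are 0-based, so the first symbol maps to position 1 rather than 2, and symbols are represented by single characters held in a sorted list sigma_N. The function name build_F and the example arrays are assumptions of this sketch.

```python
def build_F(S_last, S_alpha, sigma_N):
    """F[x] = 0-based position in S of the first row whose pi starts with x."""
    C = {x: 0 for x in sigma_N}
    for a in S_alpha:
        if a in C:
            C[a] += 1                 # C[x] = number of internal nodes labeled x
    F = {sigma_N[0]: 1}               # row 0 is the root (empty pi)
    for x, nxt in zip(sigma_N, sigma_N[1:]):
        ones, j = 0, F[x] - 1
        while ones != C[x]:           # skip C[x] groups of siblings
            j += 1
            ones += S_last[j]
        F[nxt] = j + 1                # the next symbol starts right after them
    return F

# Hypothetical example: uppercase labels are internal nodes, lowercase are leaves.
S_last = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
S_alpha = list("ABCBaCcabb")
F = build_F(S_last, S_alpha, ["A", "B", "C"])
print(F)
```

Each position of Slast is scanned at most once across all symbols, which matches the O(t) bound stated for this phase.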
B. Second Phase
In the second phase of the inversion method, the array J[1, t] is computed in O(t) time. This phase makes use of the properties of the xbw transform that the k-th occurrence of the symbol j, which is an element of ΣN, in Sα corresponds to the k-th group of siblings counting from position F[j]. This allows a computation of J with a simple scan of the array Sα, as shown below in BuildJ(xbw(T)), as illustrated in
C. Third Phase
In the third phase, T is recovered given Sα, Slast and the array J. T may be represented in several ways. For example, the nodes of T may be listed in depth-first order, using a stack Q that holds pairs <i, u>, where i denotes the position of the node in S and u denotes the node label. For each popped pair <i, u>, arrays J and Sα can be used to locate its children (if any) and insert them in the stack in reverse order.
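The second and third phases can be sketched together in Python. The sketch assumes that the array F from the first phase is already available (hard-coded here for a small hypothetical example) and uses 0-based positions. Unlike the depth-first listing described above, rebuild attaches each child to its parent directly, so the order in which entries leave the stack does not matter. The names build_J and rebuild are illustrative.

```python
def build_J(S_last, S_alpha, F):
    """J[i] = position of the first child of row i, or -1 if row i is a leaf."""
    nxt = dict(F)                     # next unused group of siblings per symbol
    J = []
    for a in S_alpha:
        if a not in nxt:              # leaf label
            J.append(-1)
            continue
        z = nxt[a]
        J.append(z)
        while S_last[z] != 1:         # advance to the end of this group
            z += 1
        nxt[a] = z + 1
    return J

def rebuild(S_last, S_alpha, J):
    """Recover the tree as nested (label, [children]) pairs."""
    root = (S_alpha[0], [])
    stack = [(0, root)]
    while stack:
        i, node = stack.pop()
        j = J[i]
        if j < 0:
            continue                  # leaf: nothing to attach
        while True:
            child = (S_alpha[j], [])
            node[1].append(child)
            stack.append((j, child))
            if S_last[j] == 1:        # last sibling of this group
                break
            j += 1
    return root

# Hypothetical example: uppercase labels are internal nodes, lowercase are leaves.
S_last = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
S_alpha = list("ABCBaCcabb")
F = {"A": 1, "B": 4, "C": 7}          # assumed output of the first phase
J = build_J(S_last, S_alpha, F)
tree = rebuild(S_last, S_alpha, J)
print(J)
print(tree)
```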
Navigational Operations on Transformed Data
The present invention provides a method of producing a succinct data structure for data included in labeled trees that uses at most 2t log |Σ|+O(t) bits and that supports all navigational queries, such as parent(u), child(u, i), and child(u, α), in optimal O(1) time. The space used is nearly optimal for any Σ, being at most twice the information theoretic lower bound of t log |Σ|+2t.
Using a succinct representation of Sα and Slast, in connection with an additional array, allows navigation over T following node labels or the ranking of nodes among their siblings. This involves the use of rank and select operations over arbitrary sequences. Given a sequence S[1, t] over an alphabet Σ, rankc(S, q) is the number of times the symbol c which is one of the set Σ appears in S[1, q]=s1s2 . . . sq, and selectc(S, q) is the position of the q-th occurrence of the symbol c which is one of the set Σ in S.
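The two operations can be illustrated with a naive Python sketch. It scans the sequence linearly and is meant only to pin down the definitions; it does not reflect the succinct data structures discussed in this section. Positions are 1-based as in the text, and the convention that select returns 0 when the requested occurrence does not exist is an assumption of this sketch.

```python
def rank(S, c, q):
    """Number of occurrences of c in S[1..q] (positions are 1-based)."""
    return sum(1 for x in S[:q] if x == c)

def select(S, c, q):
    """1-based position of the q-th occurrence of c in S, or 0 if there is none."""
    seen = 0
    for pos, x in enumerate(S, start=1):
        if x == c:
            seen += 1
            if seen == q:
                return pos
    return 0

S = "abracadabra"
print(rank(S, "a", 4), select(S, "a", 3))
```

The two operations are inverse to each other in the sense that rank(S, c, select(S, c, q)) equals q whenever the q-th occurrence exists.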
If Σ={0, 1}, the data structure of Raman, Raman, and Rao supports rank1(s, q) (when s[q]=1) and select1 queries in O(1) time using log binom(|s|, m)+o(m)+O(log log |s|) bits, where s is a sequence over alphabet Σ, m is the number of 1's in s, and binom(|s|, m) denotes the binomial coefficient. (See R. Raman, V. Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 233-242, 2002, the entire contents of which are incorporated herein by reference.)
For |Σ|=O(polylog(t)), the generalized wavelet tree of Ferragina et al. supports rankc and selectc queries in O(1) time using |s|H0(s)+o(|s|) bits of space, where H0(s) denotes the 0th order empirical entropy of the sequence s. (See P. Ferragina, G. Manzini, V. Makinen, and G. Navarro. An alphabet-friendly FM-index. In Proc. 11th Symposium on String Processing and Information Retrieval (SPIRE '04), pages 150-160. Springer-Verlag LNCS n. 3246, 2004. To appear in “ACM Transactions on Algorithms”, see also Tech. Report of Univ. Chile #TR/DCC-2004-5, 2004, the entire contents of which are incorporated herein by reference.)
For general Σ, the wavelet tree of Grossi, Gupta, and Vitter supports rankc and selectc queries in O(log |Σ|) time using |s|H0(s)+o(|s|) bits of space, where H0(s) denotes the 0th order empirical entropy of the sequence s. (See R. Grossi, A. Gupta, J. Vitter, High-Order Entropy-Compressed Text Indexes. Proc. 14th ACM-SIAM Symposium on Discrete Algorithms (SODA), 841-850, 2003, the entire contents of which are incorporated herein by reference.)
These data structures support the retrieval of s[i] for any i in the same time as the rank and select queries.
Based on the array F, we define the binary array A[1, t] such that A[j]=1 iff j=F[x] for some x which is an element of ΣN. In other words, the ones in A mark the positions j such that the first symbol of Sπ[j] differs from the first symbol of Sπ[j−1].
Given an index i, 1≦i≦t, the algorithm GetChildren in
For example, the method in
Given an index i, 1≦i≦t, the method GetParent in
The method in
There is a technical detail in achieving constant time rank and select queries over Sα. Recall that ΣN is mapped onto the range [1, |ΣN|]. Let B be a binary sequence of length |ΣN|t such that, for each c which is an element of ΣN and each i with 1≦i≦t, B[t(c−1)+i]=1 iff Sα[i]=c. In other words, B consists of |ΣN| segments of length t such that the x-th segment is a bitmap for the occurrences of the symbol x in Sα. Let occ be an array of size n storing in occ[c] the number of occurrences in Sα of symbols smaller than c, which is an element of ΣN. Because the segment for c starts at offset t(c−1) in B, it is easy [12] to verify that for c, which is an element of ΣN
rankc(Sα,i)=rank1(B,t(c−1)+i)−occ[c];
selectc(Sα,i)=select1(B,occ[c]+i)−t(c−1).
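The reduction can be checked on a small example. The Python sketch below builds B and occ from a hypothetical label array and answers rankc and selectc through rank1 and select1 on B, using the segment offset t(c−1) that follows from the definition of B. The plain list scans stand in for the constant-time binary structures, and all names are illustrative.

```python
S_alpha = list("ABCBaCcabb")          # example label array; uppercase = internal
sigma_N = ["A", "B", "C"]             # symbol c in [1, |Sigma_N|] is sigma_N[c - 1]
t = len(S_alpha)

# B[t(c-1)+i] = 1 iff S_alpha[i] = c; index 0 is an unused dummy entry
B = [0] * (len(sigma_N) * t + 1)
for i, x in enumerate(S_alpha, start=1):
    if x in sigma_N:
        B[t * sigma_N.index(x) + i] = 1

# occ[c] = number of occurrences in S_alpha of symbols smaller than c
occ = [0] * (len(sigma_N) + 2)
for c in range(1, len(sigma_N) + 1):
    occ[c + 1] = occ[c] + S_alpha.count(sigma_N[c - 1])

def rank1(q):
    """Number of 1's in B[1..q]."""
    return sum(B[1:q + 1])

def select1(q):
    """Position of the q-th 1 in B, or 0 if there is none."""
    ones = 0
    for pos in range(1, len(B)):
        ones += B[pos]
        if B[pos] and ones == q:
            return pos
    return 0

def rank_c(c, i):
    return rank1(t * (c - 1) + i) - occ[c]

def select_c(c, i):
    return select1(occ[c] + i) - t * (c - 1)

print(rank_c(2, 4), select_c(3, 2))
```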
Using rank1 and select1 operations over B one can answer in O(1) time rankc and selectc queries over Sα for any symbol c, which is an element of ΣN (note that algorithms GetChildren and GetParent do not need rankc and selectc operations for symbols c associated to leaves).
For any alphabet Σ, there exists a succinct representation for xbw(T) that takes at most tH0(Sα)+2t+o(t) bits and supports parent(u), child(u, i) and child(u, α) queries in O(log |Σ|) time. If |Σ|=O(polylog(t)), navigational queries take O(1) time.
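The navigation described in this section can be sketched in Python with naive rank and select scans standing in for the succinct structures. This is written from the properties of S, F and A stated above, not from the GetChildren and GetParent listings, so the exact steps are an assumption. Positions are 1-based. Because the root row carries a 0 in Slast, the helper group_end maps a count of zero groups to the root position 1. The example arrays are hypothetical.

```python
S_last = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
S_alpha = list("ABCBaCcabb")           # uppercase = internal node, lowercase = leaf
sigma_N = ["A", "B", "C"]
F = {"A": 2, "B": 5, "C": 8}           # 1-based first row per leading symbol of pi
t = len(S_alpha)
A = [1 if j in F.values() else 0 for j in range(1, t + 1)]

def rank(S, c, q):
    return S[:q].count(c)

def select(S, c, q):
    seen = 0
    for pos, x in enumerate(S, start=1):
        if x == c:
            seen += 1
            if seen == q:
                return pos
    return 0

def group_end(m):
    """Position of the m-th 1 in S_last; m = 0 maps to the root row 1."""
    return select(S_last, 1, m) if m > 0 else 1

def get_children(i):
    """Range (first, last) of the children of row i, or None for a leaf."""
    c = S_alpha[i - 1]
    if c not in F:
        return None
    r = rank(S_alpha, c, i)            # row i is the r-th node labeled c
    z = rank(S_last, 1, F[c] - 1)      # sibling groups that end before F[c]
    return group_end(z + r - 1) + 1, group_end(z + r)

def get_parent(i):
    """Row of the parent of row i, or None for the root."""
    if i == 1:
        return None
    c = sigma_N[rank(A, 1, i) - 1]     # first symbol of S_pi[i] is the parent's label
    k = rank(S_last, 1, i - 1) - rank(S_last, 1, F[c] - 1)
    return select(S_alpha, c, k + 1)

print(get_children(1), get_children(4), get_parent(7), get_parent(10))
```

Under these conventions, child(u, i) would be the row first+i−1 whenever it does not exceed last, and child(u, α) can be answered by restricting a rank/select query on Sα to the returned range.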
Subpath Search in xbw(T)
The present invention provides a new, powerful operation: subpath query. Given a path p of labels from Σ, the subpath query returns all nodes u of T such that there exists a path leading to u labeled by p. Note that u and the origin of p could be internal to T. Subpath query is of central interest to the XPATH query language in XML and is also used as a basic block for supporting more sophisticated path searches. Still, no prior method is known for supporting subpath queries on trees represented succinctly.
This embodiment finds the nodes u of T whose upward path, i.e., the path obtained by concatenating the labels from u to the root, is prefixed by the string β=q1 . . . qk, with each qi being an element of ΣN. (Subpath queries with downward paths can be handled by reversing the path in the query.) This embodiment of the present invention also implicitly counts the number of subpaths matching β in T. Because of the sorting of S, the triplets corresponding to the nodes in the output of the subpath query are contiguous in S. Their range is denoted with S[First, Last], so that Sπ[First, Last] are exactly the rows of Sπ prefixed by β.
The method of SubPathSearch in
Thus, a subpath search can be performed for a string β in O(|β| log |Σ|) time for a general alphabet Σ, and in optimal O(|β|) time if |Σ|=O(polylog(t)).
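A subpath search over the two arrays can be sketched in Python as follows. The sketch is an assumption-laden illustration rather than the SubPathSearch listing: it processes β from its last symbol (farthest from the node) to its first, narrowing the range of rows by one rank/select step per symbol. Rank and select are naive scans, positions are 1-based, β is assumed non-empty, and the array S_pi is kept only so that the result can be checked by brute force.

```python
S_last = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
S_alpha = list("ABCBaCcabb")
S_pi = ["", "A", "A", "A", "BA", "BA", "BA", "CA", "CA", "CBA"]  # for checking only
sigma_N = ["A", "B", "C"]
F = {"A": 2, "B": 5, "C": 8}
t = len(S_alpha)

def rank(S, c, q):
    return S[:q].count(c)

def select(S, c, q):
    seen = 0
    for pos, x in enumerate(S, start=1):
        if x == c:
            seen += 1
            if seen == q:
                return pos
    return 0

def group_end(m):
    """Position of the m-th 1 in S_last; m = 0 maps to the root row 1."""
    return select(S_last, 1, m) if m > 0 else 1

def subpath_search(beta):
    """Range (first, last) of rows whose pi is prefixed by beta, or None."""
    if any(q not in F for q in beta):
        return None
    k = sigma_N.index(beta[-1])
    first = F[beta[-1]]
    last = F[sigma_N[k + 1]] - 1 if k + 1 < len(sigma_N) else t
    for q in reversed(beta[:-1]):
        k1 = rank(S_alpha, q, first - 1)
        k2 = rank(S_alpha, q, last)
        if k1 == k2:
            return None                # no node labeled q in the current range
        z = rank(S_last, 1, F[q] - 1)
        # jump to the children of the (k1+1)-th through k2-th nodes labeled q
        first, last = group_end(z + k1) + 1, group_end(z + k2)
    return first, last

def brute_force(beta):
    rows = [i for i, p in enumerate(S_pi, start=1) if p.startswith(beta)]
    return (rows[0], rows[-1]) if rows else None

for beta in ("A", "BA", "CB", "CBA", "AB"):
    print(beta, subpath_search(beta))
```

The number of matching rows is last−first+1, which gives the counting variant of the query without enumerating the nodes.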
Tree Entropy and Compression
The present invention goes beyond succinctness and provides for data structuring labeled trees using bits proportional in number to the inherent entropy of the tree as well as the labels. For a string of symbols, the notion of entropy is well developed, understood, and exploited in indexing and compression: high-order entropy depends on frequency of occurrences of substrings of length k. For trees, there is information or entropy in the labels as well as in the subtree structure in the neighborhood of each node. The simplest approach to labeled tree compression is to serialize it using say pre or postorder traversal to get a sequence of labels and apply any known compressor such as gzip. This is fast and is used in practice, but it does not capture the entropy contained in the tree structure. Another approach is to identify repeated substructures in the tree and collapse them a la Lempel-Ziv methods for strings. Such methods have been used for grammar-based compression, but there is no provable information-theoretic analysis of the compression obtained.
A more formal approach is to assume a parent-child model of tree generation in which node labels are generated according to the labels of their parents, or more generally, by a descendant or ancestor subtree of height k, for some k. From these approaches heuristic algorithms for tree compression have been derived and validated only experimentally. In XML compression, well-known software like XMILL and XMLPPM, group data according to the labeling path leading to them and then use specialized compressors ideal for each group. The length k of the predictive paths may be selected manually or automatically by following the classical PPM paradigm. However, even for these simplified approaches, no formal analysis of achievable compression is known. Despite work on labeled tree compression methods in multiple applied areas, there is no theory of achievable compression with associated lower and upper bounds for trees.
The present invention enables a more formal analysis of labeled tree compression. Based on the standard parent-child model of tree generation we define the kth order empirical entropy Hk(T) of a labeled tree T that mimics on trees, in a natural way, the well-known definition of kth order empirical entropy over strings. Further, based on the xbw transform according to an embodiment of the present invention, a method is provided to compress (and uncompress) T in O(t) time, getting a representation with at most tHk(T)+2.01t+o(t) bits. This is off from the lower bound of representing just the tree without labels (roughly 2t) by a factor that mainly depends on the entropy of the instance and may be significantly smaller than the O(log |Σ|) term that is the worst case over all inputs. While such results have been previously known for string compression, the present invention is the first application of entropy-based compression for trees. Again, these results rely on the xbw transform.
The locality principle exploited in universal compressors for strings is that each element of a sequence depends most strongly on its nearest neighbors, that is, predecessor and successor. The context of a symbol s is therefore defined on strings as the substring that precedes s. A k-context is a context of length k. The larger k is, the better the prediction of s given its k-context should be. The theory of Markov random fields extends this principle to more general mathematical structures, including trees, in which case a symbol's nearest neighbors are its ancestors, its children or any set of nearest nodes. In what follows we extend, in a natural way, the notion of k-context for string data to the notion of k-context for labeled-tree data. Let π[u] be the context of node u; the k-context of u is the k-long prefix of π[u], denoted by πk[u]. The context πk[u] should be a good predictor for the labels that might be assigned to u. A larger k induces a better prediction. As with string data, similarly labeled nodes tend to descend from similar contexts, and the similarity of contexts is proportional to the length of their shared prefix.
Node labels get distributed in xbw(T) according to a pattern that clusters closely the similar labels. Given the sorting of S the longer is the shared prefix between π[u] and π[v], for any two nodes u and v, the closer are the labels of u and v in Sα. There is a powerful homogeneity property over Sα that is the analog of the properties of the standard Burrows-Wheeler transform on strings for labeled trees.
To measure the compressibility of a labeled tree, we proceed as follows. Let β be a string drawn from an alphabet of h symbols, and let bi be the number of occurrences of the i-th symbol in β. We define the zeroth order empirical entropy on strings as usual: H0(β)=−Σi=1..h (bi/|β|) log (bi/|β|). Given H0, the k-th order empirical entropy on labeled trees can be defined. For any positive integer k and any string ρ which is an element of Σ^k, the notation cover[ρ] denotes the string of symbols labeling the nodes in T whose context is prefixed by ρ. Set Hk(T)=(1/t) Σρ |cover[ρ]| H0(cover[ρ]), where the sum ranges over all ρ which are elements of Σ^k. Thus, Hk(T) extends to labeled trees the notion of kth order empirical entropy introduced for strings, with the k-contexts reinterpreted “vertically” by reading the symbols labeling the upward paths.
Given the structure of xbw(T), for any k, the concatenation of the strings cover[ρ] for all possible ρ which are elements of Σ^k, taken in lexicographic order, gives the string Sα. This is the strong property that holds for the BW transform on strings, here generalized to labeled trees. As a consequence and given the definition of Hk, tree compression problems up to Hk(T) can be treated similarly to a string compression problem up to H0. This enables all the machinery developed for string data to be applied to tree compression.
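The two entropy measures can be illustrated with a short Python sketch. It takes H0 with the usual negative sign and normalizes Hk(T) by the number of nodes t, which is the normalization consistent with the tHk(T) terms appearing in the space bounds; both points are stated here as assumptions of the sketch. The explicit S_pi array and the function names are for illustration only.

```python
from collections import Counter, defaultdict
from math import log2

def H0(s):
    """Zeroth order empirical entropy of a sequence of symbols, in bits."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((b / n) * log2(b / n) for b in Counter(s).values())

def Hk(S_alpha, S_pi, k):
    """k-th order empirical entropy of the tree, from its labels and contexts."""
    cover = defaultdict(list)
    for label, pi in zip(S_alpha, S_pi):
        if len(pi) >= k:              # nodes with shorter contexts are not covered
            cover[pi[:k]].append(label)
    t = len(S_alpha)
    return sum(len(c) * H0(c) for c in cover.values()) / t

# Hypothetical example arrays (labels and upward paths in sorted order).
S_alpha = list("ABCBaCcabb")
S_pi = ["", "A", "A", "A", "BA", "BA", "BA", "CA", "CA", "CBA"]

print(round(H0(S_alpha), 4), round(Hk(S_alpha, S_pi, 1), 4))
```

On this example the first order tree entropy is well below the zeroth order entropy of the label array, reflecting how labels sharing a context cluster together in Sα.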
To locate the strings cover[ρ] within Sα for a fixed k, it is enough to use the longest common prefix (lcp) information derived from the PathSort method. The substrings cover[ρ] are delimited by lcp-values smaller than k, and provide a partition of Sα. As an alternative, an adaptation of the compression boosting technique may be used, which finds an optimal partition of Sα ensuring a compression bounded by Hk(T) for any positive k.
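The lcp-based partition may be sketched as follows, purely as an illustration. The names lcp and partition_by_context are hypothetical, contexts are modeled as plain strings of single-character labels, and the handling of contexts shorter than k (equal short contexts are kept in one block) is an assumption of this sketch.

```python
def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def partition_by_context(s_alpha, s_pi, k):
    """Split s_alpha into blocks whose contexts share a k-long prefix.

    s_alpha : labels in sorted (xbw) order
    s_pi    : the corresponding sorted contexts
    A new block starts wherever the lcp of adjacent contexts drops below k.
    """
    blocks = []
    for i, label in enumerate(s_alpha):
        boundary = i == 0 or (
            s_pi[i - 1] != s_pi[i] and lcp(s_pi[i - 1], s_pi[i]) < k
        )
        if boundary:
            blocks.append([])
        blocks[-1].append(label)
    return blocks
```

Each returned block can then be handed to a zeroth order compressor independently.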
If A is a compressor that compresses any string w into |w|H0(w)+μ|w| bits, then the string xbw(T) can be compressed in tHk(T)+t(μ+2)+o(t)+gk bits, where gk is a parameter that depends on k and on the alphabet size (but not on |w|). The bound holds for any positive k, whereas the approach is independent of k. The time and space complexities are both O(t) plus the time and space complexities of applying A over a t-symbol string.
Thus, the present invention uses path-sorting and grouping to linearize data in the form of a labeled tree T into two coordinated arrays, one capturing the structure of the data and the other the labels of the data. Using the properties of the xbw transform, the present invention provides the first-known (near-)optimal results for succinct representation of labeled trees with O(1) time for navigation operations, independent of Σ and the structure of T; optimally supports the powerful subpath search operation for the first time; and introduces a notion of tree entropy and presents linear time algorithms for compressing a given labeled tree up to its entropy, beyond the information-theoretic lower bound averaged over all tree inputs.
The present invention may be implemented using hardware, software or a combination thereof and may be implemented in one or more computer systems or other processing systems. In one embodiment, the invention is directed toward one or more computer systems capable of carrying out the functionality described herein. An example of such a computer system 200 is shown in
Computer system 200 includes one or more processors, such as processor 204. The processor 204 is connected to a communication infrastructure 206 (e.g., a communications bus, cross-over bar, or network). Various software embodiments are described in terms of this exemplary computer system. After reading this description, it will become apparent to a person skilled in the relevant art(s) how to implement the invention using other computer systems and/or architectures.
Computer system 200 can include a display interface 202 that forwards graphics, text, and other data from the communication infrastructure 206 (or from a frame buffer not shown) for display on the display unit 230. Computer system 200 also includes a main memory 208, preferably random access memory (RAM), and may also include a secondary memory 210. The secondary memory 210 may include, for example, a hard disk drive 212 and/or a removable storage drive 214, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive 214 reads from and/or writes to a removable storage unit 218 in a well-known manner. Removable storage unit 218 represents a floppy disk, magnetic tape, optical disk, etc., which is read by and written to by removable storage drive 214. As will be appreciated, the removable storage unit 218 includes a computer usable storage medium having stored therein computer software and/or data.
In alternative embodiments, secondary memory 210 may include other similar devices for allowing computer programs or other instructions to be loaded into computer system 200. Such devices may include, for example, a removable storage unit 222 and an interface 220. Examples of such may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an erasable programmable read only memory (EPROM), or programmable read only memory (PROM)) and associated socket, and other removable storage units 222 and interfaces 220, which allow software and data to be transferred from the removable storage unit 222 to computer system 200.
Computer system 200 may also include a communications interface 224. Communications interface 224 allows software and data to be transferred between computer system 200 and external devices. Examples of communications interface 224 may include a modem, a network interface (such as an Ethernet card), a communications port, a Personal Computer Memory Card International Association (PCMCIA) slot and card, etc. Software and data transferred via communications interface 224 are in the form of signals 228, which may be electronic, electromagnetic, optical or other signals capable of being received by communications interface 224. These signals 228 are provided to communications interface 224 via a communications path (e.g., channel) 226. This path 226 carries signals 228 and may be implemented using wire or cable, fiber optics, a telephone line, a cellular link, a radio frequency (RF) link and/or other communications channels. In this document, the terms “computer program medium” and “computer usable medium” are used to refer generally to media such as a removable storage drive 214, a hard disk installed in hard disk drive 212, and signals 228. These computer program products provide software to the computer system 200. The invention is directed to such computer program products.
Computer programs (also referred to as computer control logic) are stored in main memory 208 and/or secondary memory 210. Computer programs may also be received via communications interface 224. Such computer programs, when executed, enable the computer system 200 to perform the features of the present invention, as discussed herein. In particular, the computer programs, when executed, enable the processor 204 to perform the features of the present invention. Accordingly, such computer programs represent controllers of the computer system 200.
In an embodiment where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 200 using removable storage drive 214, hard drive 212, or communications interface 224. The control logic (software), when executed by the processor 204, causes the processor 204 to perform the functions of the invention as described herein. In another embodiment, the invention is implemented primarily in hardware using, for example, hardware components, such as application specific integrated circuits (ASICs). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons skilled in the relevant art(s).
In yet another embodiment, the invention is implemented using a combination of both hardware and software.
Example embodiments of the present invention have now been described in accordance with the above advantages. It will be appreciated that these examples are merely illustrative of the invention. Many variations and modifications will be apparent to those skilled in the art.
Compression and Searching XML Data
XML is fast becoming the standard format to store, exchange, and publish data over the Web. XML data is also often included or embedded in applications. XML is popular because it encodes a considerable amount of metadata in its plain-text format. Thus, applications can be more savvy about the semantics of the items in the data source. However, XML presents two challenges: size, because the XML representation of a document is significantly larger than its native state, and search complexity, because XML search involves path and content searches on labeled tree structures.
These challenges occur because, first, XML documents have a natural tree structure, and many of the basic tasks that are easy on arrays and lists—such as indexing, searching, and navigation—become more involved. Second, XML documents are wordy by design, repeating nearly the entire schema description for each data item. Thus, XML can be inefficient and burden a company's network, processor, and storage infrastructures. Finally, XML documents have mixed elements with both text and numerical or categorical attributes causing XML queries to be richer than commonly used SQL queries. For example, XML queries may include path queries on the tree structure and substring queries on contents.
In an exemplary embodiment, the present invention can be applied to provide access to contents, navigation, and searching of XML data in compressed form. Using the above described methods, a succinct tree representation can be used to design and implement a compressed index for the XML data, in which the XML data is maintained in a highly compressed format. Navigation and searching can be performed by uncompressing only a tiny fraction of the data. As discussed above, this method involves compressing two arrays derived from the XML data. This embodiment of the present invention overcomes the prior problems associated with XML data by providing a compression ratio up to 35% better than the ones achievable by previous methods and enabling search operations that are orders of magnitude faster. For example, search operations with the present invention may be performed in a few milliseconds over hundreds of MBs of XML files, whereas search operations on standard XML data sources using previous methods typically require tens of seconds.
As the relationships between elements in an XML document are defined by nested structures, XML documents are often modeled as trees whose nodes are labeled with strings of arbitrary length drawn from a usually large alphabet Σ. These strings are called tag or attribute names for the internal nodes, and content data (Pcdata for short) for the leaves. Thus, managing XML data or documents requires efficient support of navigation and path expression search operations over their tree structure.
Exemplary navigation operations are:
Find the parent of a given node u, find the ith child of u, or find the ith child of u with some label.
Exemplary path expressions are basic search operations that involve the structure and content of the XML document tree, such as:
Given a labeled subpath π and a string γ, find either the set of nodes N descending from π in d (π may be anchored to any internal node, not necessarily the tree's root), or the occurrences of string γ as a substring of the Pcdata contents of N's nodes.
Previous methods of finding parent and child nodes in the tree involved a mixture of pointers and hash arrays, which are both space and time consuming. Navigation required a scan of the whole document. Certain methods supported navigation in succinct space but did not support search operations, while others provided for certain search operations but greatly increased the required space. Compressors provided for compression, but required scanning of the entire document and decompression of large portions of the document to perform searching and navigation.
Applying the method of the present invention to XML data overcomes the time-efficient versus space-efficient dichotomy by applying a modified xbw transform to XML data in order to represent a labeled tree using two arrays. The first array contains the tree labels arranged in an appropriate order. The second array is a binary array encoding the structure of the tree.
In this embodiment, the xbw transform is modified to better exploit the features of XML documents.
A collection of XML compression and indexing functions may be created and stored in a library. The library may then either be included in software, or it can be directly used at the command-line with a full set of options for compressing, indexing, and searching XML documents.
This embodiment of the XBW transform as a compressor, one embodiment of which is illustrated as XBzip in
Studies show that this is comparable in its compression ratio to the state-of-the-art XML-conscious compressors (which tend to be significantly more sophisticated in employing a number of heuristics to mine some structure from the document in order to compress “similar contexts”). In contrast, the XBW transform automatically groups contexts together by a simple sorting step involved in constructing the two arrays. In addition, XBzip is a principled method with provably near-optimal compression.
A second embodiment of the present invention applied to XML data is illustrated in
XBzipIndex has additional features and may find other applications besides compressed searching. For example, it supports tree navigation (forward and backward) in constant time, allows random access to the tree structure in constant time, and can explore or traverse any subtree in time proportional to its size. This could be used within an XML visualizer or within native-XML search engines such as XQueC and F&B-index. There are more general XML queries like twig or XPath or XQuery, and XBzipIndex can be used as a core to improve the performance of known solutions. Another such example is that of structural joins, which are key in optimizing XML queries. Previous work involving summary indexes, or node-numbering such as Vist or Prüfer, is improved using XBzipIndex.
A. Compact Representation of DOM Trees
Given an arbitrary XML document d, an ordered labeled tree T which is equivalent to the DOM representation of d consists of four types of nodes defined as follows:
1. Each occurrence of an opening tag <t> originates a tag node labeled with the string <t.
2. Each occurrence of an attribute name a originates an attribute node labeled with the string @a.
3. Each occurrence of an attribute value or textual content of a tag, say ρ, originates two nodes: a text-skip node labeled with the character =, and a content node labeled with the string Øρ, where Ø is a special character not occurring elsewhere in d.
The structure of the tree T is defined as follows (see
1. The XBW Transform for XML Data
Tree T can be compactly represented by adapting the XBW transform discussed above. The XBW transform uses path-sorting and grouping to linearize the labeled tree T into two arrays. As shown in, this “linearized” representation is usually highly compressible and efficiently supports navigation and search operations over T. Note that we can easily distinguish between internal-node labels vs. leaf labels because the former are prefixed by either <, or @, or =, whereas the latter are prefixed by the special symbol Ø.
Let n denote the number of internal nodes of T and let l denote the number of its leaves, so that the total size of T is t=n+l nodes. For each node u, which is an element of T, let α[u] denote the label of u, last[u] be a binary flag set to 1 if and only if u is the last (rightmost) child of its parent in T, and π[u] denote the string obtained by concatenating the labels on the upward path from u's parent to the root of T.
To compute the XBW transform we build a sorted multi-set S consisting of t triplets, one for each tree node (see
1. Visit T in pre-order; for each visited node u insert the triplet s[u]=<last[u], α[u], π[u]> in S;
2. Stably sort S according to the π-component of its triplets;
3. Form XBW(d)=<Ŝlast, Ŝα, Ŝpcdata>, where Ŝlast=Slast[1, n], Ŝα=Sα[1, n], and Ŝpcdata=Sα[n+1, t].
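The three steps above can be sketched in Python as follows, as an illustration only. The function name xbw and the nested (label, children) tuple format are hypothetical. Two points are assumptions of this sketch rather than statements of the description: the root is given a last flag of 1, and the character = is remapped to a high code point during comparison so that it sorts after < and @, since plain code-point order would not satisfy that ordering.

```python
def xbw(tree, leaf_mark="Ø"):
    """Compute (S_last, S_alpha, S_pcdata) for a tree of (label, children) tuples.

    Internal-node labels start with '<', '@' or '='; leaf labels start
    with leaf_mark.
    """
    triplets = []

    def visit(node, pi, last):
        # Step 1: pre-order visit, recording <last[u], alpha[u], pi[u]>.
        label, children = node
        triplets.append((last, label, pi))
        for j, child in enumerate(children):
            visit(child, label + pi, int(j == len(children) - 1))

    visit(tree, "", 1)  # assumption: the root carries last = 1

    # Step 2: stable sort on the pi-component (Python's sort is stable).
    # '=' is remapped so that it compares larger than '<' and '@'.
    triplets.sort(key=lambda s: s[2].replace("=", "\U0010FFFF"))

    # Step 3: the first n rows are internal nodes, the rest are leaves.
    n = sum(1 for _, label, _ in triplets if not label.startswith(leaf_mark))
    s_last = [last for last, _, _ in triplets[:n]]
    s_alpha = [label for _, label, _ in triplets[:n]]
    s_pcdata = [label for _, label, _ in triplets[n:]]
    return s_last, s_alpha, s_pcdata
```

The recursive visit is adequate for shallow documents; a very deep tree would call for an explicit stack.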
Since sibling nodes may be labeled with the same symbol, several nodes in T may have the same π-component (see
We notice that the XBW transform defined in Step 3 is slightly different from the one introduced in [11] where XBW is defined as the pair <Slast, Sα>. The reason is that here the tree T is not arbitrary but derives from an XML document d. Indeed we have that Sα[1, n] contains the labels of the internal nodes, whereas Sα[n+1, t] contains the labels of the leaves, that is, the Pcdata. This is because if u is a leaf the first character of its upward path π[u] is = which we assume is lexicographically larger than the characters < and @ that prefix the upward path of internal nodes (see again
As described above, there is a linear time algorithm for retrieving T given <Slast, Sα>. Since it is trivial to get the document d from XBW(d), we have that XBW(d) is a lossless encoding of the document d. XBW(d) takes at most (17/8)n+l bytes in excess of the document length. However, this is an unlikely worst-case scenario since many characters of d are implicitly encoded in the tree structure (i.e., spaces between the attribute names and values, closing tags, etc).
XBW(d) was usually about 90% of the original document size. Moreover, the arrays Ŝlast, Ŝα, and Ŝpcdata are only an intermediate representation since we will work with a compressed image of these three arrays (see below).
Finally, it is possible to build the tree T without the text-skip nodes (the nodes with label =). However, if we omit these nodes, Pcdata will appear in Sα intermixed with the labels of internal nodes. Separating the tree structure (i.e. <Ŝlast, Ŝα>) from the textual content of the document (i.e. Ŝpcdata) has a twofold advantage: (i) the two strings Ŝα and Ŝpcdata are strongly homogeneous and hence highly compressible, (ii) search and navigation operations over T are greatly simplified.
2. Why XBW(d) Compresses Well
Suppose the XML fragment of
In other words, there is a substring of Sα consisting of all the data (immediately) enclosed in an <author> tag. Similarly, another section of Sα contains the labels of all nodes whose upward path is prefixed by, say, =@id<book and will therefore likely consist of id numbers. This means that Sα, and therefore Ŝα and Ŝpcdata, will likely have a strong local homogeneity property.
Most XML-conscious compressors are designed to “compress together” the data enclosed in the same tag since such data usually have similar statistics. The above discussion shows that the XBW transform provides a simple mechanism to take advantage of this kind of regularity.
In addition, XML compressors (e.g. X
3. Navigation and Search Using XBW(d)
Every node of T corresponds to an entry in the sorted multiset S (see
1. Let u1, . . . , uc be the children of a node u. The triplets s[u1], . . . , s[uc] lie contiguously in S in this order. The last triplet s[uc] has its last-component set to 1; the other triplets have their last-component set to 0.
2. Let v1, v2 denote two nodes with the same label (i.e., Sα[v1]=Sα[v2]). If s[v1] precedes s[v2] in S, then the children of v1 precede the children of v2 in S.
For every internal node label β, we define F(β) as the rank of the first row of S such that Sπ is prefixed by β. Thus, for the example of
To efficiently navigate and search T, in addition to XBW(d) and the array F, we need auxiliary data structures for the rank and select operations over the arrays Ŝlast and Ŝα. Recall that given an array A[1, n] and a symbol c, rankc(A, i) denotes the number of times the symbol c appears in A[1, i], and selectc(A, k) denotes the position in A of the k-th occurrence of the symbol c.
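The rank and select operations recalled above can be stated directly in Python. The following is a naive linear-scan sketch intended only to pin down the 1-based conventions; the function names and the convention that select returns 0 when k is 0 are assumptions of this example, and a practical embodiment would use auxiliary structures rather than scanning.

```python
def rank(A, c, i):
    """Number of times symbol c appears in A[1..i] (1-based, inclusive)."""
    return sum(1 for x in A[:max(i, 0)] if x == c)


def select(A, c, k):
    """1-based position in A of the k-th occurrence of c.

    Returns 0 for k == 0 (a convenient convention for navigation code)
    and raises ValueError if c occurs fewer than k times.
    """
    if k == 0:
        return 0
    seen = 0
    for pos, x in enumerate(A, start=1):
        if x == c:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences of the symbol")
```

The two operations are inverse in the sense that rank(A, c, select(A, c, k)) equals k whenever the k-th occurrence exists.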
The pseudocode of the procedure for computing the rank of the children of the node with rank i is shown in
The procedures for navigation and SubPathSearch have a similar simple structure and are straightforward adaptations of similar procedures introduced above. The only nontrivial operations are the rank and select queries mentioned above. Note that navigation operations require a constant number of rank/select queries, and the SubPathSearch procedure requires a number of rank/select queries proportional to the length of the searched path.
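As a hedged illustration of how a constant number of rank/select queries suffices for navigation, the following Python sketch derives the range of rows holding the children of a node from the two properties of S stated above. It is a reconstruction from those properties, not a copy of the referenced pseudocode. It assumes the full arrays Slast and Sα of length t, a root whose last flag is 1, and a table F given as a dictionary with entries for internal-node labels only; the name get_children and the sample arrays are hypothetical.

```python
def rank(A, c, i):
    """Occurrences of c in A[1..i] (1-based, inclusive)."""
    return A[:max(i, 0)].count(c)


def select(A, c, k):
    """1-based position of the k-th occurrence of c; 0 when k == 0."""
    if k == 0:
        return 0
    seen = 0
    for pos, x in enumerate(A, start=1):
        if x == c:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences of the symbol")


def get_children(i, s_last, s_alpha, F):
    """Return (first, last) rows of the children of the node at row i.

    Returns None when row i is a leaf (its label has no entry in F).
    """
    c = s_alpha[i - 1]
    if c not in F:
        return None
    y = F[c]                      # first row whose pi is prefixed by c
    k = rank(s_alpha, c, i)       # row i is the k-th node labeled c
    z = rank(s_last, 1, y - 1)    # sibling groups that end before row y
    first = select(s_last, 1, z + k - 1) + 1
    last = select(s_last, 1, z + k)
    return first, last


# Sorted rows for a small hypothetical tree A(B(a, C(b)), C(c), B(d)).
S_LAST = [1, 0, 0, 1, 0, 1, 1, 1, 1]
S_ALPHA = list("ABCBaCdcb")
F_TABLE = {"A": 2, "B": 5, "C": 8}
```

In the sample arrays, the second node labeled B sits at row 4 and its only child d is found at row 7, which matches the rule that children of equally labeled nodes keep the relative order of their parents.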
B. Computation of the XBW transform
To build the tree T we parse the input document d using an XML parser. For example, in one embodiment, the Expat library by James Clark may be used. Handlers are set so that the tree T is built from one hundred MBs of XML data in a few seconds. As described above, given T one can compute XBW(d) in time linear in the number of tree nodes.
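One possible way to set such handlers is sketched below using the Expat binding shipped with Python (xml.parsers.expat). The function name build_tree and the nested (label, children) tuple format are hypothetical. The sketch assumes that each attribute node receives a single text-skip child which in turn receives the content leaf, that attribute nodes precede the other children of their tag node, and that whitespace-only character data is discarded; these choices are assumptions of the example.

```python
import xml.parsers.expat


def build_tree(xml_text, leaf_mark="Ø"):
    """Parse xml_text and return the labeled tree as (label, children) tuples."""
    stack = []
    result = []

    def start(name, attrs):
        node = ("<" + name, [])
        for attr_name, value in attrs.items():
            # attribute node -> text-skip node -> content leaf
            node[1].append(("@" + attr_name, [("=", [(leaf_mark + value, [])])]))
        if stack:
            stack[-1][1].append(node)
        else:
            result.append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def text(data):
        if data.strip():  # assumption: ignore whitespace-only text
            stack[-1][1].append(("=", [(leaf_mark + data, [])]))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data in one call
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(xml_text, True)
    return result[0]
```

The resulting tree can be passed to any routine that visits it in pre-order to collect the triplets of the transform.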
1. Compression of XBW(d): the XBzip Tool
For a compressed (non-searchable) representation of the XML document d, one simply needs to store the arrays Ŝlast, Ŝα, and Ŝpcdata as compactly as possible. This is done in one embodiment by the XBzip tool whose pseudocode is given in
This strategy usually offers superior performance in compression because it is able to capture repetitiveness in the tree structure.
As we observed above, the arrays Ŝ′α and Ŝpcdata are locally homogeneous since the data descending from a certain tree path is grouped together. Hence, we expect that Ŝ′α and Ŝpcdata are best compressed by splitting them into chunks according to the structure of the tree T.
2. Supporting Navigation and Search: the XBzipIndex Tool
Navigation and search operations, in addition to XBW(d), require data structures that support rank and select operations over Ŝlast and Ŝα. Rank/select data structures may be applied with theoretically efficient (often optimal) worst-case asymptotic performance. In another embodiment, practical methods can be applied. In particular, XBzipIndex and its pseudocode as shown in
For the array Ŝlast, search and navigation procedures only require rank1 and select1 operations. Thus, a simple one-level bucketing storage scheme may be employed. In an embodiment, a constant L is chosen (default is L=1000), and Ŝlast is partitioned into variable-length blocks each containing L bits set to 1. For each block the following is stored:
Rank1 and select1 operations over Ŝlast can be implemented by decompressing and scanning a single block, plus a binary search over the table of 1-blocked ranks.
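A minimal sketch of such a one-level bucketing scheme is given below for illustration. The class name BlockedBits is hypothetical, the blocks are kept uncompressed for brevity, and the per-block table stored here (the starting offset of each block) is an assumption of this example, since every full block contributes exactly L ones to the rank.

```python
from bisect import bisect_right


class BlockedBits:
    """Bit array split into variable-length blocks holding L ones each."""

    def __init__(self, bits, L=1000):
        self.L = L
        self.starts = []  # 0-based offset of the first bit of each block
        self.blocks = []  # block contents (uncompressed in this sketch)
        block, ones, start = [], 0, 0
        for pos, b in enumerate(bits):
            block.append(b)
            ones += b
            if ones == L:  # close the block right after its L-th one
                self.starts.append(start)
                self.blocks.append(block)
                block, ones, start = [], 0, pos + 1
        if block:
            self.starts.append(start)
            self.blocks.append(block)

    def rank1(self, i):
        """Number of ones in positions 1..i (1-based, inclusive)."""
        if i <= 0 or not self.blocks:
            return 0
        j = bisect_right(self.starts, i - 1) - 1  # block holding position i
        return j * self.L + sum(self.blocks[j][: i - self.starts[j]])

    def select1(self, k):
        """1-based position of the k-th one."""
        j, r = divmod(k - 1, self.L)
        if k <= 0 or j >= len(self.blocks):
            raise ValueError("k out of range")
        seen = 0
        for off, b in enumerate(self.blocks[j]):
            seen += b
            if seen == r + 1:
                return self.starts[j] + off + 1
        raise ValueError("k out of range")
```

Both queries touch a single block: select1 locates it by integer division, and rank1 locates it by a binary search over the block offsets.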
The array Ŝα contains the labels of internal nodes of T. In this embodiment, it is represented using again a one-level bucketing storage scheme: Ŝα is partitioned into fixed-length blocks (default is 8 Kb) and for each block the following is stored:
Since the number of distinct internal-node labels is usually small with respect to the document size, β-blocked ranks can be stored without adopting any sophisticated solution. The implementation of rankβ(Ŝα, i)/selectβ(Ŝα, i) derives easily from the stored information.
The array Ŝpcdata is usually the largest component of XBW(d). Ŝpcdata consists of the Pcdata items of d, ordered according to their upward paths. The procedures for navigating and searching T do not require rank/select operations over Ŝpcdata. Thus, a representation of Ŝpcdata may be used that efficiently supports XPath queries of the form //π[contains(.,γ)], where π is a fully-specified path and γ is an arbitrary string of characters. To this end, a bucketing scheme where buckets are induced by the upward paths may be used. Formally, let Sπ[i, j] be a maximal interval of equal strings in Sπ. We form one bucket of Ŝpcdata by concatenating the strings in Ŝpcdata[i, j]. In other words, two elements of Ŝpcdata are in the same bucket if and only if they have the same upward path. Every bucket will likely be highly compressible since it will be formed by homogeneous strings having the same “context”. For each bucket the following information is stored:
A counter of the number of Pcdata items preceding the current bucket in Ŝpcdata.
Using this representation of Ŝpcdata, the query //π[contains(.,γ)] can be answered as follows (see procedure ContentSearch in
In addition to the above data structures, this also requires two auxiliary tables: the first one maps node labels to their lexicographic ranks, and the second associates to each label β the value F[β]. Due to the small number of distinct internal node labels in real XML files, these tables do not need any special storage method.
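The path-induced bucketing of Ŝpcdata may be illustrated by the following Python sketch. It is not the ContentSearch procedure referenced above. The names build_buckets and content_search are hypothetical, zlib stands in for whatever per-bucket compressor or index an embodiment uses, the NUL separator is assumed not to occur in the Pcdata, and the searched path is matched against the whole upward path for simplicity.

```python
import zlib
from itertools import groupby


def build_buckets(s_pi_leaves, s_pcdata, sep="\x00"):
    """Group Pcdata items that share an upward path into compressed buckets.

    s_pi_leaves : upward paths of the leaves, in sorted order
    s_pcdata    : the corresponding Pcdata items
    """
    buckets = []
    preceding = 0
    pairs = zip(s_pi_leaves, s_pcdata)
    for pi, group in groupby(pairs, key=lambda p: p[0]):
        items = [item for _, item in group]
        blob = zlib.compress(sep.join(items).encode("utf-8"))
        # Store the counter of Pcdata items preceding this bucket.
        buckets.append({"path": pi, "preceding": preceding, "blob": blob})
        preceding += len(items)
    return buckets


def content_search(buckets, path, gamma, sep="\x00"):
    """1-based ranks in S_pcdata of items under `path` that contain gamma."""
    hits = []
    for bucket in buckets:
        if bucket["path"] != path:
            continue  # only the matching bucket is decompressed
        items = zlib.decompress(bucket["blob"]).decode("utf-8").split(sep)
        for j, item in enumerate(items, start=1):
            if gamma in item:
                hits.append(bucket["preceding"] + j)
    return hits
```

Because only the bucket for the requested path is decompressed, the cost of a query depends on the size of that bucket rather than on the size of the document.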
Example embodiments of the present invention have now been described in accordance with the above advantages. It will be appreciated that these examples are merely illustrative of the invention. Many variations and modifications will be apparent to those skilled in the art.
This application is based upon and claims the benefit of priority from the prior U.S. Provisional Application No. 60/789,582 filed on Apr. 6, 2006, the entire contents of which are incorporated herein by reference.
Number | Date | Country
---|---|---
60789582 | Apr 2006 | US