The invention relates in general to databases and more particularly to semi-structured data processing using tree automata.
A database (DB) is a collection of information organized in a structured way so that the information can easily be retrieved, managed and updated. The data in a DB is organized according to a model. There are several such models. The dominant models, such as relational models, are structured. We call a DB with a structured model a structured DB. A structured DB contains a collection of database files. Each DB file is a collection of records. A record is a set of fields. A field is a content of a certain data type: numeric, character, logic, date, etc. A DB schema is a description in a formal language of this structure. The DB schema defines the files, the fields in each file and the relationships between fields and files in the DB. A structured DB has a single known schema. The term structured DB refers to any DB model: relational, native, object oriented etc. that stores data as described in this paragraph. A relational DB is an example of a structured DB.
Query languages enable retrieval of data from a specific file or from multiple related files in a database. Retrieval requests to a DB are expressed in a query language. Each type of data model, such as relational or semi-structured, has its own set of query languages. The common standard relational query language is SQL.
Database indexes speed up data retrieval. They are much the same as book indexes, providing the database with quick jump points on where to find the full reference. The indexes are additional data structures that store references to the actual records. For example, an index can be a hash table that stores all the record references in buckets, grouped by their values in a specific field. When the user requests all the records that contain a given value in this field, the DB first retrieves the references from the hash table and then retrieves the actual records from the file one by one. This is faster than performing a full traversal (scan) of all the file records.
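The following minimal Python sketch illustrates the hash-table index described above. The record layout, the field position and the names build_index, lookup and full_scan are hypothetical and serve only as an illustration; a real DB would store references into files rather than list positions.

```python
from collections import defaultdict

# Hypothetical "file": a list of (record_id, city) records.
records = [(1, "Paris"), (2, "Rome"), (3, "Paris"), (4, "Oslo")]

def build_index(rows, field_pos):
    """Map each value of one field to the positions of the records holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[field_pos]].append(pos)
    return index

def lookup(rows, index, value):
    """Fetch the matching records through the index, without scanning the file."""
    return [rows[pos] for pos in index.get(value, [])]

def full_scan(rows, field_pos, value):
    """Reference behaviour: visit every record in the file."""
    return [row for row in rows if row[field_pos] == value]

city_index = build_index(records, 1)
```

Both lookup and full_scan return the same records; the indexed lookup touches only the bucket of the requested value.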
When data is retrieved from multiple related files, the DB join mechanism combines the records from these DB files. It creates a joined records set and returns it to the user. A join mechanism must efficiently join the records from different files according to a join criterion that relates the multiple DB files. Efficient indexing and join operations extract a minimal number of records from the DB files.
A labeled graph is a pair (G, label) of a graph G and a label function. The graph G is the pair (V, E) where V is a set of nodes and E is a set of edges that connect them. A node vp is called a parent of a node vc if there is an edge (vp, vc) that connects them. The node vc is called a child of node vp. The label function maps each node to a label.
A semi-structured model is a database model that presents data through a labeled graph. In this model, there is no separation between the data and the schema, and the amount of graph structure used depends on the usage goals.
The nodes in the semi-structured model store the data. The nodes in the graph model are equivalent to records and fields in the structured model. They are equivalent to fields because they store the data itself. They are equivalent to records because they refer to a collection of fields, namely the node children. Thus, instead of having a fixed records set, the representation is done by a graph. Instead of a single schema as in a structured DB, the DB schema of a semi-structured model describes a collection of labeled graphs which the DB can accept. This enables the flexibility of a semi-structured data query.
In order to support this flexibility, the semi-structured data query languages are richer than structured model query languages. They express two criteria types: the graph structure and the data stored in the graph nodes. The data criteria are the same as in structured DB query languages. The data criteria describe a Boolean expression of the data values. The structural criteria express relations between nodes in the graph structure. There are several structural relation types. The most common are: 1. parent-child (denoted hereinafter as P-C): if vp is a parent of node vc then (vp, vc) has a parent-child relation; 2. ancestor-descendant (denoted hereinafter as A-D): if node va is a parent of node vd or, recursively, if nodes (vc, vd) have an A-D relation, where node vc is a child of node va, then nodes (va, vd) have an ancestor-descendant relation.
Semi-structured query languages are not standardized. Recently, a “twig-pattern” was suggested as a formal representation for the structural criteria of semi-structured languages, see e.g. N. Bruno, N. Koudas, and D. Srivastava, “Holistic twig joins: optimal XML pattern matching”, Proceedings of SIGMOD, 2002 (hereinafter “BKS”). A twig pattern has a labeled-tree form. XML stands for eXtensible Markup Language. A labeled-tree is the same as labeled graph except the graph has a form of a “tree”. A tree is a graph with the following constraints: all nodes but one have a single parent. The exceptional node is called a root and has no parents. A node without children is called a leaf. The labels of a twig tree are a subset of the queried semi-structured data labels. The twig pattern also maps each edge to a nodes-relation type: A-D or P-C. The twig-pattern can express the structural portion of queries which are written in one of many semi-structured query languages. Given a twig pattern Q and semi-structured data D, a match of Q in D is identified by a mapping from nodes in Q to nodes in D, such that the P-C and A-D relations between query nodes are satisfied by the corresponding D nodes. An “answer” to twig pattern Q with n nodes can be represented as an n-ary relation where each tuple (d1, . . . , dn) consists of D nodes that identify a distinct match of Q in D.
There is a common way (see BKS) to store semi-structured data in a structured DB. Each node in the semi-structured model is considered to be a record. The record contains fields that encode the location of the node in the tree. When extracting two records from a file, the A-D and P-C relations can be determined for these records by the encoded location. Many such encodings exist. The records are either split into files according to node labels or stored in a single file and contain a field with the node label. The order of the records in the file is determined by some top-down or bottom-up traversal order: pre-order, post-order, etc. Each node-record has identification (ID) which is the order in which the traversal takes place.
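As an illustration of how an encoded location can decide the A-D and P-C relations between two records, the following Python sketch uses a (start, end, level) region encoding. This particular encoding, the NodeRecord layout and the function names are assumptions chosen for the example; as noted above, many such encodings exist.

```python
from typing import NamedTuple

class NodeRecord(NamedTuple):
    node_id: int
    label: str
    start: int   # counter value when the DFS enters the node
    end: int     # counter value when the DFS leaves the node
    level: int   # depth of the node, root = 0

def encode(tree, root):
    """tree maps node_id -> (label, [child ids]); returns node_id -> NodeRecord."""
    records = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1
        start = counter
        label, children = tree[node]
        for child in children:
            visit(child, level + 1)
        counter += 1
        records[node] = NodeRecord(node, label, start, counter, level)

    visit(root, 0)
    return records

def is_ancestor(a, d):
    """A-D relation: the region of d lies strictly inside the region of a."""
    return a.start < d.start and d.end < a.end

def is_parent(p, c):
    """P-C relation: an A-D relation between nodes on adjacent levels."""
    return is_ancestor(p, c) and c.level == p.level + 1

# Hypothetical tree: node 1 (a) has children 2 (b) and 4 (b); node 2 has child 3 (c).
tree = {1: ("a", [2, 4]), 2: ("b", [3]), 3: ("c", []), 4: ("b", [])}
recs = encode(tree, 1)
```

With this encoding, both relations are decided from the two records alone, without reconstructing the tree.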
In view of the inefficiencies in retrieving semi-structured data and getting answers to queries on such data in a database environment, there is a need for and it would be advantageous to have methods that perform such actions more efficiently.
In this invention, the twig-pattern inputs are formalized as automata. In this section we give the background that is needed to understand this formalization. We explain three concepts: regular expressions, Finite State Automata (FSA) and Tree Automata (TA).
A regular expression is an expression that describes a set of strings. Regular expressions are usually used to give a concise description of a set, without having to list all its elements. Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively. Given a finite alphabet Σ, the following constants are defined: Ø (empty set); ε (empty string), denoting a string with no characters; and a, denoting a character in the alphabet. The following operations are defined: concatenation RS, denoting the set {αβ|α in R and β in S}; alternation R|S, denoting the set union of R and S; and R*, denoting the smallest superset of R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating zero or more strings in R. For example, {“ab”, “c”}*={ε, “ab”, “c”, “abab”, “abc”, “cab”, “cc”, “ababab”, “abcab”, . . . }. Examples: a|b* denotes {ε, a, b, bb, bbb, . . . }. (a|b)* denotes the set of all strings with no symbols other than a and b, including the empty string: {ε, a, b, aa, ab, ba, bb, aaa, . . . }.
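The examples above can be checked with the regular-expression module of the Python standard library. The helper name matches is an assumption of this sketch; re.fullmatch tests whether the whole string belongs to the set denoted by the expression.

```python
import re

def matches(pattern, text):
    """True if the entire text belongs to the set denoted by the pattern."""
    return re.fullmatch(pattern, text) is not None

# a|b* : either the single character a, or zero or more b characters.
in_first = [matches("a|b*", s) for s in ("", "a", "bbb", "ab")]

# (a|b)* : every string built only from a and b, including the empty string.
in_second = [matches("(a|b)*", s) for s in ("", "abba", "abc")]

# (ab|c)* : zero or more concatenations of the strings "ab" and "c".
in_third = [matches("(ab|c)*", s) for s in ("abcab", "cc", "ac")]
```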
Finite State Automata supply an alternative way to describe a set of strings. The input to a finite state automaton is a string of input symbols. For each input symbol, the automaton performs a transition to a state given by a transition function which is designated inside the FSA. The transition updates the current state of the automaton. When the last input symbol is received then the automaton either accepts or rejects the string depending on whether the current state is an accepting or a non-accepting state of the automaton. This way, the automaton recognizes a specific collection of strings.
More formally, an FSA is a tuple (Σ, Q, q0, δ, F), where: Σ is the input alphabet (a finite, non-empty set of symbols); Q is a finite non-empty set of states; q0 is an initial state which is an element of Q; δ is the state-transition function that maps a source state and an input symbol to a target state; and F is a subset of Q containing the accepting states.
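A minimal Python sketch of such an automaton follows. The example automaton, which accepts strings over {a, b} containing an even number of a symbols, and the names run_fsa, delta, "even" and "odd" are assumptions made for illustration only.

```python
def run_fsa(delta, q0, accepting, word):
    """Feed the word symbol by symbol and report whether the final state accepts."""
    state = q0
    for symbol in word:
        state = delta[(state, symbol)]  # one transition per input symbol
    return state in accepting

# Hypothetical FSA: Q = {even, odd}, q0 = even, F = {even}.
delta = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}
```

For instance, run_fsa(delta, "even", {"even"}, "abba") accepts, because the string contains two a symbols.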
Tree Automata describe sets of trees. A bottom-up TA processes trees from the “bottom” of the tree, i.e. the tree leaves, to the “top” of the tree, i.e. the root. An input to a bottom-up TA is a labeled tree whose labels are the input symbols. The TA traverses the tree from the leaves to the root and annotates each node with a state according to a transition function. The TA makes a transition from the states annotated to the children of a node, together with the node label, to a state given by the transition function. When the root is reached, the TA either accepts or rejects the tree depending on whether the root state is an accepting or a non-accepting state. This way, the TA describes a specific set of trees.
More formally, a bottom-up finite tree automaton over a finite alphabet Σ is defined by the tuple (Q, Σ, F, δ), where Q is a set of states, Σ is a finite set of input symbols, F is a subset of Q containing the final states, and δ is a set of transition rules, which are rewrite rules from a string composed of the children states to a parent state. Thus, the state of a node is deduced from the states of its children. There is no initial state, but the transition rules for constant symbols (leaves) can be considered as initial states. The tree is accepted if the state at the root is an accepting state.
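The following Python sketch shows a deterministic bottom-up run on ranked trees. The example automaton, which evaluates Boolean expression trees with labels "and", "or", "0" and "1", and the names run_ta and accepts are assumptions chosen for illustration; trees are written as (label, [children]) pairs.

```python
def run_ta(delta, tree):
    """Return the state annotated to the root of the tree by a bottom-up run."""
    label, children = tree
    # The state of a node is deduced from the states of its children.
    child_states = tuple(run_ta(delta, child) for child in children)
    return delta[(label, child_states)]

def accepts(delta, final, tree):
    """The tree is accepted if the state at the root is a final state."""
    return run_ta(delta, tree) in final

# Leaf rules act as the "initial states" of the automaton.
delta = {("0", ()): "q0", ("1", ()): "q1"}
for x in ("q0", "q1"):
    for y in ("q0", "q1"):
        delta[("and", (x, y))] = "q1" if x == y == "q1" else "q0"
        delta[("or", (x, y))] = "q1" if "q1" in (x, y) else "q0"

final = {"q1"}
true_tree = ("or", [("and", [("1", []), ("0", [])]), ("1", [])])
false_tree = ("and", [("1", []), ("0", [])])
```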
There are two types of bottom-up TA: TA for “ranked trees” and TA for “unranked trees”. The difference is in the transition function. In a ranked tree, each parent has a fixed number of children. Ranked tree transitions are therefore defined over a fixed-length sequence of children states, one for each child. A node in an unranked tree can have any number of children. Therefore, an unranked TA transition must express a varying number of states. In order to express this condition, the transition is extended by regular expressions. The children of a transition are described by a regular expression over the automaton states. The transition is made if a string, which is composed of reachable children states, is accepted by the transition regular expression. A run is a mapping from tree vertices to their annotated states.
The invention discloses techniques that speed up retrieval of semi-structured data from a database. To achieve this, the invention uses two fundamental structured DB operations: indexing and join. In this description, “joining” means performing a join operation. The invention provides methods to efficiently perform indexing and join for semi-structured models. The main advantage of a semi-structured model is its flexible format for data exchange between different types of DBs. The primary trade-off in using a semi-structured model is that queries cannot be answered as efficiently as in a structured DB. The invention helps to eliminate this trade-off by speeding up query processing.
Answers to queries in a database environment are obtained very efficiently by processing twig-patterns and by performing holistic indexing and/or holistic join operations on the semi-structured data based on unordered twig-patterns. The queries may be received from any client or application, for example from a computer program, a database client, a web browser, a web service, etc. The semi-structured data may be stored in any known type of database, for example a relational DB, a native DB, a distributed DB, etc. The answers are returned to the client or application which submitted the query.
The pattern processing is performed over semi-structured data modeled as a tree (i.e. “tree-structured data”). XML is an example of such tree-structured data. In order to utilize the advantages of a structured DB and of a semi-structured DB, we store semi-structured data in a structured DB. In this way, the data retrieval according to a data criterion is efficient because the data is structured, and the format is flexible because it has a semi-structured model. What is missing in known methods is the ability to efficiently retrieve data according to structural criteria. This ability is provided by the invention and enables semi-structured queries to be processed efficiently from a structured DB.
The semi-structured data schema and query in this invention are modeled by tree-automata. The invention presents the first application of TA to indexing and join operations on semi-structured data stored in a structured DB. The invention models all the components of the semi-structured data, i.e. twig-pattern, data and schema, as tree automata. The TA processes ordered trees.
A new version of a bottom-up unranked TA is developed for unordered trees. Unordered trees are trees in which the order of a node's children has no meaning. We call these automata Unordered Unranked TA (denoted hereinafter as UUTA). Hereinafter the general TA term refers to a UUTA.
The join and indexing operations share a common primitive operation which is a selection of node records from the semi-structured data. These node records match nodes in the twig-pattern. We use a holistic selection operation on trees, which selects data nodes that match nodes in a twig pattern only if these selected data nodes are part of a whole twig answer. The holistic selection means that all the twig-pattern constraints are checked in the same processing phase. The holistic selection is different from first selecting nodes in each P-C or A-D relation and then joining the results. The holistic selection is also different from selecting nodes in each separate path of the twig-pattern and then joining the partial answers.
The known TA have the ability to perform a holistic selection operation on trees by using Selecting Tree Automaton (STA), see e.g. C. Koch “Efficient Processing of Expressive Node-Selecting Queries on XML Data in Secondary Storage: A Tree Automata-based Approach”. VLDB pages 249-260, 2003. A STA is a tree automaton which is extended with a set of selecting states. The selecting states tell the tree automaton which tree nodes to select. Nodes annotated by selecting states in the automaton run are selected and returned as output. The performing of a holistic selection operation on trees using an STA is referred to hereinafter as STA(T).
In a DB context, the semi-structured data tree structure can be too large to be modeled by the machine internal memory. Therefore we use TA as a compact description of the tree structure. In this invention, we develop a new holistic select operation that selects states in a TA. The data nodes derived by the TA states comprise a complete twig-pattern answer in one of the trees that the TA describes. The new holistic operation uses a STA. The performing of a holistic selection operation on a tree automaton A using a STA is referred to hereinafter as STA(A).
The holistic selection checks the twig-pattern constraints on the TA instead of on the tree. This operation enables an accurate extraction of records from the DB files. This accuracy in record extraction is what makes the operation extremely efficient.
According to the invention, there is provided a computer implemented method for obtaining answers to queries in a database environment, comprising the steps of: forming tree automata; processing semi-structured data stored in a database, the processing based on the TA, to provide indexed data; pruning the indexed data to obtain pruned data; and joining either the pruned data or the semi-structured data to provide the answers to the queries.
According to the invention, there is provided a computer implemented method for obtaining answers to queries in a database environment, comprising the steps of forming tree automata and, using the TA, joining semi-structured data stored in a database to provide the answers to the queries.
The invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:
The main steps of the twig pattern processing are shown in the flow chart in
An index is constructed and operated in step 105. The construction and operation include processing semi-structured data using the TA to provide indexed data and pruning the indexed data to obtain pruned data. In steps 115 and 120, the TA is used to join either the pruned data (step 115) or the input semi-structured data (step 120) in order to provide answers for the semi-structured queries. Step 110 checks if the join receives the pruned data as input. When the join operates without the index, it receives the semi-structured data as input.
The flow chart in
The indexing operation (steps 210-220) has two phases: offline and online. The offline phase (step 220) constructs the index as follows: it runs a schema-TA on the semi-structured data and maps data nodes to schema-automaton states that annotate the semi-structured data. This mapping is the index. The offline phase is done once. The online phase (steps 210 and 215) prunes the data according to the twig-TA. The online phase performs a holistic selection operation on the schema TA (step 210) and selects schema TA states which derive data nodes that match the twig-pattern. Then, step 215 prunes the indexed data nodes according to the selected schema TA states. Step 215 outputs to DB files the semi-structured data nodes mapped by the index to the selected schema TA states.
The join operation (steps 225-245) iteratively extracts a fixed number of data nodes from the pruned DB files (step 225). Then, the join operation constructs a prediction-automaton from this fixed number of extracted nodes (step 230). This prediction-automaton predicts the entire tree structure of the data. Next, the join performs the holistic select operation and selects prediction-automaton states that derive semi-structured data nodes which compose the twig-pattern (step 235). Steps 210 and 235 perform the same STA(A) operation. Finally, the join operation outputs paths of data nodes which were derived by the selected prediction-automaton states (step 240). The output paths are partial answers in the twig-pattern. The join operation sorts the paths and joins them into answers (step 245). The answers are a set of tuples. Each tuple (d1, . . . , dn) consists of database node records that identify a distinct match of a twig-pattern in the semi-structured data.
Each of the steps in
Schemas of semi-structured data, which have a tree data model, can be represented as unranked TA (see M. Murata, D. Lee, M. Mani: “Taxonomy of XML Schema Languages Using Formal Language Theory”, Extreme Markup Languages, pages 153-166, 2001). The existing TA process ordered trees. Herein, we suggest a new version of unranked TA for unordered trees. Unordered trees are trees in which the order of a node's children has no meaning. We call this version “bottom-up UUTA”. Hereinafter, “TA” is used to mean “UUTA”. The UUTA structure resembles a TA with the structure (Q, Σ, F, δ), but the transition function is different. Each transition is from a set of children states and a parent label to a parent state. The UUTA run generates transitions for reachable children states, one per child, to a parent state. Although the children state sets of the transitions are finite, the processed tree is unranked because all the children with the same state contribute a single state to the transition.
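The UUTA run described above can be sketched in Python as follows. The representation of a transition as a (children state set, label, parent state) triple, the small example automaton and the name reachable_states are assumptions of this sketch. Because several transitions may apply to the same node, the sketch collects every state that a node can reach.

```python
from itertools import product

def reachable_states(transitions, tree):
    """Return the set of states that a bottom-up UUTA run can annotate to the root."""
    label, children = tree
    child_sets = [reachable_states(transitions, child) for child in children]
    states = set()
    for child_states, t_label, parent_state in transitions:
        if t_label != label:
            continue
        # Pick one reachable state per child; children with the same state
        # collapse into a single element of the (unordered) set.
        for choice in product(*child_sets):
            if frozenset(choice) == child_states:
                states.add(parent_state)
                break
    return states

# Hypothetical UUTA transitions: (set of children states, label, parent state).
transitions = [
    (frozenset(), "c", "q3"),
    (frozenset(), "c", "q5"),
    (frozenset(), "d", "q6"),
    (frozenset({"q3"}), "b", "q2"),
    (frozenset({"q5", "q6"}), "b", "q4"),
    (frozenset({"q2", "q4"}), "a", "q1"),
]

leaf_c, leaf_d = ("c", []), ("d", [])
tree = ("a", [("b", [leaf_c, leaf_c]), ("b", [leaf_d, leaf_c])])
```

In this example the first b node has two c children that both contribute the state q3, so the transition with the single-element set {q3} applies regardless of the number of children. The order of the children of the second b node is irrelevant.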
The input schema is represented as an unranked TA. We construct here a UUTA that recognizes, without considering the order, the same trees as the input unranked TA. The UUTA construction uses the same states while the transitions are changed. The algorithm constructs from each unranked TA transition a set of UUTA transitions. Each transition contains a different set of children states that can be expressed by the regular expression of the unranked TA. The following pseudo code, called “Construct-UUTA function”, describes this construction:
The Construct-UUTA function calls the Construct recursive function. The Construct function constructs the sets of states from regular sub-expressions recursively according to the operator type. For simplicity, we assume that each regular expression operator operates on at most two sub-expressions. We denote these sub-expressions as Left (L) and Right (R).
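As an illustration only, the recursion can be sketched in Python as shown below. The tuple-based expression format, the operator names and the rule used for the star operator are assumptions of this sketch: the star rule returns the sets of its operand plus the empty set, which reproduces the transitions listed in the example that follows, whereas a union-closed variant would also return combined sets such as {qa, qb}.

```python
def construct(expr):
    """Return the family of children state sets described by a regular expression.

    expr is a nested tuple: ("state", q), ("alt", L, R), ("cat", L, R) or ("star", L).
    """
    op = expr[0]
    if op == "state":
        return {frozenset([expr[1]])}
    if op == "alt":   # L|R: a set from either side
        return construct(expr[1]) | construct(expr[2])
    if op == "cat":   # L&R: one set from each side, merged
        return {l | r for l in construct(expr[1]) for r in construct(expr[2])}
    if op == "star":  # assumed rule: the operand's sets, or no state at all
        return construct(expr[1]) | {frozenset()}
    raise ValueError(f"unknown operator: {op}")

def construct_uuta(ta_transitions):
    """Turn (expression, label, parent) rules into (state set, label, parent) rules."""
    return {
        (child_set, label, parent)
        for expr, label, parent in ta_transitions
        for child_set in construct(expr)
    }

# The expression (qa|qb)*&qc written as a nested tuple.
expr = ("cat", ("star", ("alt", ("state", "qa"), ("state", "qb"))), ("state", "qc"))
uuta = construct_uuta([(expr, "a", "qd")])
```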
Example of the operation of the Construct function described above: for the TA transition δ(‘(qa|qb)*&qc’,a)=qd, we build the following UUTA transitions: δ({qa,qc},a)=qd, δ({qb,qc},a)=qd and δ({qc},a)=qd. The following table shows how the sets of states are constructed from the regular expression ‘(qa|qb)*&qc’ of the transition. Each row describes one recursion step: the operator type, the input state sets of the left and right sub-expressions, and the returned set of states.
A twig query is defined as (T, label, type) where T is the tree T=(V,E), the label function maps each node to a label in Σtwig and the type function maps each edge to its nodes-relation type. The Twig UUTA is the tuple (Qtwig, Σtwig, Ftwig, δtwig). Each node v is mapped into two states: qv, and quv, in Qtwig. The root of the twig pattern has no label. We denote it with the special label ⊥. The final state is a root state denoted q⊥.
The Twig TA Construction algorithm iterates over the nodes and edges of the twig tree as follows:
1) For each node v, it determines the subset of children Vdescendant that are connected to v in an A-D relation. Each subset V′descendant⊆Vdescendant contributes a transition δtwig(S,label(v))=qv, where each child u of v contributes one state to S: the state qu if u is connected to v in a P-C relation or if u∈V′descendant, and the state quu otherwise. These transitions guarantee that if all the children and descendants states are reached then the parent state is reached. We construct a transition for each subset V′descendant⊆Vdescendant because a child is also a descendant.
2) For each edge (v,u) with an A-D relation and for each label a∈Σtwig, two additional transitions are added: δtwig({qu},a)=quu and δtwig({quu},a)=quu. These rules ensure that a parent node will accept its descendants.
δtwig({ },d)=qd, δtwig({ },c)=qc, δtwig({qc,qd},b)=qb, δtwig({quc,qd},b)=qb, δtwig({qb},⊥)=q⊥, δtwig({qub},⊥)=q⊥.
For each a∈Σtwig, the following transitions are constructed from the twig pattern edges:
δtwig({qc},a)=quc, δtwig({quc},a)=quc, δtwig({qb},a)=qub, δtwig({qub},a)=qub.
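The two construction steps can be sketched in Python as follows. The sketch assumes a twig whose root ⊥ has an A-D child b, which in turn has an A-D child c and a P-C child d; this shape is inferred from the transitions listed above and is not stated explicitly. The function name, the edge-type tags "AD" and "PC" and the use of labels as node names are likewise assumptions.

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of the given items."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)

def build_twig_transitions(labels, edges, alphabet):
    """labels: node -> label; edges: (parent, child) -> "PC" or "AD"."""
    delta = set()
    # Step 1: one transition per subset of the A-D children of each node.
    for v, label in labels.items():
        kids = [(u, kind) for (p, u), kind in edges.items() if p == v]
        ad_kids = [u for u, kind in kids if kind == "AD"]
        for direct in subsets(ad_kids):
            child_states = frozenset(
                f"q{u}" if kind == "PC" or u in direct else f"qu{u}"
                for u, kind in kids
            )
            delta.add((child_states, label, f"q{v}"))
    # Step 2: propagate a matched descendant upwards through any label.
    for (_v, u), kind in edges.items():
        if kind == "AD":
            for a in alphabet:
                delta.add((frozenset([f"q{u}"]), a, f"qu{u}"))
                delta.add((frozenset([f"qu{u}"]), a, f"qu{u}"))
    return delta

labels = {"⊥": "⊥", "b": "b", "c": "c", "d": "d"}
edges = {("⊥", "b"): "AD", ("b", "c"): "AD", ("b", "d"): "PC"}
delta_twig = build_twig_transitions(labels, edges, {"b", "c", "d"})
```

Under these assumptions the sketch yields the six node transitions and, for each of the three labels, the four edge transitions listed above.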
The offline operation in step 220 is described in further detail in steps 405 and 415. The offline operation maps the schema-TA states to semi-structured data node records. A record of a node v is mapped to a schema-TA state q if q is assigned to v in a bottom-up run. However, this condition is not sufficiently accurate, because node v could be reached by more than one state. A state identifies the sub-tree rooted in v. It does not consider which TA states are reached by nodes in a top-down path from the root to node v.
Step 405 represents the STA (T) operation that maps the data nodes to relevant schema-TA states according to bottom-up and top-down considerations. The TA can only decide if a tree T is either accepted or not accepted. To be able to select nodes in a tree, we extend the TA by an additional mechanism for selecting nodes. STA becomes STA=(A, S) where A is a tree automaton, which defines the processed trees and S is a set of selecting states of A. The STA (T) maps the nodes in T to states in S. A node v in T is mapped to state q if and only if there is an accepting run of A on tree T in which a state of vertex v is one of the selecting states in S.
Step 415 represents the inverse mapping of the selected nodes. The result is an index which maps from schema-TA states to relevant data vertices. The online part receives the index, the schema-TA and the twig-TA as inputs.
The online operation in step 210 is described in further detail in steps 425 and 445. It selects schema-TA states that can reach tree nodes which are matched by the twig pattern that defines the twig-TA. This operation is called STA(A) (step 425). In order for the STA(A) to work, we convert the schema-TA to accept sub-trees as described in step 445. The twig-TA accepts sub-trees in the semi-structured data. The schema-TA accepts the complete trees of the semi-structured data. Step 445 ensures that both the schema-TA and the twig-TA process the same collection of sub-trees. The output of the STA(A) is a mapping from the selected schema-TA states to the twig-TA states that recognize the matched twig-pattern nodes. The merge operation in step 435 receives this mapping and returns the vertices that are mapped to the selected schema-TA states in the index. Step 435 is the same as step 215. The output of the indexing is the pruned data.
The index operation performs a holistic selection operation. It selects schema-TA states only if they can express nodes that are matched by the whole twig-pattern. All the other existing indexing techniques, such as R. Goldman and J. Widom, “Dataguides: Enabling query formulation and optimization in semistructured databases”, in Twenty-Third International Conference on Very Large Data Bases, pages 436-445, 1997 (hereinafter GW), are not holistic. They cluster nodes according to the labels of the nodes on the path from the root. These indexes select the clusters according to labels in the paths of nodes in the twig pattern. The selection according to separate paths is less accurate and thus less efficient, because it extracts more records from the DB files.
The inverse operation in step 415 is described next in more detail. The input to the inverse operation is the selected nodes map from step 405. The input maps the selected nodes records of the semi-structured data to the selecting states of the schema-TA that annotated these nodes. The inverse operation maps the selecting states of the schema-TA to the selected nodes records of the semi-structured data.
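A minimal Python sketch of this inversion is given below. The dictionary-based representation and the name invert are assumptions; the sample input uses node IDs 6, 7 and 8 and states q4t, q5t and q6t only as illustrative values.

```python
from collections import defaultdict

def invert(selected):
    """Turn a map node id -> selecting state into a map state -> list of node ids."""
    index = defaultdict(list)
    for node_id, state in selected.items():
        index[state].append(node_id)
    # Keep the node ids of each state in traversal (ID) order.
    return {state: sorted(ids) for state, ids in index.items()}

# Illustrative selected-nodes map: node record id -> schema-TA state.
selected_nodes = {6: "q4t", 7: "q5t", 8: "q6t", 5: "q5t"}
index = invert(selected_nodes)
```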
Example: the top-down run in
The construction of sub-trees TA in step 445 is described next. The Construct sub-trees TA operation receives as an input a UUTA A which recognizes a collection of trees and constructs a new UUTA Asub-trees that recognizes sub-trees of trees recognized by A. In the indexing context, we use this operation to transform the schema-TA, which is given in step 460 and recognizes semi-structured data trees, to schema sub-trees TA in step 450. The twig-TA, which is given in step 470, also recognizes the collection of sub-trees in the semi-structured data. After the transformation is completed, we operate on both collections of sub-trees. The pseudo-code that describes the construction of the sub-trees is given next:
Example: we get as input the schema-TA from the example and construct sub-tree TA (Qs.t., Σs.t., Fs.t., δs.t.) where Qs.t.={q1t, q2t, q3t, q4t, q5t, q6t, q7t, q8t}, Σs.t.={a, b, c, d}, Fs.t.={q1t} and δs.t. contains the following transitions: δs.t.({ },c)=q3t, δs.t.({ },c)=q5t, δs.t.({ },d)=q6t, δs.t.({ },d)=q8t, δs.t.({q3t},b)=q2t, δs.t.({ },b)=q2t, δs.t.({q5t,q6t},b)=q4t, δs.t.({q5t},b)=q4t, δs.t.({q6t},b)=q4t, δs.t.({ },b)=q4t, δs.t.({q8t},b)=q7t, δs.t.({ },b)=q7t, δs.t.({q2t,q4t,q7t},a)=q1t, δs.t.({q4t,q7t},a)=q1t, δs.t.({q2t,q7t},a)=q1t, δs.t.({q2t,q4t},a)=q1t, δs.t.({q2t},a)=q1t, δs.t.({q7t},a)=q1t, δs.t.({q4t},a)=q1t, δs.t.({ },a)=q1t.
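The transitions in this example are consistent with a construction that adds, for every transition of the input automaton, one transition for each subset of its children states. The following Python sketch implements that reading; the rule is inferred from the example, and the function name and data layout are assumptions.

```python
from itertools import combinations

def construct_subtree_ta(transitions):
    """For each (children set, label, parent) rule, emit a rule for every subset."""
    result = set()
    for child_states, label, parent in transitions:
        members = sorted(child_states)
        for size in range(len(members) + 1):
            for combo in combinations(members, size):
                result.add((frozenset(combo), label, parent))
    return result

# Schema-TA transitions as (set of children states, label, parent state).
schema_delta = [
    (frozenset(), "c", "q3t"),
    (frozenset(), "c", "q5t"),
    (frozenset(), "d", "q6t"),
    (frozenset(), "d", "q8t"),
    (frozenset({"q3t"}), "b", "q2t"),
    (frozenset({"q5t", "q6t"}), "b", "q4t"),
    (frozenset({"q8t"}), "b", "q7t"),
    (frozenset({"q2t", "q4t", "q7t"}), "a", "q1t"),
]
subtree_delta = construct_subtree_ta(schema_delta)
```

Applied to the eight schema-TA transitions, the sketch produces the twenty transitions of δs.t. listed above.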
The STA(A) operation extends the STA(T) operation. The STA(T) selects a tree node v in the tree T if there is an accepting run of the selecting TA that maps v to a selecting state. The STA(T) mechanism is extended to operate on a collection of trees that is accepted by a TA. The extended operation STA(A) maps a state qtree of the tree-TA to a selecting state qselecting of the selecting-TA if there is a tree T, which is accepted by both automata, and a node v that is mapped to qtree in a run of the tree-TA and to qselecting in a run of the selecting-TA.
The merge operation of step 435 is described next in more detail. The inputs to the merge operation are the index from the offline phase and the selected-states map from the online phase. The index maps schema-TA states to node records of the semi-structured data. The selected-states map maps selected schema-TA states to twig-TA states. The merge operation iterates over the selected schema-TA states, which are the keys of the selected-states map. For each selected schema-TA state, it extracts its mapped node records from the index and inserts them into a new DB file. The rest of the records are filtered out of the file. The pruned node records have to be reordered in a traversal order according to the order in the semi-structured DB file.
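A minimal Python sketch of the merge operation follows. The in-memory dictionaries, the list that stands in for the DB file and the sample selected-states map are assumptions made for illustration.

```python
def merge(index, selected_states, file_order):
    """Return the pruned node ids, kept in the order of the original DB file.

    index: schema-TA state -> list of node ids
    selected_states: selected schema-TA state -> twig-TA state
    file_order: node ids in the traversal order of the DB file
    """
    keep = set()
    for state in selected_states:  # iterate over the keys of the map
        keep.update(index.get(state, []))
    # Filtering the file order keeps the pruned records in traversal order.
    return [node_id for node_id in file_order if node_id in keep]

index = {"q4t": [6], "q5t": [7], "q6t": [8]}
selected = {"q6t": "qd", "q4t": "qb"}   # hypothetical output of the online phase
pruned = merge(index, selected, list(range(1, 13)))
```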
In the description of the inverse processing (step 415), we gave an example from the output of the offline phase. It constructed an index that contains the mappings: index[q4t]=6, index[q5t]=7, index[q6t]=8. The output of the example in
The FSA construction in step 505 is described next. The FSA recognizes the execution of a TA on a collection of all the trees. A TA state qp exists in the FSA if there is a tree T and a node vp in tree T such that the run of the TA on T maps vp into qp. There is an edge (qp,qc) if there is a tree T and nodes vp, vc in tree T such that the run of the TA on T maps vp into qp and vc into qc, and vp and vc have a P-C relation. The following pseudo-code describes the FSA construction. In each cycle, the algorithm checks if transitions from the existing children states to a new parent state exist.
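As an illustration, the cycle-based construction can be sketched in Python as a fixed-point computation. The sketch is a simplification under stated assumptions: it ignores the labels of the transitions, treats a state as existing once all the children states of one of its transitions exist, and returns the FSA as a pair of state and edge sets. The function name and data layout are hypothetical.

```python
def construct_fsa(transitions):
    """Return (states, edges) where an edge (qp, qc) links a parent state to a child state."""
    states, edges = set(), set()
    changed = True
    while changed:  # one cycle per pass over the transitions
        changed = False
        for child_states, _label, parent in transitions:
            if child_states <= states:  # all children states already exist
                new_edges = {(parent, child) for child in child_states}
                if parent not in states or not new_edges <= edges:
                    states.add(parent)
                    edges |= new_edges
                    changed = True
    return states, edges

# Hypothetical input: transitions as (set of children states, label, parent state).
schema_delta = [
    (frozenset(), "c", "q3t"),
    (frozenset(), "c", "q5t"),
    (frozenset(), "d", "q6t"),
    (frozenset(), "d", "q8t"),
    (frozenset({"q3t"}), "b", "q2t"),
    (frozenset({"q5t", "q6t"}), "b", "q4t"),
    (frozenset({"q8t"}), "b", "q7t"),
    (frozenset({"q2t", "q4t", "q7t"}), "a", "q1t"),
]
fsa_states, fsa_edges = construct_fsa(schema_delta)
```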
Example how the Construct-FSA constructs FSA from the UUTA: we get as input the schema-TA (Qt, Σt, Ft, δt) where Qt={q1t, q2t, q3t, q4t, q5t, q6t, q7t, q8t}, Σt={a, b, c, d} Ft={q1t} and δt contains the following transitions: δt({ },c)=q3t, δt({ },c)=q5t, δt({ },d)=q6t, δt({ },d)=q8t, δt({q3t},b)=q2t, δt({q5t,q6t},b)=q4t, δt({q8t},b)=q7t, δt({q2t,q4t,q7t},a)=q1t. The example in
The bottom-up traverse in step 515 is described next. The input to the bottom-up traverse is the semi-structured data tree and the selecting-TA. The semi-structured data is stored in a structured DB. We do not need to reconstruct the tree in order to traverse it. Instead, we use a stack (see CLRS) to store the node records during the tree traversal. The IDs of the node records in the data file may be ordered in any top-down or bottom-up traversal order, using for example a DFS (see CLRS) traversal. The traversal in the algorithm is bottom-up. If the record order in the file is top-down, as in this example, we read the records in the file in reverse, from the end to the start. A reverse-DFS orders the nodes in a bottom-up order. The algorithm is described in the following pseudo-code:
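A minimal Python sketch of such a stack-based traversal is given below as an illustration under stated assumptions; it is not the pseudo-code referred to above. It assumes that each record carries its depth in the tree in addition to its ID and label, that the records are stored in top-down DFS order, and that the automaton is given as (children state set, label, parent state) triples. The function name, the record layout and the sample data are hypothetical.

```python
from itertools import product

def bottom_up_run(records, transitions):
    """Map each node id to the set of states reachable in a bottom-up run.

    records: (node_id, label, depth) tuples in top-down DFS order.
    """
    run = {}
    stack = []  # (node_id, depth) of processed nodes whose parent was not read yet
    for node_id, label, depth in reversed(records):
        # The children of the current node are on top of the stack.
        child_sets = []
        while stack and stack[-1][1] == depth + 1:
            child_id, _ = stack.pop()
            child_sets.append(run[child_id])
        states = set()
        for child_states, t_label, parent in transitions:
            if t_label == label and any(
                frozenset(choice) == child_states for choice in product(*child_sets)
            ):
                states.add(parent)
        run[node_id] = states
        stack.append((node_id, depth))
    return run

# Hypothetical data: a(b(c), b(c, d)) stored as records in DFS order.
records = [(1, "a", 0), (2, "b", 1), (3, "c", 2), (4, "b", 1), (5, "c", 2), (6, "d", 2)]
transitions = [
    (frozenset(), "c", "q3t"),
    (frozenset(), "c", "q5t"),
    (frozenset(), "d", "q6t"),
    (frozenset({"q3t"}), "b", "q2t"),
    (frozenset({"q5t", "q6t"}), "b", "q4t"),
    (frozenset({"q2t", "q4t"}), "a", "q1t"),
]
run = bottom_up_run(records, transitions)
```

Reading the file in reverse guarantees that all the descendants of a node are processed before the node itself, so the tree never has to be rebuilt in memory.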
In order to give a bottom-up traversal example, we first give an example for a semi-structured tree data stored in DB files. An example for semi-structured data is given in
Each node is denoted by a circle. The label of the circle of node v is in the format ‘v; label (v)’. The edges are denoted by arrows. The node IDs in the figures are the DFS (see CLRS) traversal order of the tree nodes. The DB file in this example contains the following records: (1,a), (2,b) (3,c), (4,b), (5,c), (6,b), (7,c), (8,d), (9,b), (10,d), (11,b), (12,d). The DB can also split the nodes into files according to their labels. In this example Filea contains node 1. Fileb contains nodes 2, 4, 6, 9, 11. Filec, contains nodes 3, 5, 7. Filed contains nodes 8, 10, 12.
The following table describes part of the bottom-up run. The nodes are extracted in reverse order. Each row in the table corresponds to one iteration. In each row, the table stores the node r that is extracted in this iteration, the node-records Stack at the end of the iteration and the states that were added to run[r].
The inputs to the top-down traverse are the run of step 515, the semi-structured data, the selecting FSA, which was constructed in the offline phase, and the set of selecting states. In the indexing context, the selecting TA is the schema TA. We need to map every node to a state and therefore we select all states and Sin=Qin. In order for the index to become minimal, we use a STA that selects a single state for each node in the tree. Below is the pseudo-code that describes the top-down traverse:
The STA(A) algorithm resembles the STA(T) algorithm in step 405 with some differences. The offline phase is the same as the offline phase in the STA(T) algorithm. The offline phase in step 1020 constructs a selecting FSA from the selecting TA. It is done in the same way as done by STA(T) in step 505. The online phase in steps 1025-1050 is different. Instead of traversing a tree, the algorithm intersects in step 1025 the selecting TA and the tree TA. The output from this intersection is called an intersected TA. The STA(A) algorithm constructs in step 1040 the intersected FSA from the intersected TA. The construction process is the same as the selecting FSA construction in steps 1020 and 505. The top-down traversal (step 1050) uses the selecting FSA to traverse the intersected FSA states. Each FSA state contains two components: selecting TA state and tree TA state. The top-down traversal selects states of the intersected FSA that contain a selecting state. The output is a mapping of tree-TA states to selecting TA states.
The intersection between two UUTAs A1=(Q1, Σ, F1, δ1) and A2=(Q2, Σ, F2, δ2) is the automaton A1∩A2=(Q1×Q2, Σ, F1×F2, δ1×2), where δ1×2((S1,S2),a)=(q1,q2) only if δ1(S1,a)=q1 and δ2(S2,a)=q2.
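The product transition can be sketched in Python as follows. This is a minimal sketch under assumptions: each transition function is represented as a dictionary from (children states, label) to a set of target states, a set of pair states is projected onto its two components to obtain S1 and S2, and the names product_delta, delta_twig and delta_tree and their contents are hypothetical.

```python
def product_delta(delta1, delta2, pair_states, label):
    """Return the pair states reached by the product automaton.

    delta1, delta2: dict mapping (frozenset_of_child_states, label) to a set
    of target states (representation assumed for this sketch).
    pair_states: iterable of (q1, q2) child states of the product automaton.
    """
    pair_states = list(pair_states)
    s1 = frozenset(q1 for q1, _ in pair_states)  # projection onto A1
    s2 = frozenset(q2 for _, q2 in pair_states)  # projection onto A2
    targets1 = delta1.get((s1, label), set())
    targets2 = delta2.get((s2, label), set())
    # A pair (q1, q2) is reached only if both components have a transition.
    return {(q1, q2) for q1 in targets1 for q2 in targets2}


# Hypothetical toy automata, for illustration only.
delta_twig = {
    (frozenset(), "c"): {"s_c"},
    (frozenset({"s_c"}), "b"): {"s_b"},
}
delta_tree = {
    (frozenset(), "c"): {"t3", "t5"},
    (frozenset({"t3"}), "b"): {"t2"},
}

leaf_pairs = product_delta(delta_twig, delta_tree, [], "c")
parent_pairs = product_delta(delta_twig, delta_tree, [("s_c", "t3")], "b")
no_pairs = product_delta(delta_twig, delta_tree, [("s_c", "t5")], "b")
```

The accepting states of the product are the pairs in F1×F2, which in this representation is a plain Cartesian product of the two accepting sets.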
Exemplarily, the Selecting FSA is constructed from the twig-TA of the pattern in
The intersected FSA is constructed from the intersection between the selecting TA input, which is the twig TA, and the tree TA input, which is the schema sub-trees TA. An FSA has a single start state. We therefore add to the tree-TA a new final state, which replaces the original final states. The new final state has a transition with label ⊥ from each accepting state of the original TA. The intersected FSA is illustrated in
Dead states are states that cannot be reached from the start state. We denote the dead states by dotted lines. The top-down traverse in step 1050 does not select these states because they are not reachable.
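Dead states can be found with a standard reachability search from the start state. The sketch below assumes that the FSA is given as a set of states and a set of directed (source, target) edges. The function name dead_states and the toy FSA are hypothetical and serve only as an illustration.

```python
from collections import deque


def dead_states(states, edges, start):
    """Return the states that cannot be reached from `start`."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
    reached = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the FSA edges
        state = queue.popleft()
        for nxt in successors.get(state, ()):
            if nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return set(states) - reached


# Hypothetical FSA: state "x" has an outgoing edge but no incoming path.
toy_states = {"start", "p", "q", "x"}
toy_edges = {("start", "p"), ("p", "q"), ("x", "q")}
toy_dead = dead_states(toy_states, toy_edges, "start")
```

A traversal that starts at the start state never visits the returned states, which is why the top-down traverse does not select them.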
The top-down traverse process of step 1050 is described next. The inputs to the top-down traverse are the intersected FSA and the selecting FSA, which is constructed in the offline phase. A selecting twig state qv is a twig state that was constructed from twig-pattern node v. The qv state identifies the nodes that match twig-pattern node v. The top-down traverse uses a recursive function to traverse the intersected FSA. The parameter qtreep,qselectingp is the current node which is traversed in the intersected FSA. The function is called with the root node q0
In the example (
The join operation in steps 115 and 120, which is also described in steps 225-245, is now described in more detail. In the join operation, we utilize the same STA(A) mechanism used for the indexing operation. The algorithm iteratively models parts of the data as a tree-TA. It also models the twig pattern as a twig-TA and then uses the STA(A) mechanism to select node records that partially match the twig-pattern. Algorithms that process a twig-pattern in a holistic way are known, for example N. Bruno, N. Koudas, and D. Srivastava, “Holistic twig joins: optimal XML pattern matching”, Proceedings of SIGMOD, pages 310-321, 2002. However, they do the processing heuristically and therefore they do not process the exact twig pattern. The semi-structured DB files are inputs to the join operation. Each label has a different DB file. A file of records with label a is denoted by filea. When a twig-pattern is processed, the join mechanism is used to join the node records from multiple files filea, where a ∈ Σtwig.
The actions in step 1315 are described next. The algorithm iteratively traverses the semi-structured data. In each iteration, the algorithm extracts (step 1615) a finite number of nodes from the DB files. Step 1615 is the same as step 225. Step 1620 checks whether node records were extracted from the DB files. If new nodes were extracted, then the algorithm constructs a prediction automaton from the extracted nodes. This construction, which is given in step 230, is detailed in steps 1630-1640. In step 1630, a prediction-tree (TPrediction) is formed from the current extracted nodes of step 1620. The prediction-tree predicts the tree structure of the entire data. The prediction-tree structure reflects the structure for the current extracted nodes and for the nodes which have not yet been extracted from the DB files. These nodes are called future-nodes. We denote by future position the minimal start position, over filea for all a ∈ Σtwig, of the node records which have not yet been extracted by the algorithm. The future position advances.
The prediction-tree receives its name because it predicts the structure of the future-nodes from the currently extracted nodes. The prediction-tree has two vertex types: real-vertices (VReal) and virtual-vertices (Vvirtual). A real-vertex is mapped into a single extracted node record. A virtual-vertex indicates the existence of a gap in our understanding of the structure of the data. A virtual-vertex defines labels and positions of multiple future-nodes which may appear in the semi-structured data between the real-vertex parent of the virtual-vertex and its real-vertex children. The prediction-tree combines two sub-trees: TCurrent and TFuture. TCurrent is composed entirely of real-vertices. TFuture contains a mixture of real and virtual vertices. The TFuture nodes are identified by future position IDs. From the prediction-tree, the algorithm constructs tree automata in step 1640.
Next, more details of step 235 are given in steps 1650 and 1660. In step 1650, the algorithm constructs a sub-trees automaton from the prediction automaton. This construction is the same as the construction in step 445. Step 1660 performs the STA(A) operation described in step 425. The input to the STA(A) is the twig-TA. The STA(A) result is the selected states in the prediction TA, which also constitute the selected nodes in the prediction-tree. The selected nodes are input to the next iteration of the node-extraction process in step 1615. If the node-extraction process does not extract nodes from the DB files, then the algorithm outputs the partial twig-pattern answers that exist in TCurrent. The output process is done in step 1670, which is the same as step 240.
This section describes the prediction tree construction in step 1630. This construction includes two tasks:
1) Construction of real-vertices for the prediction-tree. A real-vertex is identified by the start position of its node. It is located in the prediction-tree in the same place as its location between its minimal ancestor vertex and its minimal descendants in the original semi-structured data. The recordsPrediction function maps each real vertex to the extracted node.
2) Construction of virtual-vertices for the prediction-tree. A virtual-vertex v fills the gap between the real-vertex of its parent p and the real-vertices of its children, where each child is denoted by c in the prediction tree. The positionsPrediction function maps v to startu, where u is a future node. A node u appears between p and c. Therefore, u is a descendant of recordsPrediction[p] but not a descendant of recordsPrediction[c]. Therefore, startu is located between startp and endp but not between startc and endc. startu is also greater than or equal to the future position, so that u is a future-node. v is assigned the minimal such startu. The label function maps v to labelu of a node u which appears between p and c in the data. labelu can be a if u can be in filea. We check the next future record ra in filea. If startr
The pseudo code of the prediction tree construction algorithm is given below:
The above pseudo code uses Construct Real Vertices as an internal function that constructs real vertices. Its code is described below:
The construct prediction tree algorithm, described above, also uses an internal function that constructs virtual vertices whose code is given below:
The prediction-tree is initialized to be the empty tree.
The prediction TA construction (step 1640) is described next. This algorithm constructs ATree from TPrediction. The states of ATree are the TPrediction vertices. Two construction rules convert the prediction-tree into ATree:
1) Real vertex rule: constructs transitions which annotate real-vertices. A real-vertex is annotated when all its children exist;
2) Virtual vertex rule: constructs transitions which annotate virtual-vertices. A virtual-vertex is annotated when a subset of its children exists.
The constructed ATree accepts (in TA sense) the collection of all the predicted trees. The following pseudo code describes the prediction TA construction algorithm:
The algorithm constructs the (Qout, Σout, Fout, δout) UUTA from the prediction-tree in
The virtual vertex rule constructs the following transitions: δout({ },a)=2, δout({2},a)=2, δout({3},a)=2, δout({2, 3},a)=2, δout({ },α)=7, δout({12},α)=7, δout({14},α)=7, δout({7},α)=7, δout({12, 14},α)=7, δout({14, 7},α)=7, δout({12, 7},α)=7, δout({12, 14, 7},α)=7.
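The transitions listed above coincide with one transition for every subset of the children of the virtual-vertex taken together with the virtual-vertex itself. The following Python sketch generates them under that reading. The function name virtual_vertex_transitions and the triple representation (children states, label, target state) are assumptions made for illustration and are not the pseudo-code of the specification.

```python
from itertools import combinations


def virtual_vertex_transitions(vertex, children, label):
    """Generate delta(S, label) = vertex for every subset S of children + {vertex}.

    ASSUMPTION: this reading of the virtual vertex rule is inferred from the
    transitions listed in the example; the triple layout is illustrative.
    """
    candidates = sorted(set(children) | {vertex})
    transitions = []
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            transitions.append((frozenset(subset), label, vertex))
    return transitions


# Virtual-vertex 2 with child 3, and virtual-vertex 7 with children 12 and 14.
t2 = virtual_vertex_transitions(2, [3], "a")
t7 = virtual_vertex_transitions(7, [12, 14], "α")
```

Including the vertex itself among the candidates lets a virtual-vertex stand for a chain of several future-nodes, since a future-node may have another future-node of the same gap as its child.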
After the construction of the prediction TA, we give examples of steps 1650 and 1660. The prediction TA is converted to accept sub-trees in step 1650. Then, the STA(A) operation in step 1660 selects the prediction TA states that match selecting twig TA states. This operation returns the mapping of the selected prediction TA states to the selecting twig TA states.
This section describes the node extraction in step 1615. To limit memory usage, the algorithm uses a fixed number K of node records from each DB file to construct the prediction tree. The algorithm first takes the selected node records from the previous iteration. If the vertices in TCurrent were output in the previous iteration, then they are not taken in the current iteration. The following pseudo code describes the node extraction:
The output of the partial answers in step 1670 is described next. When all the real vertices in the prediction tree are selected, the record paths, which are mapped into nodes in TCurrent, are added to the output. The output maps the selecting twig states, which identify nodes in the twig pattern, to paths of node records which are expected to be part of the twig answers. The output portion of the algorithm traverses TCurrent top-down like step 525. This output processing uses the twig FSA and the selected states (step 1665) from the previous iteration as inputs to the top-down traversal. The following pseudo code describes this recursive algorithm. The algorithm's inputs are a parent vertex in TCurrent and a twig-TA state that selects it. It operates on the children of the parent vertex. If a child is not selected, then the recursion is passed to it with the same state. Otherwise, the algorithm finds transitions to the selected child states and passes the recursion to the found child and state. If a selecting state of a child indicates a new tree pattern, then the path is cleared.
Iterations (d), (e) and (g) add paths to the partial answer. In these iterations, all the real-vertices are selected by the STA (A) operations. Therefore, TCurrent is an output.
The partial output answers are given in the table below. There are three twig answers: Two of the answers are rooted in node 0. The other is rooted in node 8. We see in the table that the paths are not ordered according to the traversal order. For example, qb starts with path 8/11 and only then it moves to path 0/9.
This section describes the actions in step 1325. These actions are common relational DB actions. The inputs are the partial twig-pattern answers of step 1315. The first action is sorting of the partial answers, as shown in the table below. The second action is a merge-join operation (for more details, see C. J. Date, An Introduction to Database Systems) on the sorted partial answers. The merge-join traverses all lists of records and joins paths that share an equal common path. In this way, three solutions are returned as answers for (qa, qb, qc, qd, qe): (0, 19, 3, 20, 24), (0, 19, 9, 20, 24) and (8, 11, 9, 12, 14).
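The sort and merge-join actions can be illustrated with a generic sort-merge join in Python. This is a sketch of the standard technique and not the implementation of step 1325. The function name merge_join, the choice of the first path component as the common path, and the toy paths are hypothetical and do not reproduce the partial answers of the example.

```python
def merge_join(left, right, key):
    """Sort both path lists by `key` and join the entries with equal keys."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        k_left, k_right = key(left[i]), key(right[j])
        if k_left < k_right:
            i += 1
        elif k_left > k_right:
            j += 1
        else:
            # Find the run of equal keys on both sides, then pair them up.
            i_end = i
            while i_end < len(left) and key(left[i_end]) == k_left:
                i_end += 1
            j_end = j
            while j_end < len(right) and key(right[j_end]) == k_left:
                j_end += 1
            for l_path in left[i:i_end]:
                for r_path in right[j:j_end]:
                    joined.append((l_path, r_path))
            i, j = i_end, j_end
    return joined


# Hypothetical partial answers: each path is (root ID, node ID).
paths_b = [(8, 11), (0, 9)]
paths_c = [(0, 3), (8, 12), (0, 20)]
joined = merge_join(paths_b, paths_c, key=lambda path: path[0])
```

Because both inputs are sorted first, the join itself reads each list once, which is why unordered partial answers are sorted before they are joined.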
The indexing algorithm was compared against the GW index, which is the most accurate non-holistic index. The data guide groups together nodes which have the same labels on the path from the root. We tested indexing by using twig-patterns that were randomly generated to prune indexed data. We used three datasets in the experiments: TreeBank (Marcus, M., Santorini, B., Marcinkiewicz, M.: “Building a large annotated corpus of English: the Penn Treebank” in Computational Linguistics, vol. 19, pages 297-352, 1993), XMark (A. R. Schmidt, M. L. Kersten, M. A. Windhouwer, F. Waas. Efficient Relational Storage and Retrieval of XML Documents. In International Workshop on the Web and Databases, pages 47-52, 2000) and DBLP (Michael Ley, Patrick Reuther: Maintaining an Online Bibliographical Database: The Problem of Data Quality. EGC, pages 5-10, 2006). We considered two parameters: 1) accuracy, i.e. the percentage of nodes that matched the twig-pattern out of the extracted nodes; and 2) coverage, i.e. the share of twig patterns for which the holistic index improves the accuracy. The experiment showed that the holistic index improves the accuracy by about 30% compared with the data guide. For more complex queries the improvement is even more evident. The holistic index can be up to ten times more accurate than the non-holistic data guide. The coverage for complex patterns is about 80%.
The join algorithm presented in this invention (denoted by TwigTA hereinafter) was compared against the TwigStack (see BKS) and iTwigJoin (see T. Chen, J. Lu, and T. W. Ling “On boosting holism in XML twig pattern matching using structural indexing techniques”, SIGMOD, pages 455-466, 2006) join methods. TwigStack is the holistic join method that was first suggested. iTwigJoin is a holistic join method that incorporates non-holistic indexing. We tested the join methods on the Treebank and XMark data sets. We consider the following performance metrics to compare the performance of the twig pattern matching algorithms, which are based on three streaming schemes: 1) the number of extracted node records; 2) the number of produced intermediate paths; and 3) the running time.
As seen in
In terms of running time, For XMark (
XML is a semi-structured textual data format that has a tree model. All of the major suppliers of infrastructure products embrace XML and related standards as core technologies. The invention builds an infrastructure to implement XML across an enterprise and between organizations. Efficient storage and manipulation of terabytes of XML data becomes a critical task. This invention enables efficient query processing of XML data stored in a structured DB. Examples of XML applications in which the invention is a critical component of the implementing system are given below. Most of the examples describe systems that have the general architecture of
XML has surpassed SGML as the preferred method of adding application- and vendor-neutral descriptive markup to documents. The publishing industry, which uses XML to separate form from content, also uses databases to attach metadata to documents. Publishers use XML-based content management. The invention supports content-management products for multiple media, including Web clients, mobile devices, CD-ROM and print. The semi-structured data model supplies the ability to describe, manipulate and store rich hierarchies of content. Access to a structured DB enables attaching attributes (known as metadata) to the stored semi-structured data node records.
Content-management servers (2305 in
XML is at the core of Web services. Protocols such as SOAP, XML-RPC, and JMS enable software components to communicate with each other via XML dialects. Messaging applications require high throughput, rapid generation and ingestion of messages, the ability to query message payloads, extensible attributes, integrated XSLT transformation capabilities, and interfaces with standard APIs. The invention supports these activities.
XML on a structured DB (2300 in
XML is a low-cost replacement for electronic data interchange (EDI) implementations. Structured business documents, such as purchase orders and bill presentments, can be expressed as XML documents that can be delivered asynchronously and without the need for direct application-to-application integration, as was done with first-generation EDI implementations. In XML-based implementations, the systems are loosely coupled, rather than tightly integrated, because the data can be passed as an XML document that can be validated against a schema to ensure common definitions and enforce DOM fidelity (order of elements, namespaces, etc.) as well as to maintain fidelity to the original form of the data.
Structured DB (2300 in
Mass marketers such as WalMart use suppliers which have access to a WalMart database that stores the status of the merchandise items in WalMart's 7000 stores. The merchandise data is stored as semi-structured data in a structured DB (e.g. 2300 in
e-Business (Tying Legacy Systems to Web Applications through XML)
Increasingly, XML is being used as “glue” to bind legacy software applications to e-business front ends that deliver information to customers over the Web. A typical scenario is to transform the data in the legacy application to XML in order to hand it off to the new e-business application. As e-business projects grow in complexity, developers will want support for generating XML views over relational and other existing data. To be done efficiently, such application development requires integration with adaptors or gateways to create normalized XML views over multiple structured and semi-structured data.
Storing semi-structured data in structured DB (e.g. 2300 in
e-Government
Different database programs are a major problem in e-government projects in many countries. Consider, for instance, accessing a governmental portal in order to use a particular service (2315 in
Although one may have entered the e-government website via a single portal, behind the scenes the data required for these activities will typically be held in several different proprietary database systems. This is because of the long history of piecemeal implementation of databases in local government. Typically there will be no common standard for coding the data fields in these databases. For example, in one system, addresses might have fields with names such as House number, Street name, Town, City, Postcode and so on. Another system might have Address1, Address2, and Address3 instead. This is an example of the “legacy problem”. In many cases, it is too expensive to replace these diverse systems with new, integrated systems operating on common standards. Somehow, the older systems have to be incorporated into the newer e-government systems and have to be able to work together with them. A vital tool for enabling these diverse systems to work together has been XML. Fast querying of XML that is stored in a structured DB enables a solution to these problems. The data from these systems can be migrated into XML and stored as semi-structured data. The semi-structured data, together with the fast query processing that this invention enables, produces a reliable e-government application.
The various features and steps discussed above, as well as other known equivalents for each such feature or step, can be mixed and matched by one of ordinary skill in this art to perform methods in accordance with principles described herein. Although the disclosure has been provided in the context of certain embodiments and examples, it will be understood by those skilled in the art that the disclosure extends beyond the specifically described embodiments to other alternative embodiments and/or uses and obvious modifications and equivalents thereof. Accordingly, the disclosure is not intended to be limited by the specific disclosures of embodiments herein. For example, any digital computer system can be configured or otherwise programmed to implement the methods disclosed herein, and to the extent that a particular digital computer system is configured to implement the methods of this invention, it is within the scope and spirit of the present invention. Once a digital computer system is programmed to perform particular functions pursuant to computer-executable instructions from program software that implements the present invention, it in effect becomes a special purpose computer particular to the present invention. The techniques necessary to achieve this are well known to those skilled in the art and thus are not further described herein.
Computer executable instructions implementing the methods and techniques of the present invention can be distributed to users on a computer-readable medium and are often copied onto a hard disk or other storage medium. When such a program of instructions is to be executed, it is usually loaded into the random access memory of the computer, thereby configuring the computer to act in accordance with the techniques disclosed herein. All these operations are well known to those skilled in the art and thus are not further described herein. The term “computer-readable medium” encompasses distribution media, intermediate storage media, execution memory of a computer, and any other medium or device capable of storing for later reading by a computer a computer program implementing the present invention.
Accordingly, drawings, tables, and description disclosed herein illustrate technologies related to the invention, show examples of the invention, and provide examples of using the invention and are not to be construed as limiting the present invention. Known methods, techniques, or systems may be discussed without giving details, so to avoid obscuring the principles of the invention. As it will be appreciated by one of ordinary skill in the art, the present invention can be implemented, modified, or otherwise altered without departing from the principles and spirit of the present invention. Therefore, the scope of the present invention should be determined by the following claims and their legal equivalents.
All patents, patent applications and publications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual patent, patent application or publication was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention.
This application claims the benefit of U.S. Provisional patent application No. 61/032,109 filed Feb. 28, 2008, which is incorporated herein by reference in its entirety.