Fast processing of an XML data stream

Description

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:

FIG. 1 shows an outline of XML processing;

FIG. 2 is an example of a finite state machine that accepts the language L={w|w has both an even number of 0's and an even number of 1's};

FIG. 3 shows the algorithmic flow of the sequential application of the offline and online algorithms of the present invention;

FIG. 4 is an example of an XML schema;

FIG. 5 illustrates the DFA_Schema constructed from the XML schema of FIG. 4;

FIG. 6 shows a deterministic finite automaton that defines the language of the regular expression “ab(n)*c”;

FIG. 7 shows an example of alternate-single transition-sequences, “/root/a/b” and “root/a/a/b”;

FIG. 8 shows an example of alternate-double transition-sequences, “/root/c/d”; and “/root/d”;

FIGS. 9a-9d illustrate the removal of an unbound XPath-expression element d from the XPath-query /root//d/e;

FIG. 10 is pseudocode for the DFA reduction of the offline algorithm of the present invention;

FIGS. 11a-11d show an example of the reduction process of the offline algorithm of the present invention;

FIG. 12 shows the application of Algorithm 1 to the DFA_Schema of FIG. 5 and the XPath-query “/root/c/d”;

FIG. 13 is pseudocode for the online algorithm of the present invention;

FIG. 14 shows an XML document whose processing by the online algorithm of the present invention is illustrated in Table 1;

FIG. 15 shows the DFA_minXPath that is used in the processing illustrated in Table 1;

FIG. 16 shows the mapping of the DFA_Schema alphabet (a, b, c) onto the indices of the DFA_Schema transitions (1, 2, 3 and 4);

FIG. 17 shows the mapping of the DFA_Schema and the XPath-query of FIG. 11 onto transition symbols;

FIGS. 18a-18e show the reduction of the DFA_Schema of FIG. 17;

FIG. 19 is pseudocode of the DPDT parser of Averbuch et al. '307 as modified for the present invention;

FIG. 20 illustrates the extended algorithm of the present invention;

FIGS. 21 and 22 are partial high-level block diagrams of a system for implementing the present invention;

FIGS. 23 and 24 are partial high-level block diagrams of hardware implementations of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The principles and operation of XML query processing according to the present invention may be better understood with reference to the drawings and the accompanying description.

In what follows, we first describe the basic algorithm of the present invention and then describe the extended algorithm of the present invention. The prior art methods discussed above are designed to handle many concurrent XPath-queries. The extended algorithm of the present invention uses the basic algorithm of the present invention to handle a large number of XPath-queries as well.

One unique advantage of the present invention over prior art methods is that the optimization of the present invention works well also with small collections of queries.

Referring again to the drawings, the basic algorithm of the present invention (FIG. 3) is divided into two sequential parts:

- 1. Offline: constructs a DFA with a minimal alphabet, denoted hereinafter by DFA_minXPath, which accepts L_answer over the minimal alphabet.
- 2. Online: uses the DFA_minXPath from the offline part to provide an answer to an XPath-query in the XML data.

The offline part is called first, and only once, when a new XPath-query is assigned. The online part is called repeatedly, once for each document that is streamed to the system.

The input to the offline algorithm is an XPath-query and an XML Schema. The offline algorithm has the following consecutive parts:

- 1. DFA construction: Constructs:
  - a. DFA_schema from the input schema
  - b. DFA_query from the input XPath-query
- 2. DFA reduction: generates a DFA with a minimal alphabet that accepts L_answer. This DFA is denoted herein by “DFA_minXPath”.

In FIG. 3, the operations are enclosed in ovals and the outputs of the operations are enclosed in dotted boxes.

The basic algorithm, whose three components are illustrated in FIG. 3, processes the XML data syntactically and semantically using formal language methodology. The basic algorithm defines the XML data as a formal language L_root. L_root accurately characterizes the aspects of the semistructured data that are needed to optimize the query processing. L_root contains only finite words. Therefore, L_root belongs to the class of regular languages.

Formal languages have been used before to define XML and other semistructured data. For example, tree languages are widely recognized as a presentation for semistructured data. But all these languages are too general to provide efficient algorithms to process queries.

The basic algorithm defines L_query and L_schema as regular languages. The initial step of the basic algorithm constructs a DFA_schema that accepts L_schema from the XML Schema. The construction is denoted by “1a” in the basic algorithm in FIG. 3. The DFA_schema can also be constructed directly from L_root using grammatical induction methods. The use of schema, as it is defined above, is unique to the present invention.

The basic algorithm defines the query as a RE. The DFA that is constructed from this RE accepts L_query. This DFA is denoted herein by “DFA_query”.

The overall combined framework of the offline algorithm includes:

- 1. Constructions of DFA_schema and DFA_query;
- 2. Manipulation (explained below) of the DFA_schema and the DFA_query to produce the DFA that accepts L_answer over a reduced alphabet.

The DFA, which accepts L_answer, is denoted hereinafter by DFA_answer. DFA_minXPath is the DFA_answer with the minimal alphabet.

- 3. Answer the query by computing the intersection L_answer ∩ L_root. The intersection is done by applying DFA_minXPath to L_root.

Steps 1 and 2 belong to the offline algorithm, while step 3 is the core of the online algorithm.

The present invention uses three operations on DFA_schema and DFA_query:

- 1. Intersection: DFA_answer = DFA_schema ∩ DFA_query.
- 2. Completion: DFA_complement = DFA_schema ∩ ¬DFA_query, where ¬DFA_query accepts the complement of L_query. DFA_complement is the complement of DFA_answer within DFA_schema; in other words, DFA_schema = DFA_answer ∪ DFA_complement.
- 3. Symbol removal: removes a symbol s from Σ*. Let h_s be the homomorphism

$h_{s} : \Sigma \to \Delta \triangleq \Sigma \setminus \{s\}$

such that

$h_{s}(a) = \begin{cases} a & a \neq s \\ \varepsilon & \text{otherwise} \end{cases}$

where ε is the empty string. The homomorphism h_s is applied on alphabet Σ for L. After the removal of the symbol s, the language L is denoted by L^−s. A DFA that accepts the language L can be modified to accept the language L^−s. The modified DFA, which accepts L^−s, is denoted herein by DFA^−s.
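As a minimal sketch of the symbol-removal homomorphism, the following Python fragment applies h_s to words represented as tuples of element names. The function names and the two sample words are illustrative assumptions and do not appear in the figures.

```python
def remove_symbol(word, s):
    """Apply h_s to one word: every occurrence of s maps to the empty string."""
    return tuple(a for a in word if a != s)


def remove_symbol_from_language(language, s):
    """Apply h_s to every word of a finite language, giving L^-s."""
    return {remove_symbol(word, s) for word in language}


# Two hypothetical words that differ only in the symbol "a".
L = {("root", "a", "b"), ("root", "a", "a", "b")}
L_minus_a = remove_symbol_from_language(L, "a")
print(L_minus_a)
```

Both sample words collapse into the single word "root b" once "a" is erased, which is the merging effect that the redundancy definition tests for.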

We define “redundant symbols” as follows: A symbol s is redundant in DFA_answer if and only if DFA^−s_answer ∩ DFA^−s_complement = Ø. If the symbol s is non-redundant, then there are two words w₁∈L_answer and w₂∈L_complement that are merged into the same word w after the removal of the symbol s. A non-redundant symbol is also called a “necessary symbol”.

The Basic Algorithm: Pseudocode

The following pseudocode (“Algorithm 1”) is pseudocode for the offline part in the basic algorithm of FIG. 3. This pseudocode removes redundant symbols from DFA_answer.

Inputs: XML Schema, XPath-query

Locals: DFA_answer, DFA_schema, DFA_query, DFA_complement

Return: DFA_min_XPath

Processing:

Construction of DFA_schema from XML Schema

Construction of DFA_query from XPath-query

DFA_answer ← DFA_schema ∩ DFA_query

DFA_complement ← DFA_schema ∩ ¬DFA_query (¬DFA_query is the complement of DFA_query)

while there exists a symbol s ∈ Σ_answer such that DFA^−s_answer ∩ DFA^−s_complement = Ø do

DFA_answer ← DFA^−s_answer

DFA_complement ← DFA^−s_complement

end while

DFA_min_XPath ← DFA_answer

End Processing
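The loop of Algorithm 1 can be illustrated with a small Python sketch. To stay short, the sketch replaces the DFAs by finite sets of words (tuples of element names), so intersection and emptiness become plain set operations; this substitution, the function names, the sample schema language, and the choice to try every symbol that occurs in either language are assumptions made for illustration only. It does not model the separate rule that the last XPath-expression element is always kept.

```python
def erase(language, s):
    """Apply h_s to every word of a finite language."""
    return {tuple(a for a in word if a != s) for word in language}


def reduce_alphabet(l_schema, is_answer):
    """Remove redundant symbols one at a time, as in Algorithm 1.

    l_schema  -- finite set of words standing in for DFA_schema
    is_answer -- predicate standing in for DFA_query
    """
    answer = {w for w in l_schema if is_answer(w)}   # schema intersected with query
    complement = l_schema - answer                   # schema minus answer
    removed = []
    changed = True
    while changed:
        changed = False
        alphabet = sorted({a for w in answer | complement for a in w})
        for s in alphabet:
            a2, c2 = erase(answer, s), erase(complement, s)
            if a2.isdisjoint(c2):        # s is redundant
                answer, complement = a2, c2
                removed.append(s)
                changed = True
                break
    return answer, complement, removed


# Hypothetical schema language and the query '/root/a/b' as an exact match.
L_SCHEMA = {
    ("root", "a"),
    ("root", "a", "b"),
    ("root", "a", "a", "b"),
    ("root", "c", "d"),
}
answer, complement, removed = reduce_alphabet(
    L_SCHEMA, lambda w: w == ("root", "a", "b")
)
print(removed)   # symbols found redundant, in removal order
print(answer)    # the answer language over the reduced alphabet
```

On this sample, "c", "d" and "root" are removed, while "a" and "b" stay because erasing either one would merge an answer word with a complement word.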

The following pseudocode (“Algorithm 2”) is pseudocode for the online part of the basic algorithm of FIG. 3. This pseudocode applies the DFA_min_XPath on a single input path P, P∈L_root. The online basic algorithm is extended below to process multiple paths that share common prefixes (see FIG. 13).

The online pseudocode handles two transition types:

- Transitions of DFA_min_XPath;
- Transitions of redundant symbols.

Inputs: Path P in L_root

DFA_min_XPath = (Σ, Q, δ, q₀, F)

Locals: state q_current

Return: YES if P is an answer of the XPath-query and NO otherwise.

Processing:

q_current ← q₀

while there exists a next symbol s in P do

if s ∈ Σ then

if δ(q_current, s) is defined then

q_current ← δ(q_current, s)

end if

end if

end while

if q_current ∈ F then return YES else return NO

End processing
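Algorithm 2 translates almost line for line into Python. In the sketch below the DFA is a dictionary keyed by (state, symbol); the concrete transition table, including the explicit error state q_e, is a hypothetical DFA_min_XPath for the query ‘/root/a/b’ over the reduced alphabet {a, b} and is not taken from the figures.

```python
def is_answer(path, sigma, delta, q0, finals):
    """Return True (YES) if the path is accepted by DFA_min_XPath."""
    q_current = q0
    for s in path:
        if s in sigma:                       # symbols outside sigma are redundant
            if (q_current, s) in delta:      # transition defined
                q_current = delta[(q_current, s)]
    return q_current in finals


# Hypothetical reduced DFA: q0 -a-> qA -b-> qB, everything else goes to q_e.
SIGMA = {"a", "b"}
DELTA = {
    ("q0", "a"): "qA",
    ("q0", "b"): "q_e",
    ("qA", "b"): "qB",
    ("qA", "a"): "q_e",
    ("qB", "a"): "q_e",
    ("qB", "b"): "q_e",
}
FINALS = {"qB"}

print(is_answer(["root", "a", "b"], SIGMA, DELTA, "q0", FINALS))
print(is_answer(["root", "a", "a", "b"], SIGMA, DELTA, "q0", FINALS))
```

Because undefined transitions leave the state unchanged, rejection in this sketch relies on the explicit error state; the redundant symbol "root" is simply skipped.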

The Basic Offline Algorithm when L_schema is an IL

If the symbol s is non-redundant, then the two words w∈L_answer and w′∈L_complement, which differ only in s, become the same word after the removal of s. The transitions of DFA_answer and DFA_complement are δ_answer=δ_schema×δ_query and δ_complement=δ_schema×¬δ_query respectively, where ¬δ_query denotes the transitions of the complement of DFA_query. The special structure of an IA ensures that the words w and w′ are derived from a similar sequence of transitions in δ_schema. The two sequences of transitions are identical except for the transitions to and from a single state that produces the symbol s. Because of this similarity, we can identify a non-redundant symbol s in an IA by locally examining transitions that accept s and their neighboring transitions in the sequence. The DFA reduction (“2” in FIG. 3) constructs the transition patterns that identify non-redundant symbols.

FIG. 4 is an example of an XML Schema that we use in the following presentation to demonstrate various features of the algorithm of the present invention. This XML-Schema defines a “root” element with children elements “a” or “b” followed by element “c”. The elements “a” and “b” contain any combinations of the children elements “a” and “b”. The element “c” has a single child—the empty element

We construct the DFA_Schema from the XML Schema as follows:

- 1. The alphabet is a set of elements as defined in the XML Schema.
- 2. The states include one state for each element a in the XML Schema.
- 3. There exists a transition from a state A to a state B if according to the XML Schema element b is a possible child of element a.
- 4. The start state is an additional state with a transition to the root element state.
- 5. The final states are the states of all possible empty elements, where an “empty element” is an element with no children.
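The five construction rules above can be sketched in Python as follows. The input format (a dictionary from each element to its possible children), the state naming by upper-casing, the START state name, and the toy schema are all assumptions chosen for illustration; the toy schema is not claimed to reproduce FIG. 4.

```python
def build_schema_dfa(children, root):
    """Build (alphabet, states, delta, start, finals) from a child relation.

    children -- dict mapping each element name to a list of possible children
    root     -- name of the root element
    """
    start = "START"                                   # rule 4: extra start state
    alphabet = set(children)                          # rule 1: one symbol per element
    states = {start} | {e.upper() for e in children}  # rule 2: one state per element
    delta = {(start, root): root.upper()}             # rule 4: START -> ROOT
    for parent, kids in children.items():
        for kid in kids:                              # rule 3: parent -> child
            delta[(parent.upper(), kid)] = kid.upper()
    # rule 5: elements without children are final states
    finals = {e.upper() for e, kids in children.items() if not kids}
    return alphabet, states, delta, start, finals


# Hypothetical toy schema.
TOY_SCHEMA = {
    "root": ["a", "c"],
    "a": ["a", "b"],
    "b": [],
    "c": ["d"],
    "d": [],
}
alphabet, states, delta, start, finals = build_schema_dfa(TOY_SCHEMA, "root")
print(sorted(delta.items()))
print(sorted(finals))
```

Every transition labelled with element x enters the state X, so each symbol determines its target state in the automaton built this way.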

FIG. 5 illustrates the DFA_Schema constructed from the XML-Schema presented in FIG. 4. The DFA_Schema is constructed in the following way:

- 1. The alphabet (denoted by small letters): root, a, b and c
- 2. The states (denoted by capital letters): ROOT, A, B and C.
- 3. The final states (denoted by a double circle): A, B and D.

FIG. 5 shows that the constructed DFA is an IA.

We now describe the DFA reduction scheme when L_schema is an IA. The inputs to the DFA reduction process are the DFA_schema that is constructed in step “1a” of FIG. 3 and the input XPath-query. This algorithm does not remove symbols from the DFA_answer. Instead, the algorithm removes redundant symbols from both inputs. Here, removal of symbol s (also called element removal) means: 1. Application of the homomorphism h_s on DFA_Schema, which generates DFA_schema^−s; and 2. Removal of XPath-expressions that contain the element s from the XPath-query.

After the removal of all the redundant symbols from both inputs, which is described next, the algorithm constructs the DFA_Schema with minimal alphabet that is called DFA_min_schema. In addition, the algorithm reduces the redundant XPath-expressions. The algorithm constructs the DFA_query of the reduced XPath-query, which is called DFA_min_query. The algorithm constructs DFA_min_XPath = DFA_min_schema ∩ DFA_min_query.

The first step in this reduction process checks if the XPath-query is valid for the schema. If the XPath-query is valid, we identify the necessary-elements that cannot be removed from the alphabet. For example, the element n in FIG. 6 is a necessary-element for the matching of the XPath-query ‘/a/b/c’. If the element n is removed from the alphabet of DFA_schema, then the self-transition in state A in DFA_schema^−n does not exist. After the removal of n, we are unable to determine if the transition-sequence from start-state to A, from A to B and from B to C originally matched the context ‘/a/b/c’. The original sequence may contain self-transitions n in state A. In this case, the original sequence does not match the context ‘/a/b/c’. A removal of the necessary element n causes two transition-sequences that differ only by this element to be identical. We call these two transition-sequences “alternate transition-sequences”. In other words, alternate transition-sequences are two different transition-sequences that share the same start-state and end-state. The two alternate transition-sequences complement each other in their matching successes. For example, FIG. 6 shows a DFA that defines the language of the regular expression “ab(n)*c”. In FIG. 6, the transition-sequences 1) from start-state to A, from A to B and from B to C, and 2) from start-state to A, from A to A, from A to B and from B to C, are alternate transition-sequences because sequence 1 matches the context ‘/a/b/c’ while sequence 2 does not match the context ‘/a/b/c’ and the two sequences differ from each other by a single element n.

To remove an element from the alphabet, all the alternate transition-sequences that differ only by this element are examined. Because DFA_Schema is an IA, it suffices to check only alternate transition-sequences that differ from each other by at most two transitions:

- 1. Alternate-single-transitions—different in a single self-transition
- 2. Alternate-double-transition—different in two transitions

Alternate transition-sequences generate two words w∈L_answer and w′∈L_complement that differ from each other by a single element s. For α,β∈Σ*, the element is one of two types:

- 1. Internal: w=αsβ and w′=αβ.
- 2. External: w=αβ and w′=αsβ.

We now classify the occurrences of alternate transitions-sequences. Altogether there are four alternate transition-sequence patterns:

- 1. External-single: In this case, the element s is accepted by a self-transition where w=αβ and w′=αsβ. After the removal of element s, αβ∈L_answer^−s and αβ∈L_complement^−s. Therefore, s is not redundant. FIG. 7 illustrates this pattern for the XPath-query ‘/root/a/b’. Let w=“root a b” and w′=“root a a b”. The self-transition a is part of the transition sequence from start-state to ROOT, from ROOT to A, from A to A and from A to B that accepts w′. This transition-sequence alternates with the transition sequence from start-state to ROOT, from ROOT to A and from A to B that accepts w.
- 2. Internal-single: In this case, the element s is accepted by a self-transition where w=αsβ and w′=αβ. After the removal of element s, αβ∈L_answer^−s and αβ∈L_complement^−s. Therefore, s is not redundant. Let w=“root a a b” and w′=“root a b”. FIG. 7 shows this pattern for the XPath-query ‘/root/a/a/b’. The self-transition a is part of the matched transition sequence from start-state to ROOT, from ROOT to A, from A to A and from A to B that accepts w. This transition sequence alternates with the transition sequence from start-state to ROOT, from ROOT to A and from A to B that accepts w′.
- 3. Internal-double: Assume we have three states A, B and C. A is connected to B by a transition that accepts s, B to C by a transition that accepts c, and A to C by a transition that accepts c. Assume that w=αscβ and w′=αcβ. After the removal of element s, αcβ∈L_answer^−s and αcβ∈L_complement^−s. Therefore, s is not redundant. Let w=“root c d” and w′=“root d”. This pattern is illustrated in FIG. 8. In FIG. 8, for the XPath query ‘/root/c/d’ there are three states ROOT, C and D. ROOT is connected directly to C and C to D, and ROOT is also connected directly to D. In this case, the transition sequence from start-state to ROOT, from ROOT to C and from C to D, which accepts w, alternates with the transition sequence from start-state to ROOT and from ROOT to D, which accepts w′.
- 4. External-double: We use the same pattern as in pattern 3. In pattern 3, the two transitions (A to B and B to C) belong to the sequence that accepts w=αscβ. Here, the single transition A to C belongs to the sequence that accepts w=αcβ, and the two transitions belong to the sequence that accepts w′=αscβ. This pattern is illustrated in FIG. 8. For the XPath query ‘/root/d’ there are three states ROOT, C and D. ROOT is connected directly to C and C to D, and ROOT is also connected directly to D. In this case, the transition sequence from start-state to ROOT and from ROOT to D, which accepts w, alternates with the transition sequence from start-state to ROOT, from ROOT to C and from C to D, which accepts w′.
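The structural part of these four patterns, a self-transition for the single cases and a two-step path with a one-step shortcut for the double cases, can be located by a local scan of the transition table. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, and the two transition tables are reconstructed from the textual descriptions of FIG. 7 and FIG. 8 rather than copied from the figures. Whether a found pattern is internal or external, and whether it makes the element necessary, still depends on the XPath-query.

```python
def alternate_patterns(delta, s):
    """Find candidate alternate-transition patterns for symbol s.

    delta -- dict mapping (state, symbol) to the next state
    Returns (singles, doubles):
      singles -- states with a self-transition on s
      doubles -- triples (A, B, C) with A -s-> B, B -c-> C and A -c-> C
    """
    singles, doubles = [], []
    for (src, sym), dst in delta.items():
        if sym != s:
            continue
        if src == dst:
            singles.append(src)
            continue
        for (mid, c), end in delta.items():
            # B -c-> C exists and A can reach C directly on the same symbol c
            if mid == dst and c != s and delta.get((src, c)) == end:
                doubles.append((src, dst, end))
    return singles, doubles


# Transition tables reconstructed from the descriptions of FIG. 7 and FIG. 8.
FIG7 = {
    ("START", "root"): "ROOT",
    ("ROOT", "a"): "A",
    ("A", "a"): "A",
    ("A", "b"): "B",
}
FIG8 = {
    ("START", "root"): "ROOT",
    ("ROOT", "c"): "C",
    ("C", "d"): "D",
    ("ROOT", "d"): "D",
}
print(alternate_patterns(FIG7, "a"))
print(alternate_patterns(FIG8, "c"))
```

For the FIG. 7 table the scan reports the self-transition at A, and for the FIG. 8 table it reports the triple (ROOT, C, D); these are the transitions a reduction step would then compare against the query context.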

In FIG. 7, we check whether element ‘a’ can be removed from the alphabet.

We check this for three different XPath-queries contexts:

- 1. Element ‘a’ in ‘/root/a/b’ is necessary because the transition that accepts ‘a’ is part of an external-single transition pattern (pattern 1). This pattern indicates the existence of two alternate transition sequences: 1. start-state to ROOT, ROOT to A and A to B; 2. start-state to ROOT, ROOT to A, A to A and A to B
- 2. Element ‘a’ in ‘/root/a/a/b’ is necessary because the transition that accepts ‘a’ is part of an internal-single transition pattern (pattern 2). This pattern indicates the existence of two alternate transition sequences: 1. start-state to ROOT, ROOT to A and A to B; 2. start-state to ROOT, ROOT to A, A to A and A to B.

- 3. Element ‘a’ in ‘/root//a/b’ is redundant because two transition-sequences match the context ‘/root//a/b’. The sequences are: 1. start-state to ROOT, ROOT to A, A to A and A to B; 2. start-state to ROOT, ROOT to A and A to B.

In FIG. 8 we check whether element ‘c’ can be removed from the alphabet. We check this for three different XPath-queries contexts:

- 1. Element ‘c’ in ‘/root/c/d’ is necessary because the transition that accepts ‘c’ is part of an internal-double transition-sequence (pattern 3). This pattern indicates the existence of two alternate transition sequences: 1. start-state to ROOT, ROOT to C and C to D; 2. start-state to ROOT, ROOT to D.
- 2. Element ‘c’ in ‘/root/d’ is necessary because the transition that accepts ‘c’ is part of an external-double transition-sequence (pattern 4). This pattern indicates the existence of two alternate transition sequences: 1. start-state to ROOT, ROOT to C and C to D; 2. start-state to ROOT, ROOT to D.
- 3. Element ‘c’ in ‘/root//d’ is redundant because two transition-sequences match the context: 1. start-state to ROOT, ROOT to C and C to D; 2. start-state to ROOT and ROOT to D.

When we remove an unbound XPath-expression element, the reduction algorithm may produce an invalid XPath-query. Removal of an element is possible when a scenario of the type illustrated in FIG. 9 occurs. FIGS. 9a-9d show two examples of removal of an unbound XPath-expression element d from the XPath-query ‘/root//d/e’. FIG. 9a shows the source schema of the first example. FIG. 9b shows the results after the removal of d from the source schema of FIG. 9a. FIG. 9c shows the source schema of the second example. FIG. 9d shows the results after the removal of d from the source schema of FIG. 9c. Element d in ‘/root//d/e’ (FIG. 9a) can be removed because transition d from ROOT to D exists. The removal of d merges the states ROOT and D. Therefore, the valid XPath-query ‘/root//d/e’, which has a transition-sequence to E, remains valid when ‘/root//d/e’ is reduced to ‘/root/e’ as shown in FIG. 9b. The XPath-query ‘/root//d/e’ becomes invalid when transition d from state ROOT to state D does not exist (FIG. 9c). In this case, the reduced query ‘/root/e’ is invalid because the transition from ROOT to C separates the transition from start-state to ROOT from the transition from C to E (see FIG. 9d).

The last XPath-expression element is always a necessary-symbol. For example, this is demonstrated by the XPath-query ‘/root/e’ in FIG. 9b. Element ‘e’ in this XPath-query is necessary. If we remove the element ‘e’, we get the XPath-query ‘/root’. But then we cannot determine whether the matched context is ‘/root/e’ or ‘/root’.

FIG. 10 shows pseudocode for the DFA reduction (part “2” in FIG. 3) in the offline algorithm.

FIGS. 11a-11d show an example of the flow of the DFA reduction process of the offline algorithm for the XPath-query ‘/root/a/b’ and the DFA_Schema in FIG. 7. FIG. 11a shows the initial state. Element ‘a’ in FIG. 11b is accepted by a transition that is part of the sequence that accepts w=“root a b”. Element ‘a’ is a necessary-element because the transitions that accept “a” are part of two transition patterns: a single-external transition pattern that indicates the alternate transition sequence that accepts w′=“root a a b”, and the double-internal transition pattern that indicates the alternate transition sequence that accepts w′=“root b”. Element ‘b’ in FIG. 11b is a necessary-element because a transition that accepts “b” is part of a double-external transition pattern. The transition pattern indicates that two alternate transition sequences exist that accept: 1. w=“root a b”; 2. w′=“root b a b”. FIG. 11c shows the DFA_schema of FIG. 7 after the reduction of elements c and d. FIG. 11d shows the reduction of the root element.

Two different procedures have been given herein for the removal of redundant elements. One procedure is presented above as the pseudocode of Algorithm 1. The other procedure is presented in FIG. 10. The two procedures are equivalent. In Algorithm 1 we use DFA_answer and DFA_complement. In FIG. 10, we use the notion of alternate transition sequence. We show now, using the example of FIG. 12, that DFA_answer and alternate transition sequences are interchangeable.

In FIG. 10, the offline algorithm processes two alternate transition sequences on DFA_schema (FIG. 5). One transition sequence accepts the word w=“root c d”∈L_answer and the other transition sequence accepts the word w′=“root d”∈L_complement. The transition sequences are labeled ‘internal’ and ‘external’ to differentiate between them. The ‘external’ transition sequence of root.d is accepted by DFA_answer (FIG. 12b), which is constructed from the intersection of DFA_query (FIG. 12a) and DFA_schema (FIG. 5). The ‘internal’ transition sequence of “root c d” is accepted by DFA_answer (FIG. 12d), which is constructed from the intersection of DFA_query (FIG. 12c) and DFA_schema (FIG. 5). The algorithm in FIG. 10 does not allow the removal of symbol c because symbol c generates an alternate-double-transition-sequence pattern. The alternate transition sequences accept two words: “root c d”∈L_answer and “root d”∈L_complement. The two words differ only in symbol c. Algorithm 1 also does not allow the removal of symbol c. The removal of symbol c from the alphabet by the homomorphism in Algorithm 1 is illustrated in FIG. 12e.

(The homomorphism is described by DFA_answer^−c.) Algorithm 1 considers symbol c as a necessary-symbol because the intersection between DFA_answer^−c (FIG. 12e) and DFA_complement^−c (FIG. 12b) is not empty, since the word root.d exists in both. Therefore, an alternate transition sequence indicates that the intersection L_answer^−c ∩ L_complement^−c ≠ Ø. The word that is accepted by the transition-sequence after removing symbol c is in the intersection.

The Online Algorithm

The online algorithm accepts a stream of XML data together with the necessary-elements and DFA_minXPath, which are the two outputs of the offline algorithm, and provides as an output the XML elements that match the context. The algorithm processes each element sequentially. The element can be a start-element or an end-element. The necessary-elements and DFA_minXPath are treated as global data.

The online algorithm uses a stack to store the DFA_minXPath states. The states identify the common prefixes of the paths processed so far. At any given time there is a single active state. The algorithm uses the XML parser of Averbuch et al. '307 to implement the pseudocode of the online algorithm. The algorithm has three procedures that are called during the application of the XML parser:

- 1. Initialization in setup time
- 2. Receiving a start-element from the XML stream
- 3. Receiving an end-element from the XML stream

Pseudocode that describes the online procedures is given in FIG. 13.

We demonstrate the operation of the online algorithm in Table 1 on the XML document shown in FIG. 14 and on the DFA_minXPath in FIG. 15. The stack alphabet of Table 1 contains the DFA_minXPath states Q={q₀,q_A,q_B,q_e} of FIG. 15.

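TABLE 1

| Symbol | Stack | Operation | Matching |
| --- | --- | --- | --- |
| | q₀ | Init | |
| `<root>` | q₀ | Start - skipped | |
| `<b>` | q₀, q_e | Start | |
| `<b/>` | q₀, q_e | Start, End | |
| `</b>` | q₀ | End | |
| `<a>` | q₀, q_A | Start | |
| `<b>` | q₀, q_A, q_B | Start | Bingo! |
| `<b>` | q₀, q_A, q_B, q_e | Start | |
| `</b>` | q₀, q_A, q_B | End | |
| `</b>` | q₀, q_A | End | |
| `<a>` | q₀, q_A, q_e | Start | |
| `</a>` | q₀, q_A | End | |
| `</a>` | q₀ | End | |
| `</root>` | q₀ | End - skipped | |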

Two different procedures are given herein for the XML online processing of XPath queries. The pseudocode of Algorithm 2 presents one procedure. The other procedure is presented in FIG. 13. FIG. 13 is an extension of Algorithm 2. In Algorithm 2, a single path from the XML root to its leaf is processed. For a single path, it is sufficient to store a single state q_current that belongs to DFA_minXPath. FIG. 13 describes XPath processing on all the paths in the XML tree. Processing each path alone is inefficient. The algorithm in FIG. 13 shares the processing of the common sub-paths from the root. After processing the common sub-paths, the states are stored in a stack.

Algorithm 2 processes the XML path iteratively. The algorithm in FIG. 13 contains the procedures that are called during the application of the XML parser of Averbuch et al. '307. The XML parser routine traverses the XML tree and uses the XPath procedures in the same iterative way that Algorithm 2 processes the q_current updates inside the while-loop.
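The stack discipline can be sketched in Python as below. The pseudocode of FIG. 13 and the transitions of FIG. 15 are not reproduced in the text, so the three procedures are folded into one hypothetical function, and the transition table is only a guess chosen to be consistent with the stack contents of Table 1. The event list is a hand-written stand-in for the parser callbacks, with the empty element written as a start event followed by an end event.

```python
def run_online(events, necessary, delta, q0, finals, dead="q_e"):
    """Process start/end events with a stack of DFA_minXPath states.

    Returns the indices of the start events whose state is final.
    """
    stack = [q0]                           # initialization
    matched = []
    for index, (kind, name) in enumerate(events):
        if name not in necessary:
            continue                       # redundant element: skipped
        if kind == "start":
            state = delta.get((stack[-1], name), dead)
            stack.append(state)            # push state for this element
            if state in finals:
                matched.append(index)
        else:
            stack.pop()                    # restore the parent's state
    assert stack == [q0]                   # well-formed input unwinds fully
    return matched


# Guessed transitions: q0 -a-> q_A -b-> q_B (final); all else is the dead state.
DELTA = {("q0", "a"): "q_A", ("q_A", "b"): "q_B"}
EVENTS = [
    ("start", "root"),
    ("start", "b"), ("start", "b"), ("end", "b"), ("end", "b"),
    ("start", "a"),
    ("start", "b"), ("start", "b"), ("end", "b"), ("end", "b"),
    ("start", "a"), ("end", "a"),
    ("end", "a"),
    ("end", "root"),
]
matches = run_online(EVENTS, {"a", "b"}, DELTA, "q0", {"q_B"})
print(matches)
```

Sibling subtrees reuse the state left on the stack by their common ancestors, so the shared prefix of several root-to-leaf paths is evaluated only once.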

Mapping the Transition Alphabet

Assume each element in the alphabet is mapped into the set of indices of the DFA_Schema transitions that accept the element. We call such an index a ‘transition-symbol’ (denoted herein by TS). Formally, assume we have DFA={Q,Σ,δ,q₀,F}. Denote δ_l ≜ δ(q_i,a_j)=q_k, l=(i,j,k), a_j∈Σ, q_i,q_k∈Q. We map the input symbol a_j to a new set of symbols denoted by l. The collection of symbols l constitutes the new alphabet Σ′. For a given transition δ_l with l=(i,j,k), the new transition, denoted by δ′_l, is δ′_l ≜ δ(q_i,l)=q_k, l∈Σ′, q_i,q_k∈Q. This mapping enables transformation of each DFA to an IA. Then the algorithm in FIG. 10 can be applied. TS provides a more detailed description of the transition assignments. Each symbol represents a transition. This way, the mapping enhances the performance because fewer transitions are used in the context matching. For example, TS 2 in FIG. 16 provides the information needed for matching the context ‘/a/b’. Therefore, we process only δ′₂ instead of processing both δ₁ and δ₂.

In order to increase the number of redundant symbols, we map the DFA_Schema alphabet into indices of the DFA_Schema transitions. An example of such a mapping is given in FIG. 16, which shows the mapping of the DFA_Schema alphabet (a,b,c) to the indices of the DFA_Schema transitions (1, 2, 3 and 4). Element a is mapped into TS 1, which is

$\delta_{1} \triangleq \delta(q_{0}, a) = A,$

and TS 3, which is $\delta_{3} \triangleq \delta(B, a) = A$. Element b is mapped into TS 2, which is

$\delta_{2} \triangleq \delta(A, b) = B,$

and element c is mapped into TS 4, which is

$\delta_{4} \triangleq \delta(A, c) = C.$

Now we explain how to map an XPath-query to transition symbols. For example, in FIG. 16 we look for the XPath-query //a/c. The mapping of this XPath-query assigns the symbol a to TS 1 and to TS 3. The mapping also assigns the symbol c to TS 4. From these two assignments we get two XPath queries: 1) //TS 1/TS 4; 2) //TS 3/TS 4. Formally, the n-th symbol s of the XPath-query expression, n=1, . . . , m, is assigned the set L_n={l:l=(i,j,k), s=a_j∈Σ, q_i,q_k∈Q} of TSs, where m is the number of expressions in the sequence that composes the XPath-query. The XPath-query is assigned to the Cartesian product L_1× . . . ×L_m. In the above example, m=2.

Each element of the Cartesian product L_1× . . . ×L_m is a translated XPath-query. If a symbol is redundant in all the valid translated XPath-queries, then the symbol is removed.
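The translation of the query //a/c can be reproduced with a few lines of Python. The transition table below encodes the four transitions that the text attributes to FIG. 16; the function names are illustrative assumptions, and the sketch enumerates the full Cartesian product without filtering out products that are invalid for the schema.

```python
from itertools import product

# TS index -> (source state, element, target state), as listed for FIG. 16.
TRANSITIONS = {
    1: ("q0", "a", "A"),
    2: ("A", "b", "B"),
    3: ("B", "a", "A"),
    4: ("A", "c", "C"),
}


def symbol_to_ts(transitions):
    """Map each element to the sorted list of TSs that accept it."""
    table = {}
    for ts, (_src, symbol, _dst) in sorted(transitions.items()):
        table.setdefault(symbol, []).append(ts)
    return table


def translate_query(steps, transitions):
    """Return every TS sequence in the product L_1 x ... x L_m."""
    table = symbol_to_ts(transitions)
    return list(product(*(table.get(step, []) for step in steps)))


print(symbol_to_ts(TRANSITIONS))
print(translate_query(["a", "c"], TRANSITIONS))
```

For the two-step query the product contains the two sequences (1, 4) and (3, 4), which correspond to //TS 1/TS 4 and //TS 3/TS 4.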

FIG. 17 shows the DFA_Schema and the XPath-query of FIG. 11 after having been mapped to transition symbols. The transition symbol of each element in parentheses is to the left of the element.

FIGS. 18a-18e show the reduction process of the DFA_Schema in FIG. 17. FIG. 18a shows the DFA_Schema. FIG. 18b shows the DFA_Schema after the removal of TS 5, 8, 9 and 10. We see that TS 3 is a necessary-TS because TS 3 creates an external-single alternate-sequence. In FIG. 18c, TS 1 is reduced. In FIG. 18d, TS 7 is reduced. Finally, in FIG. 18e, TS 2 is reduced. After the reduction, TS 6 creates a double-external alternate-sequence. TS 3 is still a necessary-TS but now TS 3 creates a double-external alternate-sequence.

In FIG. 18e, the reduction example is terminated by a DFA with three TSs. The original example in FIG. 11 is terminated by a DFA with six transitions. The reduction after the mapping reduces the number of transitions by a factor of two.

The online algorithm translates the input symbols of L_root into TSs. We use DPDT from Averbuch et al. '307 to translate the symbols. We replace the start-element and the end-element procedures in FIG. 13 with new procedures that are called Start-TS and End-TS. Start-TS and End-TS accept as an input a TS instead of an element. The TS is extracted from our XML-parser DPDT automata (see Averbuch et al. '307).

The DFA_Schema that is constructed from DPDT contains δ(q_i,a_j)=q_j such that q_i,q_j∈Q, a_j∈Σ, and a_j always enters q_j. The TS of this DFA_Schema has the form {l:l=(i,j,j), s=a_j∈Σ, q_i,q_j∈Q}. DPDT is defined as follows:

$M = (Q, \Sigma \cup \{\$\}, \Gamma, \Delta, \delta, q_{00}, Z_{0}, \{f_{0}\})$

where

$Q = \bigcup_{i=0}^{n} Q_{i}$

$\Sigma = \{a_{1}, \overline{a}_{1}, a_{2}, \overline{a}_{2}, \dots, a_{n}, \overline{a}_{n}\} \cup \Sigma'$

$\Gamma = \{Z_{0}\} \cup \{[q, a_{j}] \mid q \in Q_{i},\ 0 \leq j \leq n\}$

For each Q_i in M there exists a unique q_i in the constructed DFA_Schema. From the top of the stack [q, a_j] we get the previous and the current states of the DFA_Schema. The previous state is the unique q_i that is constructed from the states Q_i, q∈Q_i, and the current state q_cur is q_j that accepts a_j. The new symbol s can be one of the following:

- 1. If s=ā_j then the End-TS procedure is called with TS (i, j, j). The transition-symbol from Q_i to Q_j is not needed.
- 2. If s=a_l then the Start-TS procedure is called with TS (j,l,l). The transition-symbol from Q_j to Q_l is needed.
- 3. If s∈Σ′ then the procedure in the XPath is not applied because q_cur remains in the same Q_j.

Pseudocode that describes the modifications of the DPDT algorithm and the adaptation of the DPDT algorithm to processing TSs is given in FIG. 19. In FIG. 19, the modified DPDT is denoted by “Modified-DPDT”.

The System That Implements the Extended Algorithm

In the basic algorithm, a semistructured query states a pattern of semistructured model entities that is called a “context”. The XML standard allows a query to have more than one context. The contexts are arranged in a tree of contexts. The XML standard allows each context to include a Boolean expression that is calculated on the textual value of the matched node in the tree. The Boolean expression is written as a textual string. Therefore, this Boolean expression is called a “text expression” in this section.

FIG. 20 shows how to extend the basic algorithm of the present invention to support many concurrent XPath-queries, and gives the flow of the core components of the XML processing system. We describe the flow top-down, starting from the input of the XML schema. The system receives the XML schema as an input (denoted by a in FIG. 20). The XML parser-generator (denoted by 2 in FIG. 20), which is described in Averbuch et al. '307, generates a parser table with the XML symbols syntax (denoted by e in FIG. 20) for the XML parser of this schema (denoted by 7 in FIG. 20). As a byproduct, the XML parser-generator also produces the DFA_Schema (denoted by m in FIG. 20).

In addition, the system also receives an XPath-query as an input (denoted by b in FIG. 20). The system then translates the XPath-query (denoted by 3 in FIG. 20) into a query that fits streaming. From the query that fits streaming, the system creates a DFA_query (denoted by f in FIG. 20) that is given to the XPath-uniting algorithm (denoted by 6 in FIG. 20).

The XPath-uniting algorithm adds the DFA_query to a cluster C^k, k = 1, . . . , K, where K is the number of clusters. The DFA_query of cluster C^k, which is denoted DFA_query^C^k (denoted by n in FIG. 20), is given to the DFA reduction process (denoted by 5 in FIG. 20).

For each DFA_query^C^k, the DFA reduction process constructs the DFA_min_XPath from the DFA_query^C^k. This DFA_min_XPath is denoted DFA_min_XPath^C^k (denoted by h in FIG. 20). The DFA reduction process outputs the DFA_min_XPath^C^k to the XML parser (denoted by 8 in FIG. 20) that processes the XPath-queries to find matched contexts.

The system receives streams of XML data as an input (denoted by c in FIG. 20). The XML stream is validated by the XML parser (denoted by 7 in FIG. 20) that is constructed from the generated parsing table (denoted by e in FIG. 20). The parser's symbols (denoted by j in FIG. 20) are the input to the XML parser (denoted by 8 in FIG. 20) that processes the XPath-queries to detect matched contexts.

The matched text expression is a Boolean expression represented by a string that is a part of the XPath-query. This Boolean expression is applied on the textual value of the element that is matched by the query context (box 6). This text expression (denoted by k in FIG. 20) is the input to the matched text module (denoted by 9 in FIG. 20) that calculates the XPath Boolean expression on the matched texts. The output of the matched text module is the XPath-query result (denoted by d in FIG. 20). If needed, modules 8 and 9 in FIG. 20 can be duplicated to run in parallel (concurrent) mode.

When the XML data does not have a schema, the system provides a mechanism to build a schema from the XML stream. The statistics of XML symbol occurrences are gathered (denoted by 4 in FIG. 20). The symbol statistics (denoted by g in FIG. 20) are input to the schema builder (denoted by 1 in FIG. 20) that constructs a schema for the XML stream. The symbol statistics (g) are also input to the DFA reduction module (5), which can order the sequence of the removal of the symbols according to their sizes in the stream.
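The statistics gathering of module 4 can be illustrated with a short sketch. It assumes that a "symbol" is an element tag and that its "size" is its number of occurrences in the stream; both are assumptions made for this example. The function names `symbol_statistics` and `removal_order` are hypothetical, and the ordering direction (most frequent first) is also an assumption, since the text does not state it.

```python
import io
from collections import Counter
from typing import List
import xml.etree.ElementTree as ET


def symbol_statistics(xml_bytes: bytes) -> Counter:
    """Count how often each element tag opens in the XML input."""
    counts: Counter = Counter()
    # iterparse reads the input incrementally; only "start" events are
    # needed to count occurrences of each opening tag.
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        counts[elem.tag] += 1
    return counts


def removal_order(counts: Counter) -> List[str]:
    """Order symbols for removal.

    Assumption of this sketch: most frequent symbols come first, with
    ties broken alphabetically so that the order is deterministic.
    """
    return [tag for tag, _ in sorted(counts.items(),
                                     key=lambda kv: (-kv[1], kv[0]))]
```

With this ordering a reduction module would consider the most frequent symbols first. Reversing the sort key gives the opposite policy.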

The extended algorithm (FIG. 20) is divided into two sequential parts:

- 1. Offline: constructs a DFA_min_XPath^C^k with a minimal alphabet.
- 2. Online: uses the DFA_min_XPath^C^k from the Offline part to provide an answer to several concurrent XPath-queries in an XML stream.

A description of each operational module in the flowchart of FIG. 20 is given in Table 2.

TABLE 2

| Box # | Functionality description of the box |
| --- | --- |
| 1 | Constructs a schema from a stream of XML symbols |
| 2 | Averbuch et al. '307 - “XML Parser” |
| 3 | See C. Bry and S. Schaffert, Towards a declarative query and transformation language for XML and semistructured data: simulation unification, Research Report PMS-FB-2002-2, Computer Science Institute, Munich, Germany, February 2002. Translates query syntax to fit XML streaming processing |
| 4 | Constructs two hierarchy levels of XML symbols |
| 5 | Basic algorithm of the present invention |
| 6 | Unites different DFAs according to similarities between the DFAs' symbols |
| 7 | Averbuch et al. '307 - “XML Parser” |
| 8 | Averbuch et al. '307 - “XML Parser” |
| 9 | “Rete” type matching. C. Forgy, “Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem”, Artificial Intelligence, vol. 19, pp. 17–37, 1982 |

Uniting of Queries in the Extended Algorithm

Streaming dictates the need to process a large number of XPath-queries concurrently. Therefore, the basic algorithm is extended to fit streaming requirements. This extension is achieved by the module that unites similar DFA_querys to be processed together. The input for the unite operation (denoted by 6 in FIG. 20) is a DFA_query (denoted by f in FIG. 20). Pseudocode for the uniting algorithm is given in Algorithm 3. In this algorithm, the new DFA_query is added to the union of existing DFA_querys, DFA_query^C^k, that has the “closest” alphabet. To add the new DFA_query, the uniting component (denoted by 6 in FIG. 20) creates a new DFA_query^C^k (denoted by n in FIG. 20) that contains the new and the original DFA_querys. The new DFA_query^C^k is given as an input to the DFA reduction module (denoted by 5 in FIG. 20).

The following pseudocode (“Algorithm 3”) describes the uniting algorithm of module 6 of FIG. 20.

Inputs: $\mathrm{DFA}_{query}$ at time $t$, denoted by $\mathrm{DFA}_{query^{t}} = (\Sigma_{q^{t}}, Q_{q^{t}}, \delta_{q^{t}}, S_{q^{t}}, F_{q^{t}})$

Output: $C^{j}$, $j = 1, \dots, K$

From the processing before time $t$:

$q^{t-1} = \bigcup_{k=1}^{K} C^{k}$, $C^{k} = \bigcup q^{l}$, $l \leq t-1$: clusters of $\mathrm{DFA}_{query}$ before the current time $t$

$\mathrm{DFA}_{query}^{C^{k}} = (\Sigma_{C^{k}}, Q_{C^{k}}, \delta_{C^{k}}, S_{C^{k}}, F_{C^{k}})$, $k = 1, \dots, K$

Procedure:

Choose a specific $j$, $1 \leq j \leq K$, such that $\Sigma_{C^{j}}$ from $\mathrm{DFA}_{query}^{C^{j}}$ is the “closest” to $\Sigma_{q^{t}}$ from $\mathrm{DFA}_{query^{t}}$

$C^{j} \leftarrow C^{j} \cup \mathrm{DFA}_{query^{t}}$

End procedure
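Algorithm 3 leaves the notion of the “closest” alphabet open. The following sketch fills that gap with the Jaccard distance between alphabets, which is an assumption of this example and not a measure specified above. Each DFA_query is represented only by its alphabet (a set of symbol names), and the names `jaccard_distance` and `unite` are hypothetical. Starting a first cluster when none exists is likewise an assumption, because Algorithm 3 presumes K ≥ 1.

```python
from typing import List, Set

Alphabet = Set[str]
Cluster = List[Alphabet]  # a cluster C^k, kept as the alphabets of its queries


def jaccard_distance(a: Alphabet, b: Alphabet) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b|; two empty alphabets have distance 0."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def unite(clusters: List[Cluster], new_alphabet: Alphabet) -> int:
    """Add the new query to the cluster with the closest alphabet.

    Returns the index j of the chosen cluster. If there is no cluster
    yet, a first one is created (assumption of this sketch).
    """
    if not clusters:
        clusters.append([new_alphabet])
        return 0

    def cluster_alphabet(cluster: Cluster) -> Alphabet:
        # The alphabet of C^k is taken as the union of its members' alphabets.
        return set().union(*cluster)

    j = min(range(len(clusters)),
            key=lambda k: jaccard_distance(cluster_alphabet(clusters[k]),
                                           new_alphabet))
    clusters[j].append(new_alphabet)  # C^j <- C^j ∪ DFA_query^t
    return j
```

In a full system the updated cluster would then be handed to the DFA reduction module. Here only the cluster choice is modelled.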

Implementation

FIG. 21 is a partial high-level block diagram of a system 100 for implementing the present invention. The major components of system 100 that are illustrated in FIG. 21 are a processor 102, a random access memory (RAM) 104, a non-volatile memory (NVM) 106 such as a hard disk or a flash memory, and a network interface 108. Processor 102, RAM 104, NVM 106 and network interface 108 communicate with each other via a common bus 110. Optionally, system 100 also includes input and output devices in addition to network interface 108, for example a compact disk drive, a USB port, a monitor, a keyboard and/or a mouse, that also communicate via bus 110.

NVM 106 has embodied thereon source code for a message broker of the present invention. Specifically, NVM 106 has embodied thereon source code 112 for implementing the basic method of the present invention as illustrated in FIG. 3 or the extended method of the present invention as illustrated in FIG. 20. The source code is coded in a suitable high-level language. Selecting a suitable high-level language is easily done by one ordinarily skilled in the art. The language selected should be compatible with the hardware of system 100, including processor 102, and with the operating system of system 100. Examples of suitable languages include but are not limited to compiled languages such as FORTRAN, C and C++, and non-compiled languages such as JAVA. NVM 106 is an example of a computer readable storage medium on which is embodied program code of the present invention.

If source code 112 must be compiled to produce executable machine code, processor 102 compiles source code 112 to produce corresponding executable machine code 114 that is stored in RAM 104. If source code 112 does not need to be compiled in order to be executed, source code 112 is copied from NVM 106 to RAM 104 for execution. System 100 is coupled to a network (not shown) by network interface 108. The network could be as small as a two-computer LAN or as large as the worldwide Internet. System 100 could function on the network as a client, a server, a router, a switch, a hub or a gateway. The client may be a portable device such as a smart card, a cellular telephone or a palm pilot. The client may be an RFID tag reader. The server may be a database server for answering queries from clients about XML data in a database; the database itself may be either native or RDBMS or ORDBMS (Object Relational DBMS) or OODBMS (object oriented DBMS). The gateway may function as an XML proxy. XML data to be queried, and optionally the associated schema (“optionally” because source code 112 includes source code for constructing the schema from the data), are received from the network via network interface 108. Processor 102 executes machine code 114 to query the XML data.

Alternatively, rather than store source code for a message broker of the present invention in NVM 106, system 100 downloads executable code from a different node on the network, via network interface 108.

If system 100 is used to query a database then typically the database is stored in NVM 106.

FIG. 22 is a partial high-level block diagram of another system 120 for implementing the present invention. The major components of system 120 that are illustrated in FIG. 22 are a processor 122, a read-only memory (ROM) 124 and a network interface 128. Processor 122, ROM 124 and network interface 128 communicate with each other via a common bus 130.

ROM 124 has embodied thereon executable machine code for a message broker of the present invention. Specifically, ROM 124 has embodied thereon machine code 134 for implementing the basic method of the present invention as illustrated in FIG. 3 or the extended method of the present invention as illustrated in FIG. 20.

System 120 is coupled to a network (not shown) by network interface 128. As in the case of system 100, the network could be as small as a two-computer LAN or as large as the worldwide Internet; and system 120 could function on the network as a client, a server, a router, a switch, a hub, or a gateway, as discussed above in the context of system 100. XML data to be queried, and optionally the associated schema, are received from the network via network interface 128. Processor 122 executes machine code 134 to query the XML data.

FIG. 23 is a partial high-level block diagram of a hardware implementation of the present invention, specifically a PCI card 200. The major components of PCI card 200 that are illustrated in FIG. 23 are a standard 47-pin PCI interface 202, eight dedicated processors 206, 208, 210, 214, 216, 218, 220 and 222, and a RAM 224, all communicating with each other via a local bus 204. Dedicated processors 206, 208, 210, 214, 216, 218, 220 and 222 are, for example, application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). Dedicated processor 206 is a schema constructor that implements the XML statistics gathering of block (4) of FIG. 20 and the schema building of block (1) of FIG. 20. Dedicated processor 208 is a schema automaton constructor that implements the DFA_Schema construction and the XML parser generation of block (2) of FIG. 20. Dedicated processor 210 is a query automaton constructor that implements the DFA_query construction of block (3) of FIG. 20. Dedicated processor 214 is an answer automaton constructor that implements the automaton reduction of block (5) of FIG. 20. Dedicated processor 216 is a query automaton unite engine that implements the query uniting of block (6) of FIG. 20. Dedicated processor 218 is a parser that implements the data validation of block (7) of FIG. 20. Dedicated processor 220 is an answer automaton engine that implements the query processing of block (8) of FIG. 20. Dedicated processor 222 is a text matcher that implements the text matching of block (9) of FIG. 20.

Plugging PCI card 200 into the PCI bus of a standard personal computer provides that personal computer with a fast, hardware-based implementation of the functionality of the present invention. Those skilled in the art will readily conceive of analogous hardware implementations of the present invention that are suitable for incorporation in, for example, any of the network devices discussed above in the context of system 100.

FIG. 24 is a partial high-level block diagram of another hardware implementation of the present invention, in which the functionality of the present invention is distributed between two devices, an offline device 230 and an online device 240, that communicate with each other via a network 250. Only the components of devices 230 and 240 that are germane to the present invention are shown in FIG. 24. Those skilled in the art will readily understand what other components need to be included in devices 230 and 240 to render devices 230 and 240 fully functional.

Device 230 includes a PCI card 300 that in turn includes a standard 47-pin PCI interface 302, five dedicated processors 306, 308, 310, 314 and 316, and a RAM 324, all communicating with each other via a local bus 304. Dedicated processors 306, 308, 310, 314 and 316 are, for example, ASICs or FPGAs. Dedicated processor 306 is a schema constructor that implements the XML statistics gathering of block (4) of FIG. 20 and the schema building of block (1) of FIG. 20. Dedicated processor 308 is a schema automaton constructor that implements the DFA_Schema construction and the XML parser generation of block (2) of FIG. 20. Dedicated processor 310 is a query automaton constructor that implements the DFA_query construction of block (3) of FIG. 20. Dedicated processor 314 is an answer automaton constructor that implements the automaton reduction of block (5) of FIG. 20. Dedicated processor 316 is a query automaton unite engine that implements the query uniting of block (6) of FIG. 20. Device 230 also includes a network interface 260 for communicating with network 250 and a PCI bus 270 to which both network interface 260 and PCI card 300 are operationally connected.

Device 240 includes a PCI card 400 that in turn includes a standard 47-pin PCI interface 402, three dedicated processors 418, 420 and 422, and a RAM 424, all communicating with each other via a local bus 404. Dedicated processors 418, 420 and 422 are, for example, ASICs or FPGAs. Dedicated processor 418 is a parser that implements the data validation of block (7) of FIG. 20. Dedicated processor 420 is an answer automaton engine that implements the query processing of block (8) of FIG. 20. Dedicated processor 422 is a text matcher that implements the text matching of block (9) of FIG. 20. Device 240 also includes a network interface 280 for communicating with network 250 and a PCI bus 290 to which both network interface 280 and PCI card 400 are operationally connected.

Those skilled in the art will readily conceive of analogous distributed hardware implementations of the present invention that distribute the functionality of the present invention among two or more of any of the network devices discussed above in the context of system 100.

As noted at the beginning of this disclosure, the present invention is primarily intended for the fast querying of an XML data stream. The present invention also is eminently suited to similar applications such as fast querying of non-streaming semistructured data such as a fixed XML database.

While the invention has been described with respect to a limited number of embodiments, it will be appreciated that many variations, modifications and other applications of the invention may be made.

Claims

1. A method of answering a query of semistructured data, comprising the steps of: (a) constructing an answer automaton, based at least in part on the query and on a schema of the data; and (b) applying said answer automaton to the data to answer the query.
2. The method of claim 1, wherein said constructing is effected by steps including: (i) constructing a schema automaton for said schema; (ii) constructing a query automaton for the query; and (iii) merging said schema automaton and said query automaton to provide said answer automaton.
3. The method of claim 2, wherein said merging is effected by forming an intersection of said schema automaton and said query automaton.
4. The method of claim 2, wherein said automata are deterministic finite automata.
5. The method of claim 4, wherein said automata are isostate automata.
6. The method of claim 5, wherein said schema automaton first is constructed as a finite automaton that accepts an alphabet and then said alphabet is mapped into a set of transition indices that accept said alphabet, thereby transforming said finite automaton into an isostate automaton.
7. The method of claim 1, wherein said answer automaton is a deterministic finite automaton.
8. The method of claim 7, wherein said answer automaton is an isostate automaton.
9. The method of claim 1, further comprising the step of: (c) building said schema from the data.
10. The method of claim 1, wherein said applying includes parsing the data, using said answer automaton, to provide a matched context.
11. The method of claim 10, wherein said applying also includes calculating a Boolean expression, that is included in the query, on a textual value of said matched context.
12. The method of claim 10, wherein said constructing is effected by steps including constructing a schema automaton for said schema, using a parser generator that also produces parser tables corresponding to the schema, and wherein said parsing of the data includes using said parser tables to parse the data, thereby producing parser symbols, followed by parsing said parser symbols, using said answer automaton.
13. The method of claim 1, wherein said constructing includes removing redundant symbols from said answer automaton.
14. The method of claim 1, further comprising the steps of: (c) constructing a parsing table for the data, based on said schema; and (d) validating the data, prior to said applying, using said parsing table.
15. A method of answering a plurality of queries of semistructured data, comprising the steps of: (a) constructing an answer automaton, based at least in part on the queries and on a schema of the data; and (b) applying said answer automaton to the data to answer the queries.
16. The method of claim 15, wherein said constructing is effected by steps including: (i) constructing a schema automaton for said schema; (ii) constructing a joint query automaton for the queries; and (iii) merging said schema automaton and said joint query automaton to provide said answer automaton.
17. The method of claim 16, wherein said constructing of said joint query automaton is effected by steps including: (A) for each query, constructing a respective query automaton; and (B) uniting said query automata to provide said joint query automaton.
18. A device for processing semistructured data, comprising: (a) a memory for storing executable code for answering at least one query of the data, said executable code including: (i) executable code for constructing an answer automaton, based at least in part on said at least one query and on a schema of the data, and (ii) executable code for applying said answer automaton to the data to answer said at least one query; and (b) a processor for executing said executable code.
19. The device of claim 18, further comprising: (c) a network interface for receiving the data from a network.
20. A computer-readable storage medium having computer-readable code embodied on said computer-readable storage medium, the computer-readable code for answering at least one query of semistructured data, the computer-readable code comprising: (a) program code for constructing an answer automaton based at least in part on a schema of the data and on the at least one query; and (b) program code for applying said answer automaton to the data to answer said at least one query.
21. A system for answering a query of semistructured data, comprising: (a) a schema automaton constructor for constructing a schema automaton for a schema of the data; (b) a query automaton constructor for constructing a query automaton for the query; (c) an answer automaton constructor for merging said schema automaton and said query automaton to provide an answer automaton; and (d) an answer automaton engine for applying the answer automaton to the data to answer the query.
22. The system of claim 21, further comprising: (e) a schema constructor for constructing said schema from the data.
23. The system of claim 21, wherein said schema automaton constructor includes a parser generator for generating at least one parse table for the data, the system further comprising: (e) a parser for using said at least one parse table to validate the data.
24. The system of claim 21, wherein said answer automaton parses the data to provide a matched context, the system further comprising: (e) a text matcher for calculating a Boolean expression, that is included in the query, on a textual value of said matched context.
25. The system of claim 21, wherein said schema automaton constructor, said query automaton constructor, said answer automaton constructor and said answer automaton engine are implemented in a single common device.
26. The system of claim 21, wherein said schema automaton constructor, said query automaton constructor, said answer automaton constructor and said answer automaton engine are implemented in respective members of a plurality of devices that are operationally coupled by a network.
27. An apparatus for answering a plurality of queries of semistructured data, comprising: (a) a schema automaton constructor for constructing a schema automaton for a schema of the data; (b) a query automaton constructor for constructing respective query automata for the queries; (c) a query automaton unite engine for uniting said query automata to provide a joint query automaton; (d) an answer automaton constructor for merging said schema automaton and said joint query automaton to provide an answer automaton; and (e) an answer automaton engine for applying the answer automaton to the data to answer the queries.
28. The apparatus of claim 27, wherein said schema automaton constructor, said query automaton constructor, said query automaton unite engine, said answer automaton constructor and said answer automaton engine are implemented in a single common device.
29. The apparatus of claim 27, wherein said schema automaton constructor, said query automaton constructor, said query automaton unite engine, said answer automaton constructor and said answer automaton engine are implemented in respective members of a plurality of devices that are operationally coupled by a network.
30. A method of answering a query of semistructured data, comprising the steps of: (a) constructing an answer automaton, based at least in part on the query, said constructing including removing redundant symbols from said answer automaton; and (b) applying said answer automaton to the data to answer the query.
31. A device for processing semistructured data, comprising: (a) a memory for storing executable code for answering a query of the data, said executable code including: (i) executable code for constructing an answer automaton, based at least in part on said query, said constructing including removing redundant symbols from said answer automaton, and (ii) executable code for applying said answer automaton to the data to answer said query; and (b) a processor for executing said executable code.
32. The device of claim 31, further comprising: (c) a network interface for receiving the data from a network.
33. A computer-readable storage medium having computer-readable code embodied on said computer-readable storage medium, the computer-readable code for answering a query of semistructured data, the computer-readable code comprising: (a) program code for constructing an answer automaton, based at least in part on the query, said constructing including removing redundant symbols from said answer automaton; and (b) program code for applying said answer automaton to the data to answer the query.
34. A system for answering a query of semistructured data, comprising: (a) an answer automaton constructor for constructing an answer automaton, based at least in part on the query, said constructing including removing redundant symbols from said answer automaton; and (b) an answer automaton engine for applying said answer automaton to the data to answer the query.
35. A method of answering a query of semistructured data, comprising the steps of: (a) constructing, for the query, a finite query automaton that accepts an alphabet; (b) mapping said alphabet into a set of transition indices of said finite query automaton, thereby transforming said finite query automaton into an isostate query automaton; (c) transforming said isostate query automaton into an answer automaton; and (d) applying said answer automaton to the data to answer the query.
36. A device for processing semistructured data, comprising: (a) a memory for storing executable code for answering a query of the data, said executable code including: (i) executable code for constructing, for said query, a finite query automaton that accepts an alphabet, (ii) executable code for mapping said alphabet into a set of transition indices of said finite query automaton, thereby transforming said finite query automaton into an isostate query automaton, (iii) executable code for transforming said isostate query automaton into an answer automaton, and (iv) executable code for applying said answer automaton to the data to answer said query; and (b) a processor for executing said executable code.
37. The device of claim 36, further comprising: (c) a network interface for receiving the data from a network.
38. A computer-readable storage medium having computer-readable code embodied on said computer-readable storage medium, the computer-readable code for answering a query of semistructured data, the computer-readable code comprising: (a) program code for constructing, for the query, a finite query automaton that accepts an alphabet; (b) program code for mapping said alphabet into a set of transition indices of said finite query automaton, thereby transforming said finite query automaton into an isostate query automaton; (c) program code for transforming said isostate query automaton into an answer automaton; and (d) program code for applying said answer automaton to the data to answer the query.
39. A system for answering a query of semistructured data, comprising: (a) a query automaton constructor for: (i) constructing, for the query, a finite query automaton that accepts an alphabet, and (ii) mapping said alphabet into a set of transition indices of said finite query automaton, thereby transforming said finite query automaton into an isostate query automaton; (b) an answer automaton constructor for transforming said isostate query automaton into an answer automaton; and (c) an answer automaton engine for applying said answer automaton to the data to answer the query.
