STRUCTURED DOCUMENT PROCESSING APPARATUS, METHOD, AND PROGRAM

Information

  • Patent Application
  • 20070150493
  • Publication Number
    20070150493
  • Date Filed
    December 06, 2006
  • Date Published
    June 28, 2007
Abstract
Statistical information about instance documents and schema information are used to integrate multiple state transitions that enable sectioning of a structured document, thereby generating an optimum automaton. In integrating state transitions, consecutively matching state transitions are held in the form of an ID list, which is then used to count the number of consecutive state transitions. Furthermore, patterns in the number of occurrences of repetitive elements, including nested elements, are statistically obtained. Variations of blanks in XML are addressed by using a statistical method. Schema information is used to build an automaton beforehand, thereby reducing the initialization overhead of the syntax parsing apparatus.
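The ID-list counting mentioned in the abstract can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the implementation disclosed in the application: the function names `assign_ids` and `count_repeats`, the use of plain strings as transition labels, and the fixed-width block comparison are all hypothetical choices made here for clarity.

```python
from typing import Dict, List, Tuple


def assign_ids(events: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Give each distinct transition label an integer ID in first-seen order.

    Hypothetical helper: the labels stand in for state transitions.
    """
    table: Dict[str, int] = {}
    ids: List[int] = []
    for ev in events:
        if ev not in table:
            table[ev] = len(table)
        ids.append(table[ev])
    return ids, table


def count_repeats(ids: List[int], start: int, width: int) -> int:
    """Count back-to-back repetitions of the ID block ids[start:start+width]."""
    if width <= 0:
        raise ValueError("width must be positive")
    block = ids[start:start + width]
    if len(block) < width:
        return 0  # not enough IDs left to form even one block
    count = 0
    pos = start
    # Compare ID lists rather than re-matching the underlying text.
    while ids[pos:pos + width] == block:
        count += 1
        pos += width
    return count


# Assumed event stream for a document with three <item> children.
events = [
    "<items>",
    "<item>", "text", "</item>",
    "<item>", "text", "</item>",
    "<item>", "text", "</item>",
    "</items>",
]
ids, table = assign_ids(events)
repeats = count_repeats(ids, start=1, width=3)
```

Because repeated transitions share the same IDs, detecting a run of repetitive elements reduces to comparing short integer lists, which is the intuition behind holding consecutively matching transitions as an ID list.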
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 shows increases in average processing time with an increasing number of states;



FIG. 2 shows a comparison between Deltarser and existing XML parsers in state transition generation overhead;



FIG. 3 is a functional block diagram of a structured document processing apparatus of an embodiment of the present invention;



FIG. 4 illustrates a method for assigning IDs to state transitions and a method for counting repetitive elements;



FIG. 5 shows an optimized automaton;



FIG. 6 shows a loop automaton;



FIG. 7 shows an expanding automaton;



FIG. 8 shows a first XML instance document;



FIG. 9 shows a structure of an automaton of the first XML instance document before optimization;



FIG. 10 shows a flow of processing nested repetitive elements;



FIG. 11 shows a structure of an optimized automaton of the first XML instance document;



FIG. 12 shows a second XML instance document;



FIG. 13 shows a structure of an optimized automaton of the second XML instance document;



FIG. 14 shows a classification of optimizations;



FIG. 15 shows an automaton of simple-type elements before optimization;



FIG. 16 shows an optimized automaton of the simple-type elements;



FIG. 17 shows a first exemplary XML Schema;



FIG. 18 shows a compositor;



FIG. 19 shows a second exemplary XML Schema;



FIG. 20 shows an example of state transitions due to blanks;



FIG. 21 shows an XML Schema used in an experiment in a first embodiment;



FIG. 22 shows an XML instance document used in the experiment in the first embodiment;



FIG. 23 shows a result of the experiment in the first embodiment; and



FIG. 24 shows a graph of the result of the experiment in the first embodiment.


Claims
  • 1) A structured document processing apparatus performing syntax parsing of a structured document in the form of electronic data, comprising: an automaton generating unit which generates a state transition sequence of a plurality of states enabling sectioning of a structured document into a plurality of nodes; an instance document analyzing unit which integrates state transitions in the state transition sequence generated by the automaton generating unit by using statistical information regarding an instance document which is the entity of the structured document and statistically obtains patterns in the number of occurrences of repetitive elements in the state transitions by using the statistical information; a schema information analyzing unit which integrates the state transitions in the state transition sequence generated by the automaton generating unit by using schema information which defines the structure and format of information regarding the structured document; and an automaton optimizing unit which mutually optimizes automatons integrated by the instance document analyzing unit and the schema information analyzing unit.
  • 2) The structured document processing apparatus according to claim 1, wherein the structured document is an XML document.
  • 3) The structured document processing apparatus according to claim 1, wherein the plurality of states enabling sectioning are defined by SAX events.
  • 4) The structured document processing apparatus according to claim 1, further comprising a consecutive state transition counting unit which assigns an ID to each of the state transitions in integration of the state transitions in the instance document analyzing unit, stores consecutively matching state transitions in the form of a list of IDs, and counts the occurrences of the consecutively matching state transitions by using the list of IDs.
  • 5) The structured document processing apparatus according to claim 1, wherein the automaton optimizing unit optimizes the repetitive elements detected by the instance document analyzing unit even if the repetitive elements are nested.
  • 6) The structured document processing apparatus according to claim 1, wherein the instance document analyzing unit fixes a pattern of any number of blank characters appearing between elements in the structured document by using the statistical information.
  • 7) A structured document processing method for performing syntax parsing of a structured document in the form of electronic data, comprising: generating a state transition sequence of a plurality of states enabling sectioning of a structured document into a plurality of nodes; integrating state transitions in the state transition sequence generated at the automaton generating by using statistical information regarding an instance document which is the entity of the structured document and statistically obtaining patterns in the number of occurrences of repetitive elements in the state transitions by using the statistical information; integrating the state transitions in the state transition sequence generated at the automaton generating by using schema information which defines the structure and format of information regarding the structured document; and mutually optimizing automatons integrated at the instance document analyzing and the schema information analyzing.
  • 8) The structured document processing method according to claim 7, wherein the structured document is an XML document.
  • 9) The structured document processing method according to claim 7, wherein the plurality of states enabling sectioning are defined by SAX events.
  • 10) The structured document processing method according to claim 7, further comprising: assigning an ID to each of the state transitions in integration of the multiple state transitions at the statistically obtaining, storing consecutively matching state transitions in the form of a list of IDs, and counting the occurrences of the consecutively matching state transitions by using the list of IDs.
  • 11) The structured document processing method according to claim 7, wherein the automaton optimizing optimizes the repetitive elements even if the repetitive elements are nested.
  • 12) The structured document processing method according to claim 7, wherein a pattern of any number of blank characters appearing between elements in the structured document is fixed by using the statistical information at the statistically obtaining.
  • 13) A computer program for performing syntax parsing of a structured document in the form of electronic data, the computer program causing a computer to perform: generating a state transition sequence of a plurality of states enabling sectioning of a structured document into a plurality of nodes; integrating state transitions in the state transition sequence generated at the automaton generating by using statistical information regarding an instance document which is the entity of the structured document and statistically obtaining patterns in the number of occurrences of repetitive elements in the state transitions by using the statistical information; integrating the state transitions in the state transition sequence generated at the automaton generating by using schema information which defines the structure and format of information regarding the structured document; and mutually optimizing automatons integrated at the statistically obtaining and the integrating.
  • 14) The computer program according to claim 13, wherein the structured document is an XML document.
  • 15) The computer program according to claim 13, wherein the plurality of states enabling sectioning are defined by SAX events.
  • 16) The computer program according to claim 13, further comprising: assigning an ID to each of the multiple state transitions in integration of the state transitions at the statistically obtaining, storing consecutively matching state transitions in the form of a list of IDs, and counting the occurrences of the consecutively matching state transitions by using the list of IDs.
  • 17) The computer program according to claim 13, wherein the automaton optimizing optimizes the repetitive elements even if the repetitive elements are nested.
  • 18) The computer program according to claim 13, wherein a pattern of any number of blank characters appearing between elements in the structured document is fixed by using the statistical information at the statistically obtaining.
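Claims 6, 12, and 18 recite fixing the pattern of blank characters between elements by using statistical information. One rough way to picture this is sketched below. It is an assumption-laden illustration, not the claimed method: the names `blank_patterns` and `dominant_blank`, the regular expression used to find inter-tag gaps, and the `min_share` threshold are all hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional

# Whitespace (possibly empty) between the end of one tag and the start of the next.
GAP = re.compile(r">(\s*)<")


def blank_patterns(doc: str) -> List[str]:
    """Return every inter-tag whitespace run found in one instance document."""
    return GAP.findall(doc)


def dominant_blank(docs: Iterable[str], min_share: float = 0.9) -> Optional[str]:
    """Pick the blank pattern to fix, if one pattern dominates the samples.

    min_share is an assumed threshold: the pattern is fixed only when it
    accounts for at least that fraction of all observed gaps.
    """
    counts: Counter = Counter()
    for doc in docs:
        counts.update(blank_patterns(doc))
    total = sum(counts.values())
    if total == 0:
        return None
    pattern, hits = counts.most_common(1)[0]
    return pattern if hits / total >= min_share else None


# Assumed sample document: two gaps of newline plus two spaces, one bare newline.
sample = "<a>\n  <b>x</b>\n  <c/>\n</a>"
fixed = dominant_blank([sample], min_share=0.6)
```

With a dominant pattern fixed in advance, a parser could try that exact byte sequence first and fall back to general whitespace handling only when it does not match.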
Priority Claims (1)
Number         Date      Country  Kind
JP2005-374990  Dec 2005  JP       national