This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2009-071661, filed on Mar. 24, 2009; the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to a structured document management device and a method.
2. Description of the Related Art
In recent years, various data are managed in a structured document format, such as XML (Extensible Markup Language). There are growing demands for managing various data in the structured document format, which have conventionally been managed by other means, such as formulaic numerical data managed with relational databases, and text data managed with full-text search engines.
Thus, sophisticated queries that designate keywords and structural conditions, such as queries written in XQuery (an XML query language standardized by the W3C (World Wide Web Consortium)), are being used for structured documents managed in structured document databases. With this, an increase in search speed is also demanded.
In order to increase the search speed, there is a method that includes dividing a text, which is to be registered in a structured document database, on a keyword basis; associating the divided keyword with document identification information of the structured document to be registered in the structured document database, structure information that indicates the structure of the structured document, and occurrence position information of the keyword in the structured document; and indexing in an inverted file format.
In an index managing system that uses the inverted file, the size of the index generally becomes significantly large. JP-A 2006-172363 (KOKAI) discloses preparing plural compression methods in advance and compressing an index using a compression method that corresponds to a time when registering the index in the inverted file format.
Because time is used as a key to determine the compression method in the art disclosed in JP-A 2006-172363 (KOKAI), the optimal compression method is not always selected.
Generally, a structured document, which is to be registered in a structured document database, has pieces of data of similar structures that are to be sequentially registered. However, it is often the case that the structure itself of the structured document does not have regularity or relevance. Thus, a high compression rate of the index structure information cannot be expected even when plural compression methods are separately used.
According to one aspect of the present invention, a structured document management device includes a communication unit configured to receive a structured document from a client device via a network; a document analyzing unit configured to analyze a structure of the structured document and extract text information of the structured document; a schema creating unit configured to create first schema information that indicates a structural characteristic of the structured document; a schema storage unit configured to store second schema information created in advance by the schema creating unit and schema identification information for identifying the second schema information in association with each other; a search unit configured to search the schema storage unit for the second schema information that substantially matches the first schema information; an adding unit configured to add, to the structured document, the schema identification information for identifying the second schema information searched by the search unit and identification information for identifying the structured document; an index creating unit configured to divide the text information into words and create index information associated with the schema identification information and the identification information for each of the words divided; an index storage unit configured to store the index information in a file format; an index analyzing unit configured to analyze, when a number of pieces of the index information stored in the file format exceeds a threshold, a distribution of a schema identification information group that includes pieces of the schema identification information equal in number to the pieces of the index information and an identification information group that includes pieces of the identification information equal in number to the pieces of the index information; a first rule storage unit configured to store the schema identification information, group identification 
information provided for uniquely identifying a group to which the schema identification information belongs and used in compression of the schema identification information group, and in-group identification information provided for uniquely identifying the schema identification information in the group and used in compression of the schema identification information group, in association with each other; a second rule storage unit configured to store a plurality of compression rules; a first compressing unit configured to compress, when a result of distribution analysis of the schema identification information group shows that more than a predetermined number of pieces of the schema identification information match pieces of the schema identification information stored in the first rule storage unit, the schema identification information group using the group identification information and the in-group identification information; and a second compressing unit configured to compress at least one of the schema identification information group and the identification information group using the compression rule stored in the second rule storage unit and determined in accordance with a result of distribution analysis of the index information.
According to another aspect of the present invention, a structured document management method includes receiving a structured document from a client device via a network; analyzing a structure of the structured document and extracting text information of the structured document; creating first schema information that indicates a structural characteristic of the structured document; searching a schema storage unit that stores second schema information created in advance and schema identification information for identifying the second schema information in association with each other for the second schema information that substantially matches the first schema information; adding, to the structured document, the schema identification information for identifying the searched second schema information and identification information for identifying the structured document; dividing the text information into words; creating index information associated with the schema identification information and the identification information for each of the words divided; storing the created index information in a file format in the index storage unit; analyzing, when a number of pieces of the index information stored in the file format exceeds a threshold, a distribution of a schema identification information group that includes pieces of the schema identification information equal in number to the pieces of the index information and an identification information group that includes pieces of the identification information equal in number to the pieces of the index information; compressing, when a result of distribution analysis of the schema identification information group shows that more than a predetermined number of pieces of the schema identification information match pieces of the schema identification information stored in a first rule storage unit having the schema identification information, group identification information provided for uniquely identifying a group to 
which the schema identification information belongs and used in compression of the schema identification information group, and in-group identification information provided for uniquely identifying the schema identification information in the group and used in compression of the schema identification information group stored in association with each other, the schema identification information group using the group identification information and the in-group identification information; and compressing at least one of the schema identification information group and the identification information group using a compression rule stored in a second rule storage unit that stores a plurality of compression rules and determined in accordance with a result of distribution analysis of the index information.
Hereinafter, with reference to the accompanying drawings, a structured document management device and a method according to an embodiment of the present invention will be described in detail.
First, a description is made about the configuration of a structured document management device according to this embodiment.
A structured document management device 1 illustrated in
As illustrated in
The communication unit 10 is configured to perform data transmission and reception with the client device 2 and receive the structured document and a search request of the structured document sent from the client device 2.
As illustrated in
In XML, a tag is used to describe the document structure and the tag includes a start tag and an end tag. Each of elements of the document structure is sandwiched by the start tag and the end tag in order to describe separation of a character string in the document and to which element the character string belongs. The start tag encloses an element name by the symbols “<” and “>” and the end tag encloses the element name by the symbols “</” and “>”. The start tag may have attribute information set therein.
The storage unit 50 is configured to store information used in search processing by the search processing unit 60 and storage processing by the storage processing unit 20, as well as processing results of the storage processing by the storage processing unit 20. This storage unit 50 may be an HDD (Hard Disk Drive), an optical disk, a memory card, a RAM (Random Access Memory) or another conventional storage medium. The storage unit 50 includes a schema storage unit 51, a structured document storage unit 52, an index list storage unit 53, an index storage unit 54, a first rule storage unit 55 and a second rule storage unit 56. These are described in detail later.
The storage processing unit 20 is configured to analyze the structured document received from the communication unit 10 and create index information used in referring to (searching) the structured document. The storage processing unit 20 includes a document analyzing unit 22, a schema processing unit 24, an adding unit 30, an index creating unit 32 and a compressing unit 34.
The document analyzing unit 22 is configured to analyze a text of the structured document received from the communication unit 10, develop the structured document in an object tree format such as DOM (Document Object Model) and extract text information from the structured document.
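The analysis described above can be sketched as follows. This is a minimal illustration using Python's standard `xml.dom.minidom` parser, not the implementation of the embodiment; the sample document, the function name `extract_text` and the path notation are assumptions introduced only for this example.

```python
from xml.dom import minidom

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = "<paper><title>XML index</title><author>Smith</author></paper>"

def extract_text(xml_source):
    """Develop the document into a DOM tree and collect (element path, text) pairs."""
    dom = minidom.parseString(xml_source)
    results = []

    def walk(node, path):
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                # Descend into the element, extending the path with its name.
                walk(child, path + "/" + child.tagName)
            elif child.nodeType == child.TEXT_NODE and child.data.strip():
                # Non-empty character data belongs to the enclosing element.
                results.append((path, child.data.strip()))

    walk(dom, "")
    return results

text_info = extract_text(SAMPLE)
```

Here `text_info` holds each piece of text together with the path of the element that encloses it, which is the information the later indexing steps need.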
The schema processing unit 24 is configured to create schema information that indicates structural characteristics of the structured document developed in the object tree format by the document analyzing unit 22 and determine schema identification information for identifying the created schema information. The schema processing unit 24 includes a schema creating unit 26 and a search unit 28.
The schema creating unit 26 is configured to scan the structured document developed in the object tree format by the document analyzing unit 22, consolidate overlapping elements of the same path, and create first schema information that indicates a structural characteristic of the structured document.
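One possible way to consolidate overlapping elements of the same path is sketched below. The representation of schema information as an ordered tuple of distinct element paths is an assumption made for this example; the function name `create_schema` and the sample document are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

def create_schema(xml_source):
    """Return the distinct element paths of a document, in first-seen order."""
    root = ET.fromstring(xml_source)
    paths = []
    seen = set()

    def walk(elem, parent_path):
        path = parent_path + "/" + elem.tag
        if path not in seen:
            # Repeated elements on the same path are consolidated into one entry.
            seen.add(path)
            paths.append(path)
        for child in elem:
            walk(child, path)

    walk(root, "")
    return tuple(paths)

# Hypothetical document with repeated <chapter> and <p> elements.
doc = "<book><chapter><p>a</p><p>b</p></chapter><chapter><p>c</p></chapter></book>"
schema = create_schema(doc)
```

In this sketch the two `chapter` elements and the three `p` elements collapse into a single path each, so documents that differ only in the number of repeated elements yield the same schema information.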
As illustrated in an example of
The schema storage unit 51 stores second schema information created in advance by the schema creating unit 26 and schema identification information for identifying the second schema information in association with each other.
As illustrated in an example of
The search unit 28 is configured to search the schema storage unit 51 for the second schema information that matches the first schema information created by the schema creating unit 26. Specifically, the search unit 28 checks whether the elements (nodes in the respective object trees) of both pieces of schema information match each other. Then, the search unit 28 determines the schema identification information associated with the searched second schema information as the schema identification information of the first schema information created by the schema creating unit 26.
When a search result shows that the first schema information created by the schema creating unit 26 does not match any of the second schema information stored in the schema storage unit 51, the first schema information created by the schema creating unit 26 is associated with new schema identification information and the schema storage unit 51 is updated.
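The lookup and registration behavior of the search unit 28 and the schema storage unit 51 can be sketched as follows. The class name `SchemaStore`, the use of a set of element paths as the schema information, and sequential integer identifiers are all assumptions made for illustration.

```python
class SchemaStore:
    """Hypothetical in-memory stand-in for the schema storage unit."""

    def __init__(self):
        self._ids = {}       # schema information -> schema identification information
        self._next_id = 1

    def find_or_register(self, schema):
        """Return the ID of a matching stored schema, registering a new one if none matches."""
        key = frozenset(schema)          # match on the set of elements, ignoring order
        schema_id = self._ids.get(key)
        if schema_id is None:
            # No stored schema matches: associate new identification information.
            schema_id = self._next_id
            self._ids[key] = schema_id
            self._next_id += 1
        return schema_id

store = SchemaStore()
first = store.find_or_register({"/book", "/book/chapter"})
again = store.find_or_register({"/book/chapter", "/book"})
other = store.find_or_register({"/paper", "/paper/title"})
```

Documents with the same structure therefore share one schema identifier, which is what makes the schema identification information in the index repetitive enough to compress.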
The adding unit 30 is configured to add the schema identification information associated with the second schema information searched by the search unit 28 and the identification information related to the structured document to the structured document developed in the object tree format by the document analyzing unit 22 and have the structured document stored in the structured document storage unit 52.
In this embodiment, the adding unit 30 adds, as the identification information related to the structured document, document identification information for uniquely identifying the structured document in the structured document management device 1 and element identification information for uniquely identifying the position of each element of the structured document in the structured document.
The index creating unit 32 is configured to create an index list by dividing the text information extracted by the document analyzing unit 22 into words and have the index list stored in the index list storage unit 53. Also, the index creating unit 32 creates, for each divided word, the index information in which the schema identification information added to the structured document by the adding unit 30 and the identification information are associated, and has the index information stored in the inverted file format in the index storage unit 54. Incidentally, the dividing into words may be performed by a conventional word dividing method such as N-gram or morphological analysis.
In the present embodiment, the identification information associated with the index information includes, in addition to the document identification information and the element identification information given to the structured document by the adding unit 30, occurrence position information (offset) that indicates the position of the word occurring in the elements of the structured document.
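A minimal sketch of such an inverted file is given below. Splitting on whitespace stands in for the N-gram or morphological analysis mentioned above, and the names `Posting` and `add_to_index`, as well as the use of the word position as the offset, are assumptions of this example.

```python
from collections import defaultdict, namedtuple

# One piece of index information for one occurrence of a word.
Posting = namedtuple("Posting", "schema_id doc_id element_id offset")

def add_to_index(index, schema_id, doc_id, elements):
    """Add postings for every word in `elements`, a list of (element_id, text) pairs."""
    for element_id, text in elements:
        # Whitespace splitting is a simplification of real word division.
        for offset, word in enumerate(text.lower().split()):
            index[word].append(Posting(schema_id, doc_id, element_id, offset))
    return index

index = defaultdict(list)
add_to_index(index, schema_id=1, doc_id=10,
             elements=[(1, "XML index"), (2, "index compression")])
```

Each word maps to a list of postings, and every posting carries the four values (schema, document, element, offset) that are later grouped and compressed.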
As illustrated in an example of
The inverted file has plural pages as illustrated in
As illustrated in
The first rule storage unit 55 is configured to store the schema identification information, group identification information provided for uniquely identifying a group to which the schema identification information belongs and used in compression of the schema identification information group, and in-group identification information provided for uniquely identifying the schema identification information in the group and used in compression of the schema identification information group, in association with each other.
As illustrated in an example of
The group identification information associated with the schema identification information stored in the first rule storage unit 55 is determined in accordance with the number of searches within a predetermined time period for the schema identification information stored in the schema storage unit 51 by the search unit 28.
The second rule storage unit 56 stores a plurality of compression rules. As illustrated in an example of
The compressing unit 34 is configured to, when a predetermined condition is met, compress the index information stored in the index storage unit 54 using the compression rule stored in the first rule storage unit 55 and the compression rule stored in the second rule storage unit 56. The compressing unit 34 includes an index analyzing unit 36, a first compressing unit 38, a second compressing unit 40 and a change unit 42.
The index analyzing unit 36 is configured to, when the number of index information pieces stored in the index storage unit 54 in the inverted file format exceeds a threshold, analyze the distribution of each of the schema identification information group, which includes as many schema identification information pieces as there are index information pieces, and the identification information group, which includes as many identification information pieces as there are index information pieces (document identification information group, element identification information group and occurrence position information group). In this embodiment, the number of index information pieces that fills one page of the inverted file is set as the threshold.
Specifically, the index analyzing unit 36 determines whether the compression rule stored in the first rule storage unit 55 can be used to compress the schema identification information group. In this embodiment, when the distribution analysis result of the schema identification information group shows more than a predetermined number of schema identification information pieces match the schema identification information stored in the first rule storage unit 55, the index analyzing unit 36 determines that compression can be made.
Similarly, the index analyzing unit 36 determines whether the plural compression rules stored in the second rule storage unit 56 can be used to compress the schema identification information group, the document identification information group, the element identification information group and the occurrence position information group. In this embodiment, the index analyzing unit 36 basically inspects whether each of the compression rules is applicable, starting from the top of the compression rules.
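The threshold check and the distribution analysis of the schema identification information group can be sketched as follows. The constants `PAGE_CAPACITY` and `MIN_MATCHES` and the function name `analyze` are hypothetical; the actual page size and predetermined number are not specified here.

```python
from collections import Counter

PAGE_CAPACITY = 4   # hypothetical threshold: postings that fill one page
MIN_MATCHES = 2     # hypothetical "predetermined number" of matching pieces

def analyze(postings, known_schema_ids):
    """postings: list of (schema_id, doc_id, element_id, offset) tuples."""
    if len(postings) <= PAGE_CAPACITY:
        return None                      # threshold not exceeded, no analysis
    # Distribution of the schema identification information group.
    distribution = Counter(p[0] for p in postings)
    # Count the pieces whose schema ID is registered in the first rule storage.
    matches = sum(n for sid, n in distribution.items() if sid in known_schema_ids)
    return {
        "distribution": distribution,
        "first_rule_applicable": matches > MIN_MATCHES,
    }

page = [(1, 10, 1, 0), (1, 10, 2, 3), (2, 11, 1, 0), (3, 12, 1, 5), (1, 13, 1, 2)]
result = analyze(page, known_schema_ids={1, 2})
```

The same counting approach could be applied to the document, element and occurrence position groups to decide which of the plural compression rules fits each of them.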
The first compressing unit 38 is configured to, when the distribution analysis result of the schema identification information group by the index analyzing unit 36 shows that more than a predetermined number of schema identification information pieces match the schema identification information stored in the first rule storage unit 55, compress the schema identification information group using the group identification information and the in-group identification information.
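One way such a compression could work is sketched below. The rule table contents, the encoded layout (a group identifier stored once for a run of in-group identifiers) and the function names are assumptions made for illustration; they are not taken from the embodiment.

```python
# Hypothetical first rule table: schema ID -> (group ID, in-group ID).
RULES = {101: (1, 0), 102: (1, 1), 205: (2, 0)}

def compress_schema_ids(schema_ids, rules):
    """Replace runs of schema IDs from one group by the group ID plus short in-group IDs."""
    out = []
    for sid in schema_ids:
        if sid in rules:
            group_id, in_group_id = rules[sid]
            if out and out[-1][0] == "group" and out[-1][1] == group_id:
                out[-1][2].append(in_group_id)      # extend the current run
            else:
                out.append(("group", group_id, [in_group_id]))
        else:
            out.append(("raw", sid))                # no rule: keep the ID as is
    return out

def decompress_schema_ids(encoded, rules):
    """Inverse of compress_schema_ids."""
    reverse = {pair: sid for sid, pair in rules.items()}
    ids = []
    for entry in encoded:
        if entry[0] == "group":
            ids.extend(reverse[(entry[1], g)] for g in entry[2])
        else:
            ids.append(entry[1])
    return ids

original = [101, 102, 101, 999, 205]
encoded = compress_schema_ids(original, RULES)
```

Because in-group identifiers only need to distinguish members of one group, they can be much smaller than full schema identifiers, which is where the saving would come from in this sketch.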
As illustrated in examples of
The second compressing unit 40 is configured to compress at least one of the schema identification information group and the identification information group (document identification information group, element identification information group and occurrence position information group) using the compression rule stored in the second rule storage unit 56 and obtained in accordance with the distribution analysis result of the index information by the index analyzing unit 36.
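Rule selection based on the distribution of a group can be sketched as follows. The two rules shown ("constant" for a group whose values are all equal, "delta" for a non-decreasing group) are hypothetical examples of compression rules, and the list order stands in for the ranking of the rules; applying several rules in sequence is not shown.

```python
def is_constant(values):
    return len(set(values)) == 1

def is_non_decreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))

# Hypothetical rule table in ranking order: (name, applicability test, encoder).
COMPRESSION_RULES = [
    ("constant", is_constant, lambda v: (v[0], len(v))),
    ("delta", is_non_decreasing,
     lambda v: [v[0]] + [b - a for a, b in zip(v, v[1:])]),
]

def compress_group(values):
    """Try each rule from the top and use the first one that applies."""
    if not values:
        return "raw", []
    for name, applies, encode in COMPRESSION_RULES:
        if applies(values):
            return name, encode(values)
    return "raw", list(values)      # no rule applies: store uncompressed

doc_ids = compress_group([7, 7, 7])
offsets = compress_group([3, 5, 9])
mixed = compress_group([5, 1, 4])
```

Returning the rule name together with the encoded data lets the name be recorded alongside the group, so that the data can be decoded later.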
In examples of
The second compressing unit 40 is configured to, when there is a plurality of compression rules in accordance with the distribution analysis results of the index information, compress at least one of the schema identification information group and the identification information group using the plural compression rules. In this case, the compression rules are applied in an order of ranking.
The change unit 42 is configured to rearrange the index information compressed by the first compressing unit 38 and the second compressing unit 40 for each of the schema identification information and identification information (document identification information group, element identification information group and occurrence position information group) and store them in the index storage unit 54.
In examples of
In addition, the change unit 42 is configured to store the applied compression rule as the header information of the index information in rearrangement. This makes it possible to recover the index information and to handle a specific information group in searching by the later-described search processing unit 60.
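The rearrangement with header information can be sketched as follows. The page layout (a dictionary with `header` and `groups`), the field names and the `encoders` argument are assumptions of this example; the encoder shown is a hypothetical "constant" rule.

```python
FIELDS = ("schema_id", "doc_id", "element_id", "offset")

def rearrange(postings, encoders):
    """Regroup row-wise postings into one group per field and record the rule used.

    postings: list of (schema_id, doc_id, element_id, offset) tuples.
    encoders: field name -> (rule name, encode function); other fields stay raw.
    """
    page = {"header": {}, "groups": {}}
    for i, field in enumerate(FIELDS):
        values = [p[i] for p in postings]            # gather one field of all postings
        rule_name, encode = encoders.get(field, ("raw", list))
        page["header"][field] = rule_name            # rule name kept as header information
        page["groups"][field] = encode(values)
    return page

postings = [(1, 10, 1, 0), (1, 10, 2, 3)]
page = rearrange(postings, {"doc_id": ("constant", lambda v: (v[0], len(v)))})
```

Storing each kind of identification information contiguously places similar values next to each other, and the header tells a reader which rule to reverse for each group.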
The search processing unit 60 is configured to search the structured document managed in the structured document management device 1. Specifically, the search processing unit 60 decodes the index information with reference to the header information of the index information stored in the index storage unit 54 and uses the decoded index information to search for the structured document stored in the structured document storage unit 52. The search processing unit 60 may be configured to retrieve only a specified information group from the index information to increase the index scanning speed.
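Decoding a single information group with reference to the header can be sketched as follows. The page layout, the rule names "raw", "constant" and "delta", and the function name `read_group` are hypothetical and serve only to illustrate the idea of header-driven decoding.

```python
from itertools import accumulate

# Hypothetical decoders, keyed by the rule name recorded in the header.
DECODERS = {
    "raw": lambda data: list(data),
    "constant": lambda data: [data[0]] * data[1],    # (value, count) -> repeated value
    "delta": lambda data: list(accumulate(data)),    # first value + differences -> values
}

def read_group(page, field):
    """Decode only the requested information group, leaving the others untouched."""
    rule = page["header"][field]
    return DECODERS[rule](page["groups"][field])

# Hypothetical stored page with two compressed groups.
page = {
    "header": {"doc_id": "constant", "offset": "delta"},
    "groups": {"doc_id": (10, 3), "offset": [3, 2, 4]},
}
doc_ids = read_group(page, "doc_id")
offsets = read_group(page, "offset")
```

A search that only needs document identifiers can call `read_group(page, "doc_id")` and skip the other groups entirely, which corresponds to retrieving only a specified information group to speed up index scanning.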
Next, a description is made about the operation of the structured document management device according to the present embodiment.
As illustrated in
Then, the document analyzing unit 22 analyzes the syntax of the structured document received from the communication unit 10, develops the structured document into the object tree format such as DOM and extracts text information from the structured document (Step S12).
Next, the schema creating unit 26 scans the structured document developed into the object tree format by the document analyzing unit 22, consolidates overlapping elements in the same path and creates the first schema information that indicates a structural characteristic of the structured document (Step S14).
Then, the search unit 28 searches the schema storage unit 51 for the second schema information that matches the first schema information created by the schema creating unit 26 (Step S16).
When the search result shows that the fundamental structures (elements) of both pieces of schema information match (Yes at Step S18), the search unit 28 determines the schema identification information associated with the searched second schema information as the schema identification information of the first schema information created by the schema creating unit 26.
When the search result shows that the fundamental structures (elements) of both pieces of schema information do not match (No at Step S18), the search unit 28 associates new schema identification information with the first schema information created by the schema creating unit 26 and stores the schema identification information in the schema storage unit 51 (Step S20).
Then, the adding unit 30 adds, to the structured document developed in the object tree format by the document analyzing unit 22, the schema identification information associated with the second schema information searched by the search unit 28 and the identification information relating to the structured document and stores the structured document in the structured document storage unit 52 (Step S22).
Next, the index creating unit 32 creates the index list by dividing the text information extracted by the document analyzing unit 22 into words and stores (updates) the index list in the index list storage unit 53 (Step S24).
Then, the index creating unit 32 creates, for each divided word, the index information associated with the schema identification information and the identification information added to the structured document by the adding unit 30 and stores the index information in the index storage unit 54 in the inverted file format (Step S26).
Next, the index analyzing unit 36 determines whether the number of index information pieces stored in the inverted file format in the index storage unit 54 exceeds a threshold (Step S28).
When the number of index information pieces exceeds the threshold (Yes at Step S28), the index analyzing unit 36 analyzes the distribution of each of the identification information group including the identification information pieces equal in number to the index information pieces and the schema identification information group including the schema identification information pieces equal in number to the index information pieces (Step S30).
When the number of index information pieces does not exceed the threshold (No at Step S28), Step S30 and its following steps are not performed and the processing ends.
Next, when a result of the distribution analysis of the schema identification information group by the index analyzing unit 36 shows that more than a predetermined number of schema identification information pieces match the schema identification information pieces stored in the first rule storage unit 55 (Yes at Step S32), the first compressing unit 38 compresses the schema identification information group using the group identification information and the in-group identification information (Step S34).
When the result of the distribution analysis of the schema identification information group by the index analyzing unit 36 does not show that more than a predetermined number of schema identification information pieces match the schema identification information pieces stored in the first rule storage unit 55 (No at Step S32), the first compressing unit 38 does not perform the processing of Step S34.
Then, the second compressing unit 40 compresses at least one of the schema identification information group and the identification information group using the compression rule stored in the second rule storage unit 56 and corresponding to the result of the distribution analysis of the index information by the index analyzing unit 36 (Step S36).
Next, the change unit 42 rearranges the index information pieces compressed by the second compressing unit 40 and the first compressing unit 38 for the schema identification information and identification information group and stores them in the index storage unit 54 (Step S38).
Next, a description is made about a specific example of compression by the structured document management device according to the present embodiment.
In an example illustrated in
In an example illustrated in
As illustrated in
In an example illustrated in
Furthermore, in an example illustrated in
As described so far, according to this embodiment, the schema identification information, the group identification information capable of identifying a group to which the schema identification information belongs, and the in-group identification information capable of identifying the schema identification information in the group are stored in association with each other. Thus, the schema identification information group can be compressed using the group identification information and the in-group identification information, and even an index that includes the schema identification information as the structure information of the structured document can be compressed appropriately.
Furthermore, according to the present embodiment, not only the schema identification information group, but also the document identification information group, the element identification information group and the occurrence position information group are compressed using the compression rule corresponding to a result of distribution analysis of the index information, so that the index information can be compressed more effectively.
The structured document management device 1 and the client device 2 in the above-described embodiment each have a hardware configuration including a controller such as a CPU (Central Processing Unit), a storage device such as a ROM (Read Only Memory) and a RAM (Random Access Memory), a display device such as a liquid crystal display, an input device such as a keyboard and a mouse, a communication I/F for connecting to the network for communication, and the like.
The present invention is not limited to the above-described embodiment and may be embodied in various forms by modifying any elements without departing from the scope of the invention. Furthermore, plural elements disclosed in the above-described embodiments may be combined appropriately into various inventions. For example, some of the elements disclosed in the above-described embodiments may be removed. Also, elements of various embodiments may be combined appropriately.
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
2009-071661 | Mar 2009 | JP | national |