This application is based upon and claims the benefit of priority from prior Japanese Patent Application No. 2003-430598, filed Dec. 25, 2003, the entire contents of which are incorporated herein by reference.
Various methods have been proposed for structured data management systems that store and retrieve structured documents described in Extensible Markup Language (XML) and the like.
(1) A simple method of storing structured data intact as a text file. With this method, as the number of data items and the data size increase, storage efficiency deteriorates, and a retrieval process that exploits the features of structured documents becomes harder to achieve.
(2) A method of managing structured document data in an RDB (Relational Database).
(3) A method of managing structured document data using an OODB (Object Oriented Database) which has been developed to manage the structured document data. Backbone systems prevalently use the RDB, and for example, an XML compatible RDB that expands the RDB is commercially available as a product. Since the RDB stores data in a flat table format, complicated mapping is required to determine the correspondence between a hierarchical structure, such as that of XML data, and the table. If the prior schema design for this mapping is insufficient, performance may drop.
In recent years, a new method has been proposed in addition to these methods (1) to (3).
(4) A method of natively managing structured document data. This method stores XML data having various hierarchical structures without any special mapping process. For this reason, no special overhead is required upon storage or acquisition. Also, the need for prior schema design that requires high cost can be obviated, and the XML data structure can be freely changed as needed in correspondence with a change in business environment.
Even if structured document data are efficiently stored, there is no point if no means for extracting the stored data is available. Query languages are used as such means. XQuery (XML Query Language) has been designed for XML, as SQL (Structured Query Language) was for the RDB. XQuery is a language used to handle XML data like a database. For this purpose, means for extracting a data set that matches a condition, and means for compiling and parsing data are provided. Also, since XML data have a hierarchical structure as a combination of parent elements, child elements, sibling elements, and the like, means for tracing such hierarchical structure is provided.
A technique for retrieving structured document data that includes a specific element and specific structure designated by a retrieval condition while tracing the hierarchical structure of the stored structured document data has already been proposed (e.g., Jpn. Pat. Appln. KOKAI Publication Nos. 2002-34618 and 2000-57163).
As the structure of the structured document data grows in scale, the number of structured document data items stored in a database increases, and the retrieval condition becomes more complicated, a longer time is required to trace the elements which form the hierarchical structure of each structured document data item. Also, as the number and size of structured document data items increase, it becomes impossible to expand the stored structured document data in memory, and most of the structured document data are stored in a secondary storage such as a hard disk or the like.
In the method of natively managing structured document data, the hierarchical structure among elements of the structured document data is stored intact. In order to check if an element or structure designated as a retrieval condition is included, elements of structured document data stored on the secondary storage must be frequently accessed. Still more accesses are required for a complicated retrieval condition.
Conventionally, when structured document data having a desired element or structure is retrieved from a database that stores structured document data with a hierarchical structure, a high-speed retrieval process cannot be attained, since the structured document data having the element or structure designated by the retrieval condition is retrieved while tracing the elements which form the hierarchical structure of each structured document data item in the database. In particular, it becomes more difficult to attain a high-speed retrieval process as the size of the structured document data and the number of structured document data items to be retrieved increase.
According to embodiments of the present invention, there is provided a structured data retrieval method and apparatus, which can attain a high-speed retrieval process of structured document data.
A structured data retrieval apparatus stores, in a first memory, a plurality of template IDs used to identify locations of a plurality of structure elements included in a hierarchical structure; stores, in a second memory, a plurality of structured data items each of which includes a plurality of elements, each of the elements being assigned a template ID of one of the structure elements; inputs a retrieval condition which designates a first structure element of the structure elements, and a character string included in the first structure element; retrieves, from the structured data items, a structured data item including a first element which includes the character string and is assigned a first template ID of the first structure element; and outputs the structured data item retrieved.
A structured data retrieval apparatus stores, in a first memory, a plurality of template IDs used to identify locations of a plurality of structure elements included in a hierarchical structure; stores, in a second memory, a plurality of structured data items each of which includes a plurality of elements, each of the elements being assigned a template ID of one of the structure elements; inputs a retrieval condition which designates a first structure element which is one of the structure elements, a second structure element which is another of the structure elements and includes the first structure element, and a character string included in the first structure element; retrieves, from the structured data items, a structured data item including a first element which includes the character string and is assigned a first template ID of the first structure element, and a second element which includes the first element and is assigned a second template ID of the second structure element; and outputs the structured data item retrieved.
A structured data retrieval apparatus stores, in a first memory, a plurality of template IDs used to identify locations of a plurality of structure elements included in a hierarchical structure; stores, in a second memory, a plurality of structured data items each of which includes a plurality of elements, each of the elements being assigned a template ID of one of the structure elements; inputs a retrieval condition which designates a first structure element which is one of the structure elements, a second structure element which is one of the structure elements and includes the first structure element, a third structure element which is one of the structure elements and includes the first and second structure elements, and a character string included in the first structure element; retrieves, from the structured data items, a structured data item including a first element which includes the character string and is assigned a first template ID of the first structure element, a second element which includes the first element and is assigned a second template ID of the second structure element, and a third element which includes the first and second elements and is assigned a third template ID of the third structure element; and outputs the structured data item retrieved.
Preferred embodiments of the present invention will be described hereinafter with reference to the accompanying drawings.
In this example, a root element of elements is bounded by <book> tags. This “book” element includes three child elements bounded by <title>, <authors>, and <abstract> tags. The “authors” element includes two child elements having <author> tags. Each “author” element includes child elements bounded by <first> and <last> tags. The “first” and “last” elements respectively have text elements “Taro”, “Tanaka”, and the like.
The client 201 mainly comprises a structured document registration unit 202, retrieval unit 203, input unit 204, and display unit 205. The input unit 204 comprises input devices such as a keyboard, mouse, and the like, and is used to input a structured document and various instructions. The structured document registration unit 202 registers a structured document input from the input unit 204 and that which is pre-stored in a storage device or the like of the client 201 in a structured document database (structured document DB) 111. The structured document registration unit 202 transmits a storing request to the server 101 together with a structured document to be registered.
The retrieval unit 203 generates query data which describes a retrieval condition and the like used to retrieve desired data from the structured document database 111 in accordance with an instruction input by the user from the input unit 204, and transmits a retrieval request including the query data to the server 101. Also, the retrieval unit 203 receives retrieval result data corresponding to the transmitted retrieval request from the server 101, and displays it on the display unit 205.
The server 101 comprises a request processing unit 102, storing processing unit 103, and retrieval processing unit 104. Also, the structured document database 111 is connected to the server 101. The structured document database 111 comprises a structured document data storage unit 112, structure template storage unit 113, and index data storage unit 114.
The request processing unit 102 discriminates the storing request and retrieval request transmitted from the client 201, and distributes processes to the storing processing unit 103, retrieval processing unit 104, and the like. Also, the request processing unit 102 returns the processing results of the storing processing unit 103 and retrieval processing unit 104 to the client 201.
The storing processing unit 103 executes a process for storing a structured document transmitted from the client 201 in response to the storing request received from the client 201. The storing processing unit 103 comprises a document parsing unit 31, document structure extraction unit 32, document structure collation unit 33, and document storing unit 34.
The document parsing unit 31 parses a structured document passed from the request processing unit 102, and the document structure extraction unit 32 extracts the (document) structure of the structured document on the basis of the parsing result. The document structure collation unit 33 collates the extracted structure with structure templates stored in the structured document database 111. The document storing unit 34 stores data of the structured document in the structured document data storage unit 112 of the structured document database 111 on the basis of the collation result of the document structure collation unit 33, and stores index data in the index data storage unit 114.
The retrieval processing unit 104 executes a process for retrieving data that matches the designated condition (query data) from the structured document database 111 upon reception of the retrieval request from the client 201, and returning the retrieved data as retrieval result data. The retrieval processing unit 104 comprises a query parsing unit 41, query structure extraction unit 42, query structure collation unit 43, and query execution unit 44.
The query parsing unit 41 parses query data passed from the request processing unit 102, and the query structure extraction unit 42 extracts the structure of that query data on the basis of the parsing result. The query structure collation unit 43 collates the extracted structure with structure templates stored in the structured document database 111. The query execution unit 44 accesses structured document data, structure templates, and lexical index data stored in the structured document database 111 on the basis of the collation result of the query structure collation unit 43, and generates retrieval result data which matches the condition described in the query data.
Programs, which respectively implement the functions of the request processing unit 102, storing processing unit 103, and retrieval processing unit 104 in
The following description will be given with reference to
The storage method of a structured document in the structured document DB 111 will be described first.
The arcs which represent the parent-child relationship among nodes are links among object data, which are stored in the structured document data storage unit 112 as an OID sequence indicating an object set of child elements in object data.
There are two nodes, i.e., “bookFolder” and “paperFolder” nodes 302 and 303 under the “root” node 301. There are two “book” nodes 304 and 305 under the “bookFolder” node. The “book” node with the OID “2” stores the structured document data shown in
In this manner, data under the “root” node form one large structured document data which includes elements of a plurality of structured documents. The structured document data shown in
When such hierarchical structure including a plurality of nodes is applied to a directory structure which is prevalently adopted in a versatile OS, these nodes correspond to folders and files in the directory structure. That is, the hierarchical structure shown in
In the following description, the “root” node, “bookFolder” node, and “paperFolder” node will be interpreted as folders, and data under these folders will be interpreted as document files together. For example, in case of
The structured document DB shown in
The structured document data shown in
The index data storage unit 114 stores a lexical table that records a plurality of lexical items and, linked with each lexical item in the lexical table, the OIDs of the text elements that include that lexical item. By tracing a link from a given lexical item in the lexical table, the position of appearance of a text element including that lexical item, i.e., the OID, can be obtained.
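The lexical table described above can be pictured with the following minimal sketch. It assumes that lexical items are whitespace-separated tokens and that the table is held in an in-memory dictionary; neither assumption comes from the embodiment, and the names `LexicalIndex`, `add_text`, and `lookup` as well as the integer OIDs are hypothetical.

```python
from collections import defaultdict


class LexicalIndex:
    """Maps each lexical item to the OIDs of the text elements containing it."""

    def __init__(self):
        self._table = defaultdict(set)

    def add_text(self, oid, text):
        # Assumption: lexical items are whitespace-separated tokens.
        for item in text.split():
            self._table[item].add(oid)

    def lookup(self, item):
        # Trace the link from a lexical item to its positions of appearance.
        return set(self._table.get(item, ()))


index = LexicalIndex()
index.add_text(10, "XML database")  # hypothetical OID 10
index.add_text(27, "XML primer")    # hypothetical OID 27
hits = index.lookup("XML")
```

Looking up a lexical item returns every OID linked with it, so a text element can be located without scanning the stored documents.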
The structure template storage unit 113 stores structure template data. The structure template storage unit 113 stores structure data extracted from structured document data stored in the structured document data storage unit 112.
In
Nodes (which correspond to folders, files, elements, and text elements) expressed by hexagons of the structure template data shown in
The template ID will be described below. The template ID includes information which indicates the type of the node of interest on the structure template, and a number which is used to identify each node among nodes of the same type. The node types are expressed by four letters “F”, “D”, “E”, and “T”. “F” represents a folder, “D” represents a document file, “E” represents an element (which is not a text element), and “T” represents a text element. From a template ID, which consists of the letter indicating the node type followed by the number “x”, both the node on the structure template to which the template ID corresponds and the type of that node can be identified.
A node with a template ID “Fx” represents a folder, and is called a folder type structure template node. A node with a template ID “Dx” represents a document, and is called a document type structure template node. A node with a template ID “Ex” represents an element (which is not a text element) in the document, and is called an element type structure template node. A node with a template ID “Tx” represents a text element in the document, and is called a text type structure template node. Note that “x” is a serial integer which is unique to each node of the structure template data.
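As an illustration of this naming rule, the following sketch splits a template ID into its type letter and serial number. The function name `parse_template_id` and the use of a regular expression are assumptions made for the example only.

```python
import re
from typing import NamedTuple

# Node types named in the description above.
NODE_TYPES = {
    "F": "folder",
    "D": "document file",
    "E": "element",
    "T": "text element",
}


class TemplateId(NamedTuple):
    kind: str    # one of "F", "D", "E", "T"
    number: int  # serial integer "x"


def parse_template_id(tid: str) -> TemplateId:
    """Split a template ID such as "F1" into its type letter and number."""
    match = re.fullmatch(r"([FDET])(\d+)", tid)
    if match is None:
        raise ValueError(f"not a template ID: {tid!r}")
    return TemplateId(match.group(1), int(match.group(2)))


folder = parse_template_id("F1")
```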
In the structured document data storage unit 112 according to this embodiment, the OIDs used to identify nodes corresponding to the “root” node 301, “bookFolder” node 302, “paperFolder” node 303 (
The DocID is a unique ID in a data file, which is assigned to a document or folder, and is an identifier of a document file or folder. The ElemID is a unique ID in each document, which is assigned to each element in the document. The TID is the ID of a node in the structure template data, i.e., the template ID, as described above.
For example, the “root” node has a DocID “0”, the “bookFolder” node has a DocID “1”, and the two “book” nodes respectively have DocIDs “2” and “3”. Furthermore, the “paperFolder” node and “paper” node have DocIDs which assume unused values other than “0” to “3” above (not shown in
Respective elements (including text elements) in the “book” document under the “book” node with the DocID “2” are assigned ElemIDs “0” to “15”. With this ElemID, each element in the document can be identified.
Furthermore, respective elements (including text elements) in the “book” document under the “book” node with the DocID “2” are assigned the TIDs of nodes corresponding to these elements in the structure template shown in
In this manner, by checking the OID of an arbitrary element in a document file, that document file including a node with that OID can be identified based on the DocID included in the OID, and the location of the node in the structure template and node type can be identified based on the TID included in the OID.
For example, a text node (text element) “XML database” included in the “book” document 311 in
As described above, according to this embodiment, each element of a structured document stored in the structured document DB 111 is identified by the OID which includes the DocID as the identifier of a folder or file to which that element belongs, the ElemID used to identify that element in the file to which the element belongs, and the TID as the identifier on the structure corresponding to that element.
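The three-part OID can be modeled as a simple record. The sketch below is one possible representation, not the storage layout of the embodiment; the class name `Oid` and the helper `same_document` are hypothetical, and the element-level values in the usage lines are illustrative.

```python
from typing import NamedTuple


class Oid(NamedTuple):
    doc_id: int   # DocID: identifies the folder or document file
    elem_id: int  # ElemID: identifies the element within that file
    tid: str      # TID: identifies the node on the structure template

    def __str__(self) -> str:
        return f"<{self.doc_id}, {self.elem_id}, {self.tid}>"


def same_document(a: Oid, b: Oid) -> bool:
    """Two elements belong to the same file when their DocIDs match."""
    return a.doc_id == b.doc_id


book_folder = Oid(1, 0, "F1")  # the OID <1, 0, F1> used for "bookFolder"
first = Oid(2, 9, "T9")        # hypothetical text element in document 2
second = Oid(2, 13, "T9")      # hypothetical text element in document 2
```

Because the DocID and TID are carried inside the OID itself, the owning file and the structural position of an element can be read off without accessing the stored document.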
The difference between index data shown in
The processing operation of the storing processing unit 103 in
The structured document registration unit 202 of the client 201 transmits a structured document to be stored, and a storing request message which includes the OID of a folder as the storage destination of this structured document. Note that OIDp represents the OID of the storage destination folder.
Note that the client 201 can obtain the OID of the storage destination folder as follows. The retrieval unit 203 of the client 201 has a GUI used to display a schematic structure of the structured document DB 111 shown in, e.g.,
The request processing unit 102 of the server receives a storing request message which includes a structured document to be stored and the OIDp of the storage destination folder (step S1). A case will be examined below wherein, for example, the OIDp (<1, 0, F1>) corresponding to the “bookFolder” 302 is designated as the storage destination folder, and the new document is to be stored under this folder.
The structured document to be stored, which is included in the storing request message, is passed to the document parsing unit 31 of the storing processing unit 103, and is parsed. As a result, a hierarchical structure including a plurality of object data of the structured document is obtained, and is expanded on the memory (step S2). More specifically, the document parsing unit 31 has a function corresponding to an XML parser which applies a parsing process to the structured document as XML data to map that data into object data in the DOM (Document Object Model) format.
Furthermore, a new document ID (DocID) is assigned to that new structured document (step S3).
The document structure extraction unit 32 extracts the structure of the structured document, i.e., a plurality of nodes corresponding to elements in the structured document and a structure which includes the plurality of nodes, by tracing the parsing result of the document parsing unit 31 from its root. Let Sc be the structure of the structured document (step S4).
The document structure collation unit 33 acquires a structure from the structure template storage unit 113 using the OIDp of the storage destination folder as a key. For example, if the OIDp is <1, 0, F1>, the unit 33 acquires the TID “F1”. Let TIDp be the TID acquired from this OIDp. The document structure collation unit 33 acquires a corresponding structure by scanning the structure template storage unit 113 using the TIDp as a key (step S5). Let Sp be the acquired structure (step S6).
The document structure collation unit 33 collates Sc and Sp (step S7). This process is implemented by simple matching between trees. That is, if a structure element of Sp corresponding to that of Sc is found, the TID of the structure element of Sp is assigned to that of Sc. If no structure element of Sp corresponding to that of Sc is found, a new TID is assigned to the element which is not included in Sp but is included in Sc, and that new element is added to Sp. Also, the new TID is assigned to the new element of Sc. This operation is repeated for all structure elements of Sc.
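The collation of Sc against Sp can be sketched as a recursive walk over the two trees. The sketch assumes that structure elements are matched by name under the same parent and that a new TID is formed from the node type letter and the next serial number; the class names, the function `collate`, and the TID values in the usage lines are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional


@dataclass
class TemplateNode:
    """A structure element of Sp."""
    name: str
    tid: str
    children: Dict[str, "TemplateNode"] = field(default_factory=dict)


@dataclass
class DocNode:
    """A structure element of Sc."""
    name: str
    kind: str = "E"  # node type letter: "D", "E", or "T"
    children: List["DocNode"] = field(default_factory=list)
    tid: Optional[str] = None


def collate(sc: DocNode, sp_parent: TemplateNode, serial) -> None:
    """Assign a TID to every node of Sc, extending Sp where needed."""
    match = sp_parent.children.get(sc.name)
    if match is None:
        # No corresponding structure element in Sp: add one with a new TID.
        match = TemplateNode(sc.name, f"{sc.kind}{next(serial)}")
        sp_parent.children[sc.name] = match
    sc.tid = match.tid
    for child in sc.children:
        collate(child, match, serial)


# Hypothetical template: bookFolder (F1) > book (D2) > title (E3).
sp = TemplateNode("bookFolder", "F1", {
    "book": TemplateNode("book", "D2", {"title": TemplateNode("title", "E3")}),
})
doc = DocNode("book", "D", [DocNode("title"), DocNode("price")])
collate(doc, sp, count(4))  # next unused serial number is assumed to be 4
```

In this usage, "book" and "title" reuse the TIDs already present in Sp, while "price" receives a new TID and is added to Sp.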
The document structure collation unit 33 assigns element IDs (ElemIDs) to respective elements of Sc (step S8). For example, the unit 33 assigns the ElemIDs to respective elements while tracing the structure of Sc downstream from the root node.
With the above process, the OID <DocID, ElemID, TID> is assigned to each element in Sc. For example, the OID of a root object of the structured document to be stored is <DocID, 0, TID>.
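The assignment of ElemIDs and the resulting OIDs can be sketched as follows. A depth-first, left-to-right traversal is assumed here; the embodiment only states that the structure is traced downstream from the root node. The dictionary layout, the function name `assign_oids`, and all ID values are hypothetical.

```python
from itertools import count


def assign_oids(root, doc_id):
    """Assign <DocID, ElemID, TID> to every node, visiting the root first."""
    serial = count(0)  # the root object receives ElemID 0
    stack = [root]
    while stack:
        node = stack.pop()
        node["oid"] = (doc_id, next(serial), node["tid"])
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(node.get("children", [])))
    return root


title_text = {"tid": "T4"}
title = {"tid": "E3", "children": [title_text]}
price = {"tid": "E5"}
book = {"tid": "D2", "children": [title, price]}
assign_oids(book, doc_id=4)  # hypothetical new DocID 4
```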
Finally, the document storing unit 34 stores the updated Sp in the structure template storage unit 113. In this manner, the structure template stored in the structure template storage unit 113 is updated.
The document storing unit 34 updates the contents of the index data storage unit 114 on the basis of text elements of a plurality of elements which form Sc (step S9 in
Furthermore, the document storing unit 34 acquires an object corresponding to the OIDp given as the storage destination by scanning the structured document data storage unit 112, and adds the OIDs of respective elements of the structured document to be stored to an OID sequence indicating a set of objects of child elements of that object data. More specifically, the structured document to be stored in which the aforementioned OIDs are assigned to the respective elements is stored in the structured document storage unit 112 to be added immediately under the “bookFolder” 302 with the OIDp <1, 0, F1> (step S10 in
The processing operation of the retrieval processing unit 104 in
The query data shown in
The query data shown in
An outline of the processing operation of the retrieval processing unit 104 which has received the query data shown in, e.g.,
The query data received by the request processing unit 102 is passed to the query parsing unit 41 of the retrieval processing unit 104. The query parsing unit 41 parses the received query data (step S101). The query structure extraction unit 42 extracts a graph structure called a query graph from the query data on the basis of the parsing result of the unit 41 (step S102). For example, in case of the query data shown in
The query graph is formed by connecting variables corresponding to element names (e.g., “db “DB””, “book”, “last”) and character strings (e.g., “Tanaka”, “Nakamura”) included in the query data in accordance with the inclusive relationship of the elements and character strings included in the query data, as shown in
The query structure collation unit 43 extracts a structure from the structure template storage unit 113 of the structured document DB 111. Let Sp be the extracted structure. In this case, the structure below the most upstream element of the hierarchical tree of the structured document database, i.e., the “book” element, which is designated in the query data, is extracted. The extracted structure Sp is collated with above Sc. As a result, the TIDs that can be assumed are assigned to respective elements of Sc (step S103).
The query execution unit 44 checks if the condition expressed by the query graph includes an AND condition or an OR condition. Since a process for an AND condition is a fundamental one, and that for an OR condition is a modification of that process, a detailed description of the process for an OR condition will be omitted.
The process for an AND condition generates, in turn, data which represents a combination, called a table, of values that a variable set can assume, for the purpose of embodying all variables included in the query graph. Note that a unit process for generating one table is called an operator.
It is checked if all variables included in the query graph are embodied by one table (step S104). If Yes in step S104, since a combination of values of all variables included in the query graph is embodied, that combination is output as a result. Note that the value of each variable is the OID.
If not all variables included in the query graph are embodied by one table, steps S105 to S110 are repeated until they are embodied.
It is checked in step S105 if a retrieval process using the index data stored in the index data storage unit 114 can be made. If the query includes a function that can use the lexical index data, such as "contains" or the like, a high-speed retrieval process can be attained using the index data in the structured document DB 111. In this case, a LexicalScanWithTid operator is executed.
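The effect of the LexicalScanWithTid operator can be sketched as an index lookup that keeps only the OIDs whose TID is among those assigned to the variable during collation. The index layout (a dictionary from lexical item to OIDs), the function name, and the sample values are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Oid = Tuple[int, int, str]  # <DocID, ElemID, TID>


def lexical_scan_with_tid(
    index: Dict[str, Iterable[Oid]], word: str, tids: Set[str]
) -> List[Oid]:
    """Return the OIDs linked with `word` whose TID is one of `tids`."""
    return sorted(oid for oid in index.get(word, ()) if oid[2] in tids)


# Hypothetical index: "Tanaka" appears in two differently placed text elements.
index = {"Tanaka": [(2, 9, "T9"), (5, 4, "T20")]}
table1 = lexical_scan_with_tid(index, "Tanaka", {"T9"})
```

Only the OID carrying the TID "T9" survives, so text elements that contain the character string at a different position in the structure are discarded without the documents being read.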
It is checked in step S106 in
It is checked in step S107 if identical variables are generated in a plurality of tables. In such case, a Join operator is executed for two tables each.
It is checked in step S108 if all variables, the values of which are to be acquired, are embodied, and only “db( )” which is located at the head of the query data and designates the root of the database remains. In such case, a Nop operator (no operation) is executed.
It is checked in step S109 if the document type TID is assigned to a variable as an upper layer of arbitrary two variables, and the values of these two variables are embodied. In such case, a FilterDocument operator is executed.
It is checked in step S110 if a variable is present in an upper layer of variables, the variables in a lower layer are embodied, and the variable in the upper layer is not embodied. In such case, a ScanAncestorWithTid operator is executed.
In step S111, a result output process is done. In this case, combinations of values (OIDs) that respective variables can assume (a combination of OIDs) are obtained as a table. Each combination includes a plurality of OIDs having an identical document ID and, hence, the combination on the table corresponds to one structured data. By extracting structured document data corresponding to respective document IDs obtained from the combinations on the table, a set of structured document data which match the query data can be obtained.
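The result output step can be sketched as reading the document ID out of each combination on the table. The table layout (a list of dictionaries from variable name to OID), the function name `matching_doc_ids`, and the sample values are hypothetical.

```python
def matching_doc_ids(table):
    """Collect the DocID shared by the OIDs of each combination, in order."""
    doc_ids = []
    for row in table:
        ids = {oid[0] for oid in row.values()}
        # Each combination is expected to carry one identical document ID.
        if len(ids) == 1:
            doc_id = ids.pop()
            if doc_id not in doc_ids:
                doc_ids.append(doc_id)
    return doc_ids


table = [
    {"V1": (2, 9, "T9"), "V2": (2, 0, "D2"), "V3": (2, 13, "T9")},
    {"V1": (6, 9, "T9"), "V2": (6, 0, "D2"), "V3": (6, 13, "T9")},
]
result = matching_doc_ids(table)
```

The structured document data corresponding to each returned document ID would then be fetched from the structured document data storage unit 112.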
In the query graph shown in
Respective variables in a query graph shown in
(1) Since the query graph includes a value comparison tag node for which a "contains" function can be used, a LexicalScanWithTid operator is executed for a character string “Tanaka”. As a result, a variable node V1 is embodied (Table1 shown in
(2) Likewise, a LexicalScanWithTid operator is executed for a character string “Nakamura”. As a result, a variable node V3 is embodied (Table2 shown in
(3) Since the variables V1 and V3 are embodied, and an upper variable V2 in the query graph is a document type node, a FilterDocument operator is executed. The FilterDocument operator executes an operation for checking the combinations of variable values in two tables (Table1, Table2 in FIGS. 18(a) and 18(b)), and if a document ID (DocID) included in only one of the two tables is found, removing that record from the table. As a result, Table1 and Table2 are obtained, as shown in FIGS. 18(c) and 18(d).
(4) Since the variable V1 is embodied, and the TID of the variable V2 is document {D2}, a parent document acquisition operation can be applied to the variable V1. A GetDocument operator is executed. As a result, the variable V2 is embodied (Table3 shown in
(5) Likewise, a GetDocument operator is executed for the variable V3. As a result, the variable V2 is embodied (Table4 shown in
(6) Since the variable V2 is embodied in different tables (Table3 and Table4), as described in (4) and (5), a Join operator is executed (
(7) Since a variable V0 is not an output operator, an Nop operator is executed.
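Two of the operators used in steps (1) to (7) can be sketched as follows: FilterDocument, which removes records whose document ID appears in only one of two tables, and Join, which combines two tables on a shared variable. Tables are modeled as lists of dictionaries from variable name to OID; this layout, the function names, and all sample values are assumptions for illustration.

```python
def doc_ids(table):
    """Document IDs that occur anywhere in a table."""
    return {oid[0] for row in table for oid in row.values()}


def filter_document(table_a, table_b):
    """Drop records whose DocID is found in only one of the two tables."""
    common = doc_ids(table_a) & doc_ids(table_b)

    def keep(table):
        return [r for r in table if all(o[0] in common for o in r.values())]

    return keep(table_a), keep(table_b)


def join(table_a, table_b, var):
    """Combine records of two tables that agree on the variable `var`."""
    return [{**a, **b} for a in table_a for b in table_b if a[var] == b[var]]


# Hypothetical results of the two lexical scans.
table1 = [{"V1": (2, 9, "T9")}, {"V1": (7, 9, "T9")}]
table2 = [{"V3": (2, 13, "T9")}, {"V3": (8, 9, "T9")}]
table1, table2 = filter_document(table1, table2)

# Hypothetical tables after the parent document has been acquired.
table3 = [{"V1": (2, 9, "T9"), "V2": (2, 0, "D2")}]
table4 = [{"V3": (2, 13, "T9"), "V2": (2, 0, "D2")}]
joined = join(table3, table4, "V2")
```

Filtering by document ID before acquiring the parent document keeps the tables small, so the join operates only on records that can still contribute to a result.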
In Table1 in
In Table2 in
As shown in
As shown in
As shown in
A combination of values that the variables V1, V2, and V3 can assume (a combination of object IDs) is obtained by the Join operator that joins Table3 and Table4 in association with the variable V2 in
A conventional retrieval process will be described below with reference to
A large difference between
The LexicalScan operator in
The ScanAncestor operator in
By contrast, the GetDocument operator in
The query data shown in
As shown in
As shown in
As described above, according to the above embodiment, a structure template of a hierarchical structure including a plurality of structure elements, each of which has a template ID used to identify that structure element, is stored in the structure template storage unit 113, and a plurality of structured documents, each of which includes a plurality of elements that are each assigned the template ID of one of the structure elements, are stored in the structured document data storage unit 112.
(1) When a retrieval condition which designates a character string and a first structure element which is one of the structure elements of the hierarchical structure and includes that character string is input, structured document data including a first element which includes the character string and has a template ID corresponding to the first structure element is retrieved from the structured documents, and the retrieved structured document data is output.
(2) When a retrieval condition which designates a character string, a first structure element which is one of the structure elements of the hierarchical structure and includes that character string, and a second structure element which is another one of the structure elements of the hierarchical structure and includes the first structure element, is input, structured document data including a first element which includes the character string and has a template ID corresponding to the first structure element, and a second element which includes the first element and has a template ID corresponding to the second structure element is retrieved from the structured documents, and the retrieved structured document data is output.
(3) When a retrieval condition which designates a character string, a first structure element which is one of the structure elements of the hierarchical structure and includes that character string, a second structure element which is one of the structure elements of the hierarchical structure and includes the first structure element, and a third structure element which is one of the structure elements of the hierarchical structure and includes the first and second structure elements, is input, structured document data including a first element which includes the character string and has a template ID corresponding to the first structure element, a second element which includes the first element and has a template ID corresponding to the second structure element, and a third element which includes the first and second elements and has a template ID corresponding to the third structure element is retrieved from the structured documents, and the retrieved structured document data is output.
As described above, according to the above embodiment, upon obtaining an object ID set for each structure element designated as a retrieval condition, by using the template ID of each structure element that is included in a structure designated as the retrieval condition, only object IDs including the template ID of that structure element can be selected, thus allowing a high-speed retrieval process.
Each element included in each structured document data item stored in the structured document data storage unit 112 is assigned an object ID, which includes a document ID used to identify the structured document data including that element, an element ID used to identify that element in the structured document data including the element, and a template ID of the structure element in the structure template corresponding to that element. For this reason, once the object ID of an element that satisfies a retrieval condition is obtained, the object ID of its upstream element can be obtained by rewriting the element ID and template ID of that object ID. That is, an upstream element in the same structured document data can be obtained without tracing the structure of the structured document data. Also, the retrieval range in the structured document DB can be narrowed down in advance on the basis of the template ID and document ID included in the object ID. As a result, a high-speed retrieval process can be achieved.
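For the case where the upstream element is the document itself, the rewriting can be sketched as below. The sketch relies on the root object of a stored document having the element ID 0, as stated for the storing process, and takes the document type TID as an argument; the function name `get_document` and the sample values are hypothetical.

```python
def get_document(oid, doc_tid):
    """Derive the OID of the enclosing document from an element's OID.

    Only the ElemID and TID are rewritten; the DocID is kept, so the
    stored document does not have to be traced.
    """
    doc_id, _elem_id, _tid = oid
    return (doc_id, 0, doc_tid)  # root object of a document has ElemID 0


text_oid = (2, 9, "T9")  # hypothetical text element found via the index
doc_oid = get_document(text_oid, "D2")
```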
The method of the present invention described in the embodiment of the present invention can be stored as a program that can be executed by a computer in a recording medium such as a magnetic disk (flexible disk, hard disk, or the like), an optical disk (CD-ROM, DVD, or the like), a semiconductor memory, or the like, and can be distributed.
Number | Date | Country | Kind |
---|---|---|---|
2003-430598 | Dec 2003 | JP | national |