Embodiments described herein relate generally to a structured document management apparatus and a structured document search method.
In the related art, a technique of generating electronic data as a structured document to make it easy to share and efficiently search information is known. For example, the HyperText Markup Language (HTML) can express the structure of a document by describing constituent elements of the document, such as a section title, the body text, or a list structure, using tags. Moreover, the Extensible Markup Language (XML), which allows tags that express a document structure to be defined uniquely for a given purpose, is also used. When data is searched for in such a structured document, the tags make it easy to identify which data is located at which position in the document. Thus, search performance can be improved.
As a method of displaying the results of a search on such a structured document, a document summarization technique of automatically generating a summary from sentences in the search results and displaying the summary is known. Keyword-in-context (KWIC) is known as a typical document summarization technique. According to the KWIC technique, the text that includes a search keyword is extracted from a search target document together with a predetermined number of characters before and after it, and is displayed.
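As an illustration of the KWIC technique mentioned above, the following minimal sketch extracts each occurrence of a keyword together with a fixed number of surrounding characters. The function name `kwic` and the parameter `width` are hypothetical names chosen for this example and do not appear in the related art.

```python
def kwic(text: str, keyword: str, width: int = 20) -> list[str]:
    """Return each keyword occurrence with up to `width` characters of context on each side."""
    snippets = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        # Clamp the left edge at 0; slicing past the right edge is safe in Python.
        snippets.append(text[max(0, start - width):end + width])
        start = text.find(keyword, end)
    return snippets


if __name__ == "__main__":
    print(kwic("Set up the wireless LAN router first.", "wireless LAN", width=7))
```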
Moreover, as another method of displaying the search results on the structured document, a method of displaying section titles corresponding to a document that includes a word identical to a keyword used for search as search results is known.
However, in the case of displaying section titles as the search results, even if a search keyword is identical to a word in the document, when the section titles have a low degree of relevance to the search keyword, the user may not recognize that the information is what the user tries to find. In this case, the user needs to personally read the sentence to check whether the information is relevant to the content that the user wants to find. Thus, there is a need to further improve search convenience.
According to an embodiment, a structured document management apparatus includes a document storage unit, a section title extracting unit, a relevance calculator, a document search unit, a section title selector, and a section title display controller. The document storage unit is configured to store a structured document that includes a plurality of section texts each including a section title and a body text. The section title extracting unit is configured to extract the section titles from the structured document to create a section title list. The relevance calculator is configured to calculate degrees of conceptual relevance between the section title and words included in the section text corresponding to the section title for each of the section texts. The document search unit is configured to search for the section text that includes the word identical to a search keyword. The section title selector is configured to select the section title having a higher degree of relevance with the word identical to the search keyword more preferentially than the section title having a lower degree of relevance with the word identical to the search keyword. The section title display controller is configured to display the selected section title on a display unit as a presentation section title.
Hereinafter, a first embodiment of a structured document management apparatus will be described in detail with reference to the drawings.
When the user powers on the server 1 and the client terminal 3, the CPU 101 activates a program called a loader in the ROM 102 to read a program called an operating system (OS), which manages hardware and software of a computer, from the HDD 104 into the RAM 103, and to activate the OS. Such an OS activates a program and reads and stores information according to an operation of the user. As a typical OS, Windows (registered trademark), UNIX (registered trademark), and the like are known. Programs running on such an OS are called application programs. Application programs are not limited to those running on a predetermined OS, and may be those which cause the OS to take over execution of part of various types of processing described later and those which are included as part of a group of program files that constitutes predetermined application software, an OS, or the like.
Here, the server 1 stores a structured document management program in the HDD 104 as an application program. In this sense, the HDD 104 functions as a storage medium that stores the structured document management program. Moreover, in general, an application program installed in the HDD 104 of the server 1 is provided in a state of being recorded on the storage medium 110 such as media of various schemes, for example, various types of optical disks such as a CD-ROM and a DVD, various types of magneto-optical disks, various types of magnetic disks such as a flexible disk, and semiconductor memories. Thus, the portable storage medium 110 such as an optical information storage medium (for example, a CD-ROM) or a magnetic medium (for example, an FD) can be a storage medium that stores the structured document management program. Further, the structured document management program may be imported from the outside via the communication controller 106 and installed in the HDD 104.
In the server 1, when the structured document management program running on the OS is activated, the CPU 101 intensively controls the respective components by executing various types of arithmetic processing according to the structured document management program. On the other hand, in the client terminal 3, when an application program running on the OS is activated, the CPU 101 intensively controls the respective components by executing various types of arithmetic processing according to the application program. Among various types of arithmetic processing executed by the CPU 101 of the server 1 and the client terminal 3, characteristic processing of the structured document management system according to the embodiment will be described below.
The structured document registration unit 11 registers structured document data input from the input unit 108 and structured document data stored in advance in the HDD 104 of the client terminal 3 in a structured document database (structured document DB) 21 of the server 1, which will be described later. The structured document registration unit 11 sends a storage request to the server 1 together with the structured document data to be registered.
The search unit 12 creates query data that describes search keywords or the like for searching the structured document DB 21 for desired data according to an instruction of the user input from the input unit 108 and sends a search request including the query data to the server 1. Moreover, the search unit 12 receives result data corresponding to the search request sent from the server 1 and displays the result data on the display unit 107.
On the other hand, the server 1 includes a registration unit 22 and a search unit 23 as functional configurations that are realized by the structured document management program. Moreover, the server 1 includes the structured document DB 21 which uses a storage device such as the HDD 104.
The registration unit 22 performs a process of receiving a storage request from the client terminal 3 and storing the structured document data sent from the client terminal 3 in the structured document DB 21. The registration unit 22 includes a storage interface unit 24, a section title extracting unit 25, and a relevance calculator 26.
The storage interface unit 24 receives the input of the structured document data and parses the structured document data sent from the client terminal 3 in order to store the structured document data in the structured document DB 21. Moreover, the storage interface unit 24 assigns an identifier (hereinafter, referred to as an element ID) to elements that appear in data so that the orders of appearance of the elements can be compared, and then, stores the structured document data to which the element ID is assigned in the structured document DB 21 (a structured document data storage unit). The element ID may be manually assigned in advance to the structured document on the client terminal 3 side.
In
Similarly,
The section title extracting unit 25 extracts section titles from the structured document accepted from the storage interface unit 24 and lists the extracted section titles. When section titles are extracted, the text surrounded by the <sectitle> elements within a structured document is recognized as section titles.
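The extraction described above can be sketched as follows, assuming that element IDs are carried in an `eid` attribute and that the body text uses a `<para>` element. Both are assumptions made for this example, as are the sample document and the function name `extract_section_titles`. Titles on parent layers are carried down to nested `<sec>` elements, which is how this embodiment treats child section texts.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag names other than <sec> and <sectitle> are assumed.
SAMPLE = """
<doc>
  <sec eid="1">
    <sectitle eid="2">Network setup</sectitle>
    <para eid="3">Connect the router.</para>
    <sec eid="4">
      <sectitle eid="5">Wireless LAN</sectitle>
      <para eid="6">Choose an access point.</para>
    </sec>
  </sec>
</doc>
"""


def extract_section_titles(xml_text):
    """Map each <sec> element ID to the (eid, title) pairs that apply to it, parent layers first."""
    root = ET.fromstring(xml_text)
    title_list = {}

    def walk(element, inherited):
        for child in element:
            if child.tag != "sec":
                continue
            # Titles that are direct children of this <sec> element.
            own = [(t.get("eid"), (t.text or "").strip()) for t in child.findall("sectitle")]
            titles = inherited + own
            title_list[child.get("eid")] = titles
            walk(child, titles)  # nested sections inherit the titles collected so far

    walk(root, [])
    return title_list


if __name__ == "__main__":
    for section_eid, titles in extract_section_titles(SAMPLE).items():
        print(section_eid, titles)
```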
Moreover, in the structured document corresponding to the document ID 2, @eid=203, 206, and 212 are respectively extracted as section titles for the section texts indicated by the element IDs 202, 205, and 211. Further, two section titles of @eid=206 and 209 are extracted for the section text indicated by the element ID 208. That is, not only the section title of @eid=209 enclosed in the <sec> tags of the section text itself but also the section title of @eid=206 on the parent layer is extracted as a section title of the section text indicated by the element ID 208. In this embodiment, a child text is a section text defined by the <sec> element on the child layer within the <sec> element that defines a section text on the parent layer. In the structured document illustrated in
The section title extracting unit 25 stores the generated section title list in the structured document DB 21 and delivers the section title list to the relevance calculator 26. The relevance calculator 26 calculates the degrees of relevance between the section titles extracted by the section title extracting unit 25 and the words included in the corresponding section text. A concept dictionary illustrated in
The relevance calculator 26 extracts words from respective section titles and calculates the degrees of relevance between the extracted words and the words in the body text. An existing word extraction method can be used; here, words registered in a concept dictionary are recognized in the text and extracted. For example, two words “LAN” and “wireless LAN” are extracted as words from the section title “troubleshooting of wireless LAN” defined at @eid=116. On the other hand, words “LAN”, “wireless LAN”, “router”, and “access point” are extracted from the body text defined at @eid=115 of the section text. In this case, the degrees of relevance between each of these words and each of the words in the section title are calculated. The degrees of relevance between the words “LAN”, “wireless LAN”, “router”, and “access point” and the word “LAN” are “1.0”, “0.333”, “0.333” and “0.333”, respectively, and the degrees of relevance between the words “LAN”, “wireless LAN”, “router”, and “access point” and the word “wireless LAN” are “0.333”, “1.0”, “0.25”, and “0.25”, respectively. In this case, since the higher degree of relevance is used preferentially for each word, the degrees of relevance between the words in the section text corresponding to @eid=115 and the section title corresponding to @eid=116 are “1.0”, “1.0”, “0.333”, and “0.333”. The relevance calculator 26 performs this calculation with respect to each combination of section titles and section texts and stores the calculation results in the structured document DB 21 as a title word relevance table 28 illustrated in
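The merging step in the example above, in which the higher degree of relevance is kept for each word, can be sketched as follows. The pairwise values are the ones quoted in the text. How they are derived from the concept dictionary is not reproduced here, and the names `PAIRWISE` and `title_relevance` are hypothetical.

```python
# Pairwise degrees of relevance quoted in the text, keyed by title word, then body-text word.
PAIRWISE = {
    "LAN": {"LAN": 1.0, "wireless LAN": 0.333, "router": 0.333, "access point": 0.333},
    "wireless LAN": {"LAN": 0.333, "wireless LAN": 1.0, "router": 0.25, "access point": 0.25},
}


def title_relevance(pairwise_by_title_word):
    """For each body-text word, keep the highest relevance over all words in the section title."""
    merged = {}
    for scores in pairwise_by_title_word.values():
        for body_word, score in scores.items():
            merged[body_word] = max(score, merged.get(body_word, 0.0))
    return merged


if __name__ == "__main__":
    print(title_relevance(PAIRWISE))
```

Each merged entry would correspond to one row of the title word relevance table for the section title at @eid=116, under the assumption that the table stores one degree of relevance per title and word.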
Returning to
The search interface unit 29 receives the input of a search keyword and calls the referring unit 30 in order to obtain data that includes a word that is identical to a search keyword designated by query data that includes the received search keyword.
The referring unit 30 accesses the structured document DB 21 to search structured documents that include the search keyword designated by the query data from structured document data 27 and sends a list of section texts that include a word identical to the search keyword to the section title selector 31. For example, when the search keyword is “wireless LAN”, @eid=109, 102, 106, 112, and 115 of the document ID 1 and @eid=202, 205, 208, and 211 of the document ID 2 are hit as the section texts, and the search results are sent to the section title selector 31.
The section title selector 31 selects section titles which have the higher degrees of relevance with the word that is identical to the search keyword more preferentially than section titles which have the lower degrees of relevance and delivers the selection results to the search interface unit 29. As a method of preferentially selecting section titles which have the higher degrees of relevance, a method of not selecting section titles which have small degrees of relevance and selecting only section titles of which the degrees of relevance are on the higher rank may be used. Specifically, first, the section title selector 31 examines, from the title word relevance table 28, the degrees of relevance between the section titles of the respective hit section texts and the word that is identical to the search keyword. As for the search keyword “wireless LAN”, section titles of which the degrees of relevance are higher than “0” are @eid=110 and 116 for the document ID 1, and the section title selector 31 acquires these degrees of relevance. The section title selector 31 selects the top N (for example, two) of the acquired degrees of relevance to determine section titles that are to be displayed in the search results as display section titles. In this case, the section title @eid=110 corresponding to the element ID @eid=109 of the section text of the document ID 1 and the section title @eid=116 corresponding to the element ID @eid=115 of the section text are selected. Moreover, the section title @eid=206 corresponding to the element ID @eid=205 of the section text of the document ID 2 and the section title @eid=209 corresponding to the element ID @eid=208 of the section text are selected. The section title selector 31 sends the selection results to the search interface unit 29.
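The selection described above can be sketched as a per-document top-N filter over the title word relevance table. The data layout, the function name `select_titles`, the element IDs, and the scores below are all hypothetical and are not taken from the figures. Titles with a degree of relevance of zero are skipped, following the example in the text.

```python
from collections import defaultdict


def select_titles(hits, relevance, keyword, top_n=2):
    """Pick, per document, the top_n section titles most relevant to the keyword.

    hits:      iterable of (doc_id, section_eid, title_eid) for section texts that matched
    relevance: {(title_eid, word): degree of relevance}, a stand-in for the relevance table
    """
    by_doc = defaultdict(list)
    for doc_id, section_eid, title_eid in hits:
        score = relevance.get((title_eid, keyword), 0.0)
        if score > 0:  # titles unrelated to the keyword are never candidates
            by_doc[doc_id].append((score, title_eid, section_eid))

    selected = {}
    for doc_id, candidates in by_doc.items():
        candidates.sort(key=lambda c: c[0], reverse=True)  # highest relevance first
        selected[doc_id] = [(title, section) for _, title, section in candidates[:top_n]]
    return selected


if __name__ == "__main__":
    # Invented sample data for illustration only.
    hits = [(1, 10, 11), (1, 20, 21), (1, 30, 31), (2, 40, 41)]
    relevance = {
        (11, "wireless LAN"): 0.25,
        (21, "wireless LAN"): 1.0,
        (31, "wireless LAN"): 0.5,
        (41, "wireless LAN"): 0.0,
    }
    print(select_titles(hits, relevance, "wireless LAN"))
```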
The search interface unit 29 outputs the section titles received from the section title selector 31 to the display unit 107 so that the section titles are displayed.
As another example of the display screen, a display screen illustrated in
The flow of processes of registering and searching structured documents according to this embodiment will be described with reference to
Next, the flow of the process of calculating the degree of relevance between section titles and words in the body text will be described with reference to
Next, the flow of the process in which the section title selector 31 selects section titles during search will be described with reference to
In the structured document management apparatus according to this embodiment, when a section text that includes a word that is identical to the keyword used for search is present, section titles having a high degree of relevance with the search keyword are displayed preferentially. Thus, the user can easily determine whether the information that the user wants to find is included in the document from the presentation section title. When the presentation section title is used, the user does not need to personally read the sentences to determine whether the sentences are close to the content that the user wants to find and thus can immediately understand the location in the structured document at which the information that the user wants to find is located.
The section title selector 31 may select section titles having a predetermined degree of relevance or higher rather than selecting the top N section titles having the higher degrees of relevance. Moreover, the section title selector 31 may select the top N section titles from among those which have a predetermined degree of relevance or higher.
Further, when the presentation section titles are displayed on the display unit, it is not essential to sort the section titles in the order in which they appear within the structured document or to display the top section titles first.
Furthermore, the type of tags that defines section titles and the body text is not limited to that of this embodiment but can be freely set.
Next, a second embodiment of a structured document management apparatus will be described with reference to
The section title selector 31 determines whether the degrees of relevance have been calculated for the section titles of all section texts that each include the word identical to the search keyword (step S403). When the degrees of relevance for all section texts have been calculated (Yes in step S403), the section title selector 31 sorts the section titles of the section texts that each include the word identical to the search keyword in descending order of the degrees of relevance (step S404). On the other hand, when it is determined that the degrees of relevance for all section texts that each include the word identical to the search keyword have not been calculated (No in step S403), the process of step S402 is repeated. The section title selector 31 selects the top N section titles having the higher degrees of relevance and sorts the section titles in the appearance order in which the section titles appear in the structured document (step S405). Moreover, the section title selector 31 determines whether the section titles of all structured documents (in this embodiment, two documents having the document IDs 1 and 2) have been selected (step S406). When the section titles of all structured documents have been selected (Yes in step S406), the section title selector 31 sends the section titles selected and sorted in step S405 to the search interface unit 29 as presentation section titles (step S407) and ends the process. When the section titles of all structured documents have not been selected (No in step S406), the processes starting with step S401 are repeated.
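The flow above can be sketched as follows, with the degrees of relevance computed at search time instead of being read from a stored table. The data layout, the function name `present_titles`, the sample titles, and the scores are hypothetical. Element IDs are assumed to increase in order of appearance, and `relevance_fn` is a placeholder for the actual relevance calculation.

```python
def present_titles(documents, keyword, relevance_fn, top_n=2):
    """Return presentation titles per document, computing relevance only for matching sections.

    documents:    {doc_id: [(title_eid, title, body_text), ...]}
    relevance_fn: callable (title, keyword) -> degree of relevance; a stand-in only
    """
    presentation = []
    for doc_id, sections in documents.items():  # repeated until all documents are handled (S406)
        scored = [
            (relevance_fn(title, keyword), title_eid, title)
            for title_eid, title, body in sections
            if keyword in body  # only section texts that include the keyword are scored
        ]
        scored.sort(key=lambda s: s[0], reverse=True)      # descending relevance (S404)
        top = sorted(scored[:top_n], key=lambda s: s[1])   # top N, back in appearance order (S405)
        presentation.extend((doc_id, eid, title) for _, eid, title in top)
    return presentation  # handed to the search interface (S407)


# Invented scores standing in for a concept-dictionary lookup.
SCORES = {"Wireless LAN settings": 0.5, "Printing": 0.1, "Troubleshooting of wireless LAN": 1.0}

DOCUMENTS = {
    1: [
        (10, "Wireless LAN settings", "Enable the wireless LAN adapter."),
        (20, "Printing", "Print over wireless LAN or USB."),
        (30, "Troubleshooting of wireless LAN", "Restart the wireless LAN router."),
        (40, "Warranty", "Coverage terms."),
    ]
}

if __name__ == "__main__":
    print(present_titles(DOCUMENTS, "wireless LAN", lambda title, kw: SCORES[title]))
```

In this sample the highest-scoring title appears last in the document, so the second sort moves it back behind the earlier title before presentation.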
In this embodiment, since it is not necessary to calculate the degrees of relevance between section titles and words in the body text in advance, the structured document management apparatus may be used even when it is not possible to secure a storage capacity for storing calculation results. Moreover, since it is only necessary to calculate the degrees of relevance between a search keyword and section titles in a section text that includes a word identical to the search keyword, it is possible to suppress the time required for calculation.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2012-057240 | Mar 2012 | JP | national |
This application is a continuation of PCT international application Ser. No. PCT/JP2012/068505 filed on Jul. 20, 2012 which designates the United States, incorporated herein by reference, and which claims the benefit of priority from Japanese Patent Application No. 2012-057240, filed on Mar. 14, 2012, the entire contents of which are incorporated herein by reference.
 | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2012/068505 | Jul 2012 | US |
Child | 13845878 | | US |